bblonder / hypervolume

High dimensional geometry operations - (hypervolume R package)
http://cran.r-project.org/web/packages/hypervolume/index.html

hyper-ellipse sampling #1

Closed davharris closed 7 years ago

davharris commented 7 years ago

I've implemented hyper-ellipse sampling based on SVMs, as we discussed. It should be straightforward to use the same code for KDE.

On my machine, it's about 3x faster than hyperboxes in 4 dimensions and 13x faster in 5 dimensions (although part of the difference is due to memory thrashing and would disappear on a machine with more memory).

It also returns the same volume estimate as the current implementation, and (based on visual inspection) the same boundaries.

I'm not especially comfortable with S4, so I haven't integrated this code with the package, but I think it should be relatively straightforward.

Let me know if you have any questions.

devtools::load_all() # Assumes that the package root is the working directory

# Functions ---------------------------------------------------------------

sample_ellipsoid = function(center, n, scales) {
  k = length(center)
  points = matrix(rnorm(n * k), nrow = n)   # Start with noise
  points = points / sqrt(rowSums(points^2)) # Project to sphere surface
  radii = runif(n, 0)^(1/k) # Each point has a radius, prior to scaling
  for (i in 1:k) {
    points[,i] = points[,i] * radii * scales[i] + center[i]
  }
  points
}
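
# (Illustrative check; assumes sample_ellipsoid() as defined above. The
# sphere-surface projection plus the radius ~ U^(1/k) draw give points uniform
# inside the ellipsoid, so every sample scaled back to the unit ball must have
# squared norm <= 1.)
check_pts = sample_ellipsoid(center = c(0, 0, 0), n = 1e4, scales = c(1, 2, 3))
stopifnot(all(rowSums(sweep(check_pts, 2, c(1, 2, 3), "/")^2) <= 1 + 1e-8))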

ellipsoid_volume = function(scales) {
  nball_volume(n = length(scales), r = 1) * prod(scales)
}
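
# (Illustrative check; assumes the package's nball_volume() returns the volume
# of the unit n-ball, in which case a three-dimensional ellipsoid should match
# the familiar (4/3) * pi * a * b * c formula.)
all.equal(ellipsoid_volume(c(1, 2, 3)), (4 / 3) * pi * 1 * 2 * 3)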

ellipsoid_inverse_weight = function(samples, centers, scales, verbose) {
  # Returns the number of center-based ellipsoids that contain each sampled point

  # scale to unit ball
  for (i in 1:ncol(centers)) {
    samples[ , i] = samples[ , i] / scales[i]
    centers[ , i] = centers[ , i] / scales[i]
  }

  # Count the overlap with unit ball
  tree <- kdtree_build(centers, verbose = verbose)
  kdtree_ball_query_multiple(tree, t(samples), 
                             nrow(samples), ncol(samples), 
                             r = 1, verb = verbose)
}
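
# (Illustrative brute-force equivalent; not needed by the analysis below, but it
# spells out what the kd-tree query is counting: for each sample, the number of
# scaled centers within unit Euclidean distance.)
ellipsoid_inverse_weight_slow = function(samples, centers, scales) {
  s = sweep(samples, 2, scales, "/")
  ctr = sweep(centers, 2, scales, "/")
  apply(s, 1, function(p) sum(sqrt(colSums((t(ctr) - p)^2)) <= 1))
}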

my_svm_hypervolume = function(X, svm.gamma = .5, svm.nu = .01, verbose = TRUE,
                              initial_samples_per_point = 1E5){
  svm = e1071::svm(X,
                   y=NULL,
                   type='one-classification',
                   gamma=svm.gamma,
                   nu=svm.nu,
                   scale=TRUE,
                   kernel="radial") 

  # Assuming all of the support vectors were right on top of each other,
  # how far could we move away from that point before we left the interior
  # of the hypervolume? 
  # Plug `solve b * exp(-g d_2) == r for d_2` into Wolfram Alpha, where
  #    * "b" is the sum of the SVM coefficients
  #    * "g" is the SVM kernel bandwidth (gamma)
  #    * "r" is the minimum kernel density of the hypervolume (rho)
  #    * "d_2" is squared distance
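  # Solving gives d_2 = log(b / r) / g, which is what the next line computes.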
  squared_scaled_dist = log(sum(svm$coefs) / svm$rho) / svm.gamma
  scales = apply(X, 2, sd) * sqrt(squared_scaled_dist)

  # Collect samples from ellipsoids around each support vector
  full_samples = do.call(rbind,
                         lapply(svm$index,
                                function(i){
                                  sample_ellipsoid(X[i, ], 
                                                   initial_samples_per_point, 
                                                   scales = scales)
                                }))

  # Discard samples from outside the hypervolume. To save memory, this filtering
  # can be done in chunks as the samples are generated, as long as a running total
  # of the number of rows in full_samples is retained (see the sketch after this
  # function).
  samples = full_samples[predict(svm, full_samples), , drop = FALSE]

  # Discard samples from regions that were over-sampled. This step should
  # only be performed once.
  iweight = ellipsoid_inverse_weight(samples, centers = X[svm$index, ], 
                                     scales = scales, verbose = verbose)
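  # A point covered by m ellipsoids is proposed roughly m times as often as a
  # point covered by only one, so accepting it with probability proportional to
  # 1/m restores a uniform density over the union of ellipsoids.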
  # scale weights so that the largest value is 1
  weight = 1/(iweight / min(iweight))

  included = as.logical(rbinom(length(iweight), size = 1, prob = weight))

  # Retained samples
  samples = samples[included, , drop = FALSE]

  # This is the average weight of the samples, where samples outside of the 
  # SVM are assigned zero weight.
  mean_weight = sum(1/iweight) / nrow(full_samples)
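  # (Why this estimator works: proposals are uniform draws from tot.nSV
  # equal-volume ellipsoids, so the proposal density at a point covered by m
  # ellipsoids is m / (tot.nSV * V_ellipsoid). The mean of 1/m over proposals
  # that land inside the SVM boundary therefore estimates
  # Volume / (tot.nSV * V_ellipsoid), which the next lines multiply back out.)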

  # Total volume is the volume of one ellipsoid, times the number of ellipsoids,
  # times the mean weight computed above.
  Volume = ellipsoid_volume(scales) * svm$tot.nSV * mean_weight

  # full_samples and svm are only used for plotting below & can probably be 
  # discarded by the package.
  list(full_samples = full_samples, samples = samples, Volume = Volume,
       svm = svm)
}
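
# (Sketch only: one way to implement the chunked filtering mentioned inside
# my_svm_hypervolume. The loop below is illustrative, not part of the original
# code, and assumes the objects svm, X, scales, and initial_samples_per_point
# from inside that function.)
# n_full = 0
# kept = list()
# for (i in svm$index) {
#   chunk = sample_ellipsoid(X[i, ], initial_samples_per_point, scales = scales)
#   n_full = n_full + nrow(chunk)  # running total of proposals
#   kept[[length(kept) + 1]] = chunk[predict(svm, chunk), , drop = FALSE]
# }
# samples = do.call(rbind, kept)  # n_full then plays the role of nrow(full_samples)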

# Example analysis --------------------------------------------------------

X = as.matrix(iris[,c(1:2, 2:4)])
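# (Note: c(1:2, 2:4) selects iris columns 1, 2, 2, 3, 4, so Sepal.Width appears
# twice and X has five columns.)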

# Fit both models
hv_time = system.time({
  hv = hypervolume_svm(X, samples.per.point = 100)
})
my_time = system.time({
  hv2 = my_svm_hypervolume(X)})

# Plot the hypervolume in green, the unused samples in black, the original data
# in blue. The support vectors are the big dots; other data points are X's
with(hv2, {
  plot(full_samples[1:nrow(full_samples) %% 100 == 0, ], asp = 1, pch = ".")
  points(samples, col = "green", pch = ".")
  points(X, col = "blue", pch = 4)
  points(X[svm$index, ], col = "blue", pch = 16)
})

# Acceptance rate
nrow(hv2$samples) / nrow(hv2$full_samples)

# Effective rate for ellipsoids (retained samples per second)
c(round(nrow(hv2$samples) / my_time[3]), use.names = FALSE)

# Effective rate for boxes (retained samples per second)
c(round(nrow(hv@RandomUniformPointsThresholded) / hv_time[3]), use.names = FALSE)

# Should be close to 1.00
hv@Volume / hv2$Volume
bblonder commented 7 years ago

Hi Dave,

This is great. I have now taken this code base and adapted it into both the Gaussian and SVM functions. There are some nice performance improvements, though rejection sampling is still faster for 'easy' cases. I also found a unified solution for thresholding the Gaussian hypervolumes: it can now be done based on either volume or probability quantiles; see the revised hypervolume_threshold function, which replaces hypervolume_quantile_threshold. Please give it a spin with a few unit tests and let me know if you run into any problems. The new code is version 2.0.5 on GitHub.
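
For reference, a minimal sketch of how the unified thresholding might be called; the function name hypervolume_gaussian and the arguments quantile.requested and quantile.requested.type are assumptions about the 2.0.5 interface rather than a confirmed signature, so check ?hypervolume_threshold for the actual arguments:

hv_gauss = hypervolume_gaussian(as.matrix(iris[, 1:4]))

# keep the smallest region containing 95% of the total probability mass
# (argument names assumed, see the note above)
hv_thresh = hypervolume_threshold(hv_gauss, quantile.requested = 0.95,
                                  quantile.requested.type = "probability")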

I will be away from email until Monday.

Ben


davharris commented 7 years ago

Great.