Closed · davharris closed this 7 years ago
Hi Dave,
This is great. I have now taken this code base and adapted it into both the Gaussian and SVM functions. There are some nice performance improvements, though rejection sampling is still faster for 'easy' cases. I also found a unified solution for thresholding the Gaussian hypervolumes: it can now be done based on either volume or probability quantiles. See the revised hypervolume_threshold function (formerly hypervolume_quantile_threshold). Please give it a spin with a few unit tests and let me know if you run into any problems. The new code is version 2.0.5 on GitHub.
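A sketch of what the unified thresholding call might look like, based only on the description above (the argument names quantile.requested and quantile.requested.type are assumptions, not checked against 2.0.5; see the package docs before relying on them):

hv = hypervolume_gaussian(as.matrix(iris[ , 1:3]))
# Threshold by volume quantile (argument names assumed)
hypervolume_threshold(hv, quantile.requested = 0.95,
                      quantile.requested.type = "volume")
# Or by probability quantile
hypervolume_threshold(hv, quantile.requested = 0.95,
                      quantile.requested.type = "probability")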
I will be away from email until Monday.
Ben
On Wed, May 24, 2017 at 12:39 PM, David J. Harris notifications@github.com wrote:
I've implemented hyper-ellipse sampling based on SVMs, as we discussed. It should be straightforward to use the same code for KDE.
On my machine, it's about 3x faster than hyperboxes in 4 dimensions and 13x faster in 5 dimensions (although part of the difference has to do with memory thrashing and would disappear on a box with more memory).
It also returns the same volume estimate as the current implementation, and (based on visual inspection) the same boundaries.
I'm not especially comfortable with S4, so I haven't integrated this code with the package, but I think it should be relatively straightforward.
Let me know if you have any questions.
devtools::load_all() # Assumes that the package root is the working directory
# Functions ---------------------------------------------------------------

sample_ellipsoid = function(center, n, scales) {
  k = length(center)
  points = matrix(rnorm(n * k), nrow = n)    # Start with Gaussian noise
  points = points / sqrt(rowSums(points^2))  # Project to the sphere surface
  radii = runif(n)^(1 / k)                   # Each point gets a radius, prior to scaling
  for (i in 1:k) {
    points[ , i] = points[ , i] * radii * scales[i] + center[i]
  }
  points
}
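A quick sanity check on sample_ellipsoid (my own sketch; the sample size and cutoffs are arbitrary): every draw should satisfy the ellipsoid inequality, and the fraction of draws inside the half-scale ellipsoid should be about (1/2)^k if the interior is filled uniformly.

set.seed(1)
scales = c(1, 2, 3)
pts = sample_ellipsoid(center = c(0, 0, 0), n = 1e5, scales = scales)
d2 = rowSums(sweep(pts^2, 2, scales^2, "/"))  # squared distance in unit-ball units
all(d2 <= 1)       # TRUE: every draw lies inside the ellipsoid
mean(d2 <= 0.25)   # roughly (1/2)^3 = 0.125 under uniform filling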
ellipsoid_volume = function(scales) {
  nball_volume(n = length(scales), r = 1) * prod(scales)
}
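nball_volume() comes from the package source pulled in by devtools::load_all() above. To run this file standalone, a minimal stand-in using the standard n-ball formula V_n(r) = pi^(n/2) * r^n / Gamma(n/2 + 1) would work (the interface, dimension n and radius r, is inferred from the call above):

# Minimal stand-in for the package's nball_volume()
nball_volume = function(n, r) pi^(n / 2) * r^n / gamma(n / 2 + 1)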
ellipsoid_inverse_weight = function(samples, centers, scales, verbose) {
  # Returns the number of center-based ellipsoids that contain each sampled point

  # Rescale so each ellipsoid becomes a unit ball
  for (i in 1:ncol(centers)) {
    samples[ , i] = samples[ , i] / scales[i]
    centers[ , i] = centers[ , i] / scales[i]
  }

  # Count the overlap with the unit balls
  tree <- kdtree_build(centers, verbose = verbose)
  kdtree_ball_query_multiple(tree, t(samples), nrow(samples), ncol(samples),
                             r = 1, verb = verbose)
}
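kdtree_build() and kdtree_ball_query_multiple() are compiled helpers from the package source. On small inputs, their counts can be reproduced with a brute-force stand-in (my own sketch, handy for unit-testing the weighting step):

ellipsoid_inverse_weight_bruteforce = function(samples, centers, scales) {
  # Rescale both matrices so each ellipsoid becomes a unit ball
  for (i in 1:ncol(centers)) {
    samples[ , i] = samples[ , i] / scales[i]
    centers[ , i] = centers[ , i] / scales[i]
  }
  # For each sample, count the centers within unit distance,
  # i.e. the number of ellipsoids containing that sample
  apply(samples, 1, function(s) sum(rowSums(sweep(centers, 2, s)^2) <= 1))
}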
my_svm_hypervolume = function(X, svm.gamma = .5, svm.nu = .01, verbose = TRUE,
                              initial_samples_per_point = 1E5) {
  svm = e1071::svm(X, y = NULL, type = 'one-classification', gamma = svm.gamma,
                   nu = svm.nu, scale = TRUE, kernel = "radial")

  # Assuming all of the support vectors were right on top of one another,
  # how far could we move away from that point before we left the interior
  # of the hypervolume?  Solve b * exp(-g * d_2) == r for d_2, where
  #   * "b" is the sum of the SVM coefficients
  #   * "g" is the SVM kernel bandwidth (gamma)
  #   * "r" is the minimum kernel density of the hypervolume (rho)
  #   * "d_2" is the squared distance
  squared_scaled_dist = log(sum(svm$coefs) / svm$rho) / svm.gamma
  scales = apply(X, 2, sd) * sqrt(squared_scaled_dist)

  # Collect samples from ellipsoids around each support vector
  full_samples = do.call(rbind, lapply(svm$index, function(i) {
    sample_ellipsoid(X[i, ], initial_samples_per_point, scales = scales)
  }))

  # Discard samples from outside the hypervolume. This can be done repeatedly
  # to save memory if the analysis is run in chunks, as long as a running total
  # of the number of full_samples is retained.
  samples = full_samples[predict(svm, full_samples), , drop = FALSE]

  # Discard samples from regions that were over-sampled. This step should
  # only be performed once.
  iweight = ellipsoid_inverse_weight(samples, centers = X[svm$index, ],
                                     scales = scales, verbose = verbose)

  # Scale the weights so that the largest value is 1
  weight = 1 / (iweight / min(iweight))
  included = as.logical(rbinom(length(iweight), size = 1, prob = weight))

  # Retained samples
  samples = samples[included, , drop = FALSE]

  # Average weight of the samples, where samples outside the SVM boundary
  # are assigned zero weight
  mean_weight = sum(1 / iweight) / nrow(full_samples)

  # Total volume: the volume of one ellipsoid, times the number of ellipsoids,
  # times the average sample weight
  Volume = ellipsoid_volume(scales) * svm$tot.nSV * mean_weight

  # full_samples and svm are only used for plotting below & can probably be
  # discarded by the package
  list(full_samples = full_samples, samples = samples, Volume = Volume, svm = svm)
}
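The kernel-distance algebra in the comments above can be spot-checked numerically (the constants below are arbitrary):

# If b * exp(-g * d2) == r, then d2 == log(b / r) / g
b = 2.5; g = 0.5; r = 0.8
d2 = log(b / r) / g
all.equal(b * exp(-g * d2), r)  # TRUE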
# Example analysis --------------------------------------------------------

X = as.matrix(iris[ , c(1:2, 2:4)])  # five columns (column 2 repeated) for a 5-D test

# Fit both models
hv_time = system.time({hv = hypervolume_svm(X, samples.per.point = 100)})
my_time = system.time({hv2 = my_svm_hypervolume(X)})
# Plot the hypervolume in green, the unused samples in black, and the original
# data in blue. The support vectors are the big dots; other data points are X's.
with(hv2, {
  plot(full_samples[1:nrow(samples) %% 100 == 0, ], asp = 1, pch = ".")
  points(samples, col = "green", pch = ".")
  points(X, col = "blue", pch = 4)
  points(X[svm$index, ], col = "blue", pch = 16)
})
# Acceptance rate
nrow(hv2$samples) / nrow(hv2$full_samples)

# Effective rate for ellipsoids (retained samples per second)
c(round(nrow(hv2$samples) / my_time[3]), use.names = FALSE)

# Effective rate for boxes (retained samples per second)
c(round(nrow(hv@RandomUniformPointsThresholded) / hv_time[3]), use.names = FALSE)

# Ratio of the two volume estimates; should be close to 1.00
hv@Volume / hv2$Volume
Great.