bblonder / hypervolume

High dimensional geometry operations - (hypervolume R package)
http://cran.r-project.org/web/packages/hypervolume/index.html

Rely on deques or similar #2

Open Ironholds opened 7 years ago

Ironholds commented 7 years ago

The push_back()-style operations around vectors are vulnerable to slowdowns from repeated reallocations, since vectors are stored in contiguous memory. I'd strongly suggest using (say) deques instead, unless memory is a more important concern than speed. Would you like me to take a stab at implementing it?
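
For illustration, here's the failure mode I mean and two ways around it; this is generic C++, not code from the package:

#include <deque>
#include <vector>

int main() {
    const int n = 1000000;

    // std::vector: whenever capacity runs out, the whole buffer is reallocated
    // and every existing element is moved into the new block.
    std::vector<double> v;
    for (int i = 0; i < n; ++i) v.push_back(i);

    // Option 1: reserve() up front when the final size is known,
    // so the loop never triggers a reallocation.
    std::vector<double> v2;
    v2.reserve(n);
    for (int i = 0; i < n; ++i) v2.push_back(i);

    // Option 2: std::deque grows in fixed-size blocks, so push_back never
    // moves existing elements (at the cost of non-contiguous storage).
    std::deque<double> d;
    for (int i = 0; i < n; ++i) d.push_back(i);

    return 0;
}

reserve() is the cheaper fix when the final size is predictable; deque is the drop-in choice when it isn't.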

davharris commented 7 years ago

@Ironholds: Thanks Oliver! Before you take a stab at it, I think @bblonder and I should figure something out first (below).

@bblonder: I just compared my implementation (#1) versus the one implemented in 2.0.5. Mine spends almost zero time in the kd-tree code, versus 80% for 2.0.5. Any idea what's behind the discrepancy? If we could figure that out, then optimizing the kd-tree code wouldn't be necessary.

davharris commented 7 years ago

@bblonder There are at least two differences between your code and mine.

1) About 99% of the samples fed to the kd-ball-checking algorithm are outside of the SVM's hypervolume, which increases the number of samples the kd code has to handle by ~100x.

2) You're sampling around every data point instead of just around the support vectors. Assuming my radius is reasonable, this is not necessary to encapsulate the hypervolume; it just makes the tree many times bigger. It also increases the rejection rate, which leads to more runs of the KD tree code than would otherwise be needed.

bblonder commented 7 years ago

@davharris,

There is now a 2.0.6 version that goes back to building the elliptical grid using only the SVM points. This dramatically improves speed in the microbenchmarks I tried.

I also changed some point density arguments to improve runtimes at high dimensionality, which addresses point #2. However, I do not see how point #1 works out: the same radius is used in your version and mine. If you have an alternate proposal, please suggest it and we can add it.

@Ironholds,

I am open to using different data structures here. Please feel free to propose a re-implementation using deques. All the relevant code is probably in convert.cpp. If you can demonstrate your new solution gives the same outputs and is faster, then I will add it to the package gratefully and acknowledge your efforts.

Ben

davharris commented 7 years ago

@bblonder In my code, we:

  1. sample from the hypersphere
  2. throw out 99% of the points because they're outside the SVM boundary
  3. build the KD tree

If I understand correctly, you're switching steps 2 and 3 relative to my code, which makes your KD tree need to deal with 100x more data points than mine.
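
A minimal sketch of that ordering, with made-up stand-ins (sample_hypersphere() and inside_svm() are not functions from the package, just placeholders for this example):

#include <cstdio>
#include <random>
#include <vector>

using Point = std::vector<double>;

// Stand-in: uniform draw from the unit ball via rejection from the enclosing cube.
Point sample_hypersphere(int dim, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    while (true) {
        Point p(dim);
        double r2 = 0.0;
        for (double& x : p) { x = u(rng); r2 += x * x; }
        if (r2 <= 1.0) return p;
    }
}

// Stand-in for the fitted SVM's decision function; here just a placeholder boundary.
bool inside_svm(const Point& p) {
    double r2 = 0.0;
    for (double x : p) r2 += x * x;
    return r2 < 0.25;
}

int main() {
    std::mt19937 rng(42);
    const int n_raw = 100000;
    std::vector<Point> kept;
    for (int i = 0; i < n_raw; ++i) {
        Point p = sample_hypersphere(3, rng);  // step 1: sample
        if (inside_svm(p)) kept.push_back(p);  // step 2: reject points outside the boundary
    }
    // step 3: only `kept` (a small fraction of n_raw) would go into the KD tree.
    std::printf("kept %zu of %d raw samples\n", kept.size(), n_raw);
    return 0;
}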

bblonder commented 7 years ago

@davharris,

Yes, that is correct. I made that choice on the principle that evaluating the kd-tree more times would be faster than evaluating the model (SVM or KDE) more times. However, I am open to flipping it around if that is actually faster. I tried to do this quickly today but can't seem to get the volume estimates to come out right (see sample_model_ellipsoid_dave_broken()). If you have time and want to have a look, please feel free to edit this function.

Ben

davharris commented 7 years ago

Cool. Don't have time at the moment, but the first thing I'd check is the denominator. You'll want it to equal the number of full_samples, I think. You can double-check my code in case I'm misremembering.
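
To spell out what I mean (this is my reading of the estimator, so worth double-checking against the code): the rejection-sampling estimate is the sampling region's volume times the accepted fraction, and the fraction's denominator has to be the total number of points drawn, i.e. the full_samples count, not just the survivors. In sketch form:

#include <cstdio>

// Hypothetical helper, not the package's API: Monte Carlo volume via rejection sampling.
double estimate_volume(double sampling_region_volume, long n_accepted, long n_total_drawn) {
    // Denominator = every point drawn, not just the accepted ones.
    return sampling_region_volume * static_cast<double>(n_accepted) / n_total_drawn;
}

int main() {
    // Made-up numbers: 100 of 10000 draws accepted, sampling region volume 4.2.
    std::printf("%f\n", estimate_volume(4.2, 100, 10000));  // prints 0.042000
    return 0;
}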

Dave

Ironholds commented 7 years ago

Should I wait for this discussion to resolve before messing with implementations?

bblonder commented 7 years ago

Hi Oliver,

No, feel free to try a deque implementation now. The discussion Dave and I are having sits at a more abstract level than the problem you are interested in, so the two can probably be solved independently.

Ben

Ironholds commented 7 years ago

Alright, well my machine is a Mac, which can no longer locally install hypervolume* without some pain, so this may have to take an L for a while. But I'd thoroughly recommend literally just swapping out std::vector for std::deque and seeing how it plays.

  • hypervolume requires hitandrun, which requires rcdd, which requires GNU MP (fine) and GCC, even on a Mac (what?!)

bblonder commented 7 years ago

Installing the Developer Tools should include a copy of gcc and let you get started. You might also be able to install a binary version of hitandrun to avoid this issue.

Unfortunately the swap-out is not quite that simple: KDTree.h uses a few calls that std::vector implements but std::deque does not (I sketch the kind of call I mean after the benchmarking script below). I don't have time to fix this up myself right now, but here at least is a benchmarking script you can use to see whether the change helps:

library(microbenchmark)
library(hypervolume)  # provides hypervolume_svm() and hypervolume_gaussian()

# Benchmark both estimators on np points in d dimensions
dohv <- function(np, d) {
  data <- rnorm(np * d)
  data <- matrix(data, ncol = d)
  mb <- microbenchmark(hypervolume_svm(data), hypervolume_gaussian(data), times = 2)
  return(mb)
}

params <- expand.grid(np = c(100, 500, 1000), d = c(2, 3, 4, 5))

results <- lapply(1:nrow(params), function(i) {
  dohv(params$np[i], params$d[i])
})

# Mean runtime per (np, d) combination for each estimator
results_bound <- cbind(params, t(sapply(results, function(x) { tapply(x$time, x$expr, mean) })))
names(results_bound) <- c("np", "d", "svm", "gaussian")
write.csv(results_bound, row.names = FALSE, file = '~/Desktop/hv_vector.csv')
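
As an illustration of the kind of std::vector-only usage I mean (purely hypothetical; I haven't gone back through KDTree.h to check which calls are the actual blockers), code that assumes contiguous storage is a typical one:

#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

// Code written against contiguous storage, e.g. a tight loop over a raw pointer.
double sum(const double* p, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

int main() {
    std::vector<double> v = {1.0, 2.0, 3.0};
    std::printf("%f\n", sum(v.data(), v.size()));  // fine: std::vector guarantees contiguous memory

    std::deque<double> d = {1.0, 2.0, 3.0};
    // sum(d.data(), d.size());  // would not compile: std::deque has no data(),
    //                           // and its storage is not contiguous anyway.
    return 0;
}

Anywhere the tree code does something like that, the deque swap would need either a copy into contiguous storage or a different access pattern.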

Still want to try looking into this?

Ben
