Thie1e / cutpointr

Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
https://cran.r-project.org/package=cutpointr

Update of benchmark data #19

Closed by xrobin 5 years ago

xrobin commented 5 years ago

Dear Christian,

Thanks for the benchmarks in the vignette. I've been looking into why pROC is significantly slower, and was able to identify and fix some of the bottlenecks. I'm planning to release the updated version around the end of the month and will propose a pull request once it is on CRAN, if that's OK with you.

I was wondering whether speed was the only reason to exclude pROC from benchmarks with more than 1e5 observations, or whether memory was also a factor. In the vignette you write:

"OptimalCutpoints and ThresholdROC had to be excluded from benchmarks with more than 1e4 observations and pROC from benchmarks with more than 1e5 observations due to high memory requirements and/or excessive run times"

Do you remember whether memory was a criterion for pROC and, if so, what exactly the criterion was, or whether it was purely a run-time decision? I have been able to run the benchmark with 1e7 data points in pROC without any noticeable memory issue. Here is what it looks like:

[Figures: bench_coords, bench_roc]

PS: in the second plot, you run only the ROCR::prediction function, and not ROCR::performance which is necessary to get the sensitivities and specificities. Is there a reason for that?
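To illustrate the two-step ROCR workflow mentioned in the PS, here is a minimal sketch. It is not the vignette's benchmark code: the simulated data and the variable names (labels, scores) are assumptions made up for illustration. It only shows that prediction() builds the prediction object, while performance() is the call that returns sensitivities and specificities.

```r
# Sketch only: simulated data, not the benchmark code from the vignette.
library(ROCR)

set.seed(1)
n <- 1000
labels <- rbinom(n, size = 1, prob = 0.5)  # hypothetical 0/1 outcome
scores <- rnorm(n, mean = labels)          # hypothetical predictor values

# Step 1: prediction() builds the prediction object from scores and labels.
pred <- prediction(scores, labels)

# Step 2: performance() derives sensitivity (y) and specificity (x)
# for every cutoff stored in the prediction object.
perf <- performance(pred, measure = "sens", x.measure = "spec")

sens    <- perf@y.values[[1]]
spec    <- perf@x.values[[1]]
cutoffs <- perf@alpha.values[[1]]
```

Timing only step 1 leaves out the work needed to obtain the same quantities the other packages return, which is why the comparison would favour ROCR.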

Thie1e commented 5 years ago

Hi,

thanks for letting me know. Looking forward to the update to pROC. And pull requests are of course welcome!

I can't recall if it was a memory or speed issue with pROC. I just ran the benchmarks again and indeed pROC is faster now. I don't know why that is - maybe an update to pROC or to R. It's not as fast as in your benchmarks, but I assume you are already using the updated version.

With OptimalCutpoints and 1e5 observations I still get Error: cannot allocate vector of size 37.2 Gb. ThresholdROC finishes but is very slow (several minutes).

You're right that we should let ROCR calculate sensitivity and specificity in the second benchmark. Below is what the results now look like for me. I'll push the new benchmarks to GitHub and also add the session info. I was planning to update the benchmarks eventually to use the bench package instead of microbenchmark because it also records the total memory allocation.
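As a rough sketch of what a bench-based benchmark could look like (an illustration under assumptions, not the planned benchmark code): the data and the two compared expressions below are made up, and only bench::mark() and its mem_alloc column are taken from the bench package itself.

```r
# Sketch only: toy expressions standing in for the real package calls.
library(bench)

set.seed(1)
n <- 2000
labels <- rbinom(n, size = 1, prob = 0.5)  # hypothetical 0/1 outcome
scores <- rnorm(n, mean = labels)          # hypothetical predictor values
ord <- order(scores, decreasing = TRUE)
thresholds <- scores[ord]

# Two ways to count true positives at every threshold.
# bench::mark() checks by default that both return equal results.
res <- bench::mark(
  vectorised    = cumsum(labels[ord]),
  per_threshold = sapply(thresholds, function(t) sum(labels[scores >= t])),
  iterations = 5
)

# Run time and total memory allocation side by side.
res[, c("expression", "median", "mem_alloc")]
```

The mem_alloc column is what microbenchmark does not provide. It is only filled in when R was built with memory profiling support.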

[Figures: updated benchmark plots (000009, 000005)]

xrobin commented 5 years ago

I assume you are using pROC 1.14.0 from CRAN. It already has some improvements, but the master branch on GitHub is now on par with ROCR. You can try it out with devtools::install_github("xrobin/pROC").

I've made some minor changes in the way pROC is called, especially using the coords function to find the best threshold. That one is very slow in 1.14.0. I'll update the data with the change in ROCR and send a pull request ASAP so you can see what's going on.
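For reference, a minimal sketch of finding the best threshold via coords. This is not the exact call used in the benchmark: the simulated data and variable names are assumptions, and the shape of the returned value differs between pROC versions.

```r
# Sketch only: simulated data, not the benchmark code.
library(pROC)

set.seed(1)
n <- 1000
labels <- rbinom(n, size = 1, prob = 0.5)  # hypothetical 0/1 outcome
scores <- rnorm(n, mean = labels)          # hypothetical predictor values

# Build the ROC curve; levels and direction are given explicitly so that
# pROC does not have to auto-detect them.
r <- roc(response = labels, predictor = scores,
         levels = c(0, 1), direction = "<")

# Ask coords for the threshold that maximises the Youden index.
best <- coords(r, x = "best", ret = "threshold", best.method = "youden")
best
```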

I see OptimalCutpoints is trying to allocate a very large vector. Shouldn't be a problem with pROC then. I'd love to see a memory benchmark though, that would be very interesting!

Thie1e commented 5 years ago

Continuing discussion in #20