evalclass / precrec

An R library for accurate and fast calculations of Precision-Recall and ROC curves
https://evalclass.github.io/precrec
GNU General Public License v3.0

Option for reporting less data points #1

Closed: mehrankr closed this issue 7 years ago

mehrankr commented 7 years ago

Thanks a lot for this very useful and well documented package.

I've noticed that the number of data points in both the ROC and precision-recall curves equals the length of the input vectors. This slows plotting when datasets are large.

I think that, by default, evalmod should consider only the unique values in the "scores" argument. In addition, an option to decrease or increase the resolution would be helpful.

Thanks

takayasaito commented 7 years ago

Thank you very much for your suggestions.

  1. Using unique scores: Tied scores must be treated properly to calculate accurate ROC and precision-recall curves. For instance, the evalmod function provides the ties_method option to control how they are treated. Considering only unique scores is therefore not a good way to solve the slow plotting issue, because the calculated curves would be inaccurate.

  2. Changing resolution: Trimming supporting points to a certain resolution is most likely an effective way to speed up plotting. It is definitely feasible to enhance the package in this way, but I still need to look into it.

  3. Alternative solution: It is often faster to make plots with plot instead of ggplot (via autoplot). I made a test script and ran it in three different environments. In this test scenario, plot and ggplot perform similarly on OSX, but plot is much faster than ggplot on Linux and Windows.

# Test code
library(precrec)
library(ggplot2)
samp1 <- create_sim_samples(5, 50000, 50000)
eval1 <- evalmod(scores = samp1$scores, labels = samp1$labels)
system.time(autoplot(eval1))
system.time(plot(eval1))
# Linux - i7, 3.4GHz, 16 GB
> system.time(autoplot(eval1))
   user  system elapsed 
  8.169   0.079   8.489 
> system.time(plot(eval1))
   user  system elapsed 
  0.681   0.015   0.699
# Windows - AMD A4, 1.8 GHz, 4 GB
> system.time(autoplot(eval1))
   user  system elapsed 
  31.09    6.94   45.56 
> system.time(plot(eval1))
   user  system elapsed 
  11.94    6.96   19.48 
# OSX - i5, 2.4 GHz, 4 GB
> system.time(autoplot(eval1))
   user  system elapsed 
 13.369   1.769  15.935 
> system.time(plot(eval1))
   user  system elapsed 
 14.090   0.268  14.516 
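
To illustrate point 1 above, here is a minimal base-R sketch of why dropping duplicated scores changes the curve. It does not use precrec's internal algorithm: roc_points and auc_trapz are hypothetical helpers written only for this example, and the toy scores and labels are made up. Tied scores are treated as a single threshold step, and the result is compared with a naive deduplication that keeps one observation per unique score.

```r
# Hypothetical helper (not part of precrec): ROC points where all
# observations sharing a score enter the curve at the same threshold.
roc_points <- function(scores, labels) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  tpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 1) / n_pos)
  fpr <- sapply(thresholds, function(t) sum(scores >= t & labels == 0) / n_neg)
  # Prepend the origin so the curve starts at (0, 0)
  data.frame(fpr = c(0, fpr), tpr = c(0, tpr))
}

# Hypothetical helper: trapezoidal area under the curve
auc_trapz <- function(curve) {
  n <- nrow(curve)
  sum(diff(curve$fpr) * (curve$tpr[-1] + curve$tpr[-n]) / 2)
}

# Toy data with tied scores (made up for illustration)
scores <- c(0.9, 0.8, 0.8, 0.8, 0.5, 0.5, 0.2)
labels <- c(1,   1,   0,   1,   0,   1,   0)

# All observations kept: every tied observation contributes to the counts
auc_full <- auc_trapz(roc_points(scores, labels))

# Naive deduplication: keep only the first observation per unique score
keep <- !duplicated(scores)
auc_dedup <- auc_trapz(roc_points(scores[keep], labels[keep]))

print(c(full = auc_full, dedup = auc_dedup))
```

On this toy data the deduplicated input yields a noticeably higher AUC than the full input, because the discarded tied observations carried labels that the curve never sees.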

takayasaito commented 7 years ago

I updated autoplot to reduce the number of supporting points according to the x_bins argument of the evalmod function. Points are reduced by default when plotting with ggplot2.

I'll include this update in v0.7.0.

# Test code
library(precrec)
library(ggplot2)
samp1 <- create_sim_samples(5, 50000, 50000)
eval1 <- evalmod(scores = samp1$scores, labels = samp1$labels)
system.time(autoplot(eval1))
system.time(autoplot(eval1, reduce_points = FALSE))
# Linux - i7, 3.4GHz, 16 GB
> system.time(autoplot(eval1))
   user  system elapsed 
  0.594   0.000   0.626 
> system.time(autoplot(eval1, reduce_points = FALSE))
   user  system elapsed 
  8.496   0.000   8.520
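
The idea of reducing supporting points to a fixed number of x bins can be sketched in base R as follows. This is an assumption about the general technique, not precrec's actual implementation: thin_curve is a hypothetical helper, the curve data are simulated, and x is assumed to lie in [0, 1] and to be sorted in increasing order.

```r
# Hypothetical helper (not precrec's internal code): keep the first point
# that falls into each of x_bins equal-width bins on [0, 1], plus the
# final point so the curve still ends where the original does.
thin_curve <- function(x, y, x_bins = 1000) {
  stopifnot(length(x) == length(y), length(x) > 0, x_bins >= 1)
  bin <- floor(x * x_bins)       # bin index of every point
  keep <- !duplicated(bin)       # first point seen in each bin
  keep[length(x)] <- TRUE        # always retain the end point
  data.frame(x = x[keep], y = y[keep])
}

# Simulated monotone curve with 100,000 points (illustration only)
set.seed(1)
n <- 100000
x <- sort(runif(n))
y <- cumsum(runif(n))
y <- y / max(y)

thinned <- thin_curve(x, y, x_bins = 1000)
print(nrow(thinned))
```

Keeping only the first point per bin is the simplest choice; it can miss sharp vertical changes inside a bin, so a smaller bin width (larger x_bins) trades plotting speed for fidelity.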