Thie1e / cutpointr

Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
https://cran.r-project.org/package=cutpointr
84 stars 13 forks source link

Columns nested in the `boot` column become columns of lists with allowParallel = TRUE and certain seeds #18

Closed xrobin closed 5 years ago

xrobin commented 5 years ago

Consider the following code:

library(doSNOW)
library(dplyr)
library(tidyr)
cl <- makeCluster(2) # 2 cores
registerDoSNOW(cl)

set.seed(101)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                       silent = TRUE, allowParallel = TRUE)
opt_cut_b %>% select(boot) %>% unnest %>% head

This outputs a tibble with columns as expected:

# A tibble: 6 x 23
  optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b sensitivity_oob
             <dbl> <dbl>   <dbl>           <dbl>            <dbl> <dbl>   <dbl>         <dbl>           <dbl>
1                2 0.927   0.920            1.76             1.73 0.883   0.833         0.872           0.9  
2                4 0.960   0.894            1.90             1.51 0.940   0.857         0.966           0.632
3                2 0.905   0.956            1.73             1.75 0.850   0.862         0.886           0.889
4                2 0.958   0.881            1.84             1.66 0.874   0.854         0.969           0.8  
5                3 0.927   0.947            1.77             1.68 0.898   0.876         0.867           0.8  
6                2 0.931   0.954            1.76             1.80 0.853   0.873         0.912           0.929
# … with 14 more variables: specificity_b <dbl>, specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>,
#   FP_b <dbl>, TN_b <int>, FN_b <int>, TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>,
#   roc_curve_oob <list>

Now at times with certain seeds:

set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                       silent = TRUE, allowParallel = TRUE)
opt_cut_b %>% select(boot) %>% unnest

Notice how the columns are now holding <list>s of <dbl [1]>

# A tibble: 100 x 23
   optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b sensitivity_oob
   <list>           <dbl>   <dbl> <list>          <list>           <lis> <list>  <list>        <list>         
 1 <dbl [1]>        0.945   0.875 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 2 <dbl [1]>        0.959   0.877 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 3 <dbl [1]>        0.937   0.905 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 4 <dbl [1]>        0.927   0.955 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 5 <dbl [1]>        0.965   0.849 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 6 <dbl [1]>        0.909   0.976 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 7 <dbl [1]>        0.960   0.830 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 8 <dbl [1]>        0.908   0.959 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
 9 <dbl [1]>        0.935   0.958 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
10 <dbl [1]>        0.891   0.951 <dbl [1]>       <dbl [1]>        <dbl… <dbl [… <dbl [1]>     <dbl [1]>      
# … with 90 more rows, and 14 more variables: specificity_b <list>, specificity_oob <list>, kappa_b <list>,
#   kappa_oob <list>, TP_b <list>, FP_b <list>, TN_b <list>, FN_b <list>, TP_oob <list>, FP_oob <list>, TN_oob <list>,
#   FN_oob <list>, roc_curve_b <list>, roc_curve_oob <list>

This does not seem to occur with allowParallel = FALSE.

Thie1e commented 5 years ago

Hi, this happens because by default break_ties = c, so if multiple optimal cutpoints are found, all of them are returned. The metric values in e.g. sum_sens_spec_b correspond to the individual optimal cutpoints, so they become list columns, too.

In your example, there are multiple optimal cutpoints in bootstrap repetition 72:

opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 4 2

...and they both lead to the same in-sample metric value, as expected:

opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.759109 1.759109

You can avoid multiple optimal cutpoints by setting break_ties = mean or break_ties = median for example:

set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
                       silent = TRUE, allowParallel = T, break_ties = mean)
opt_cut_b %>% select(boot) %>% unnest

# A tibble: 100 x 23
   optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b
              <dbl> <dbl>   <dbl>           <dbl>            <dbl> <dbl>   <dbl>         <dbl>
 1                2 0.945   0.875            1.79             1.68 0.880   0.833         0.917
 2                2 0.959   0.877            1.83             1.68 0.867   0.874         0.971
 3                2 0.937   0.905            1.76             1.75 0.897   0.829         0.861
 4                2 0.927   0.955            1.72             1.87 0.867   0.873         0.854
 5                2 0.965   0.849            1.83             1.62 0.870   0.841         0.969
 6                2 0.909   0.976            1.73             1.88 0.863   0.890         0.865
 7                2 0.960   0.830            1.81             1.54 0.865   0.868         0.949
 8                2 0.908   0.959            1.73             1.82 0.836   0.897         0.9  
 9                2 0.935   0.958            1.75             1.85 0.874   0.856         0.875
10                2 0.891   0.951            1.70             1.75 0.872   0.836         0.829
# … with 90 more rows, and 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
#   specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>,
#   TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>

Multiple optimal cutpoints can of course be found regardless of allowParallel, e.g. this returns multiple optimal cutpoints when allowParallel = F:

set.seed(222)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
                       silent = TRUE, allowParallel = F)
opt_cut_b %>% select(boot) %>% unnest

The results differ with allowParallel = F and allowParallel = T, however, with the same seed, because the random number generator is different when parallelization is activated. In that case we use the doRNG package for reproducible parallel foreach loops.

xrobin commented 5 years ago

I see. I guess the main problem is that this breaks some of the pipelines described in the README file and the vignette.

For instance the example in "Accessing data, roc_curve, and boot":

set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 20)

will quickly turn into something rather useless with only minor and unpredictable changes:

set.seed(222)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 100)
summary(opt_cut$boot[[1]]$optimal_cutpoint)
       Min. 1st Qu. Median Mean 3rd Qu. Max. IQR Valid NA's
  [1,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [2,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [3,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [4,]    2    2.00    2.0  2.0    2.00    2 0.0     1    0
  [5,]    4    4.00    4.0  4.0    4.00    4 0.0     1    0
  [6,]    3    3.00    3.0  3.0    3.00    3 0.0     1    0
[...]

Maybe that can be fixed by changing the default break_ties to something that actually breaks them as you suggest?

Thie1e commented 5 years ago

This behavior is basically a compromise:

If we break ties per default by mean or median, this will sometimes lead to suboptimal cutpoints. That's the main reason why the default is break_ties = c. For example, with the above code and break_ties = mean we'll get sensitivity + specificity = 1.735 instead of 1.759:

> opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 3
> opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.734818

The drawback is obviously that some columns are not type stable with break_ties = c. I agree that functions like summary don't work well with nested tibbles or list columns out of the box.

By the way, I cannot reproduce the output above. Maybe a difference in package or R versions? For me it looks like this:

summary(opt_cut_b$boot[[1]]$optimal_cutpoint)
[...]
       Length Class  Mode   
[...]
[70,] 1      -none- numeric
[71,] 1      -none- numeric
[72,] 2      -none- numeric
[73,] 1      -none- numeric
[74,] 1      -none- numeric
[75,] 1      -none- numeric
[...]

At least there are no misleading numeric results here.

The question is, if we rather want type stability of all columns or mathematically optimal cutpoints in all scenarios. After thinking about this, the only easily implementable option for the former that I can imagine would be to change the default to break_ties = median (thus still returning the optimal metric value if there is an odd number of optimal cutpoints) and to print a message if multiple optimal cutpoints are found and the function in break_ties is applied.

xrobin commented 5 years ago

I can see the same thing as you after upgrading the following packages:

1: assertthat (0.2.0 -> 0.2.1) [CRAN] 2: cli (1.0.1 -> 1.1.0) [CRAN] 3: colorspace (1.4-0 -> 1.4-1) [CRAN] 4: ggplot2 (3.1.0 -> 3.1.1) [CRAN] 5: glue (1.3.0 -> 1.3.1) [CRAN] 6: gtable (0.2.0 -> 0.3.0) [CRAN] 7: lazyeval (0.2.1 -> 0.2.2) [CRAN] 8: purrr (0.3.0 -> 0.3.2) [CRAN] 9: Rcpp (1.0.0 -> 1.0.1) [CRAN] 10: rlang (0.3.1 -> 0.3.4) [CRAN] 11: stringi (1.3.1 -> 1.4.3) [CRAN] 12: tibble (2.0.1 -> 2.1.1) [CRAN] 13: tidyr (0.8.2 -> 0.8.3) [CRAN]

One of them must have changed something.

I agree the mean or median isn't a good option, it would give you values that don't actually exist. A warning could be nice but will only get you so far. So I guess that's just the intended behavior, nothing to worry about just something weird. Thanks for the help!

Thie1e commented 5 years ago

I tried to check which package might have caused the different output of summary but could not figure it out. I don't think it has to do with any of the above packages. If we look at methods(summary) there's no summary.list that might cause a different output (class(opt_cut$boot[[1]]$optimal_cutpoint) is simply list). The output I have shown comes from summary.default.

Maybe you had loaded a different package that has a summary.list function?

xrobin commented 5 years ago

Hum... I can see a few packages providing summary.list, but none that would've been loaded (or even installed).

Thie1e commented 5 years ago

OK, thanks for looking into this again. I checked and it's not the function from plink. Anyway, there's probably not much we could do if a summary.list function was loaded by the user.