Closed xrobin closed 5 years ago
Hi, this happens because by default break_ties = c
, so if multiple optimal cutpoints are found, all of them are returned. The metric values in e.g. sum_sens_spec_b
correspond to the individual optimal cutpoints, so they become list columns, too.
In your example, there are multiple optimal cutpoints in bootstrap repetition 72:
opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 4 2
...and they both lead to the same in-sample metric value, as expected:
opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.759109 1.759109
You can avoid multiple optimal cutpoints by setting break_ties = mean
or break_ties = median
for example:
set.seed(102)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 100,
silent = TRUE, allowParallel = T, break_ties = mean)
opt_cut_b %>% select(boot) %>% unnest
# A tibble: 100 x 23
optimal_cutpoint AUC_b AUC_oob sum_sens_spec_b sum_sens_spec_o… acc_b acc_oob sensitivity_b
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 0.945 0.875 1.79 1.68 0.880 0.833 0.917
2 2 0.959 0.877 1.83 1.68 0.867 0.874 0.971
3 2 0.937 0.905 1.76 1.75 0.897 0.829 0.861
4 2 0.927 0.955 1.72 1.87 0.867 0.873 0.854
5 2 0.965 0.849 1.83 1.62 0.870 0.841 0.969
6 2 0.909 0.976 1.73 1.88 0.863 0.890 0.865
7 2 0.960 0.830 1.81 1.54 0.865 0.868 0.949
8 2 0.908 0.959 1.73 1.82 0.836 0.897 0.9
9 2 0.935 0.958 1.75 1.85 0.874 0.856 0.875
10 2 0.891 0.951 1.70 1.75 0.872 0.836 0.829
# … with 90 more rows, and 15 more variables: sensitivity_oob <dbl>, specificity_b <dbl>,
# specificity_oob <dbl>, kappa_b <dbl>, kappa_oob <dbl>, TP_b <dbl>, FP_b <dbl>, TN_b <int>, FN_b <int>,
# TP_oob <dbl>, FP_oob <dbl>, TN_oob <int>, FN_oob <int>, roc_curve_b <list>, roc_curve_oob <list>
Multiple optimal cutpoints can of course be found regardless of allowParallel
, e.g. this returns multiple optimal cutpoints when allowParallel = F
:
set.seed(222)
opt_cut_b <- cutpointr(suicide, dsi, suicide, boot_runs = 1000,
silent = TRUE, allowParallel = F)
opt_cut_b %>% select(boot) %>% unnest
The results differ with allowParallel = F
and allowParallel = T
, however, with the same seed, because the random number generator is different when parallelization is activated. In that case we use the doRNG
package for reproducible parallel foreach
loops.
I see. I guess the main problem is that this breaks some of the pipelines described in the README file and the vignette.
For instance the example in "Accessing data, roc_curve, and boot":
set.seed(123)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 20)
will quickly turn into something rather useless with only minor and unpredictable changes:
set.seed(222)
opt_cut <- cutpointr(suicide, dsi, suicide, boot_runs = 100)
summary(opt_cut$boot[[1]]$optimal_cutpoint)
Min. 1st Qu. Median Mean 3rd Qu. Max. IQR Valid NA's
[1,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[2,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[3,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[4,] 2 2.00 2.0 2.0 2.00 2 0.0 1 0
[5,] 4 4.00 4.0 4.0 4.00 4 0.0 1 0
[6,] 3 3.00 3.0 3.0 3.00 3 0.0 1 0
[...]
Maybe that can be fixed by changing the default break_ties
to something that actually breaks them as you suggest?
This behavior is basically a compromise:
If we break ties per default by mean
or median
, this will sometimes lead to suboptimal cutpoints. That's the main reason why the default is break_ties = c
. For example, with the above code and break_ties = mean
we'll get sensitivity + specificity = 1.735 instead of 1.759:
> opt_cut_b$boot[[1]]$optimal_cutpoint[[72]]
[1] 3
> opt_cut_b$boot[[1]]$sum_sens_spec_b[[72]]
[1] 1.734818
The drawback is obviously that some columns are not type stable with break_ties = c
. I agree that functions like summary
don't work well with nested tibbles or list columns out of the box.
By the way, I cannot reproduce the output above. Maybe a difference in package or R versions? For me it looks like this:
summary(opt_cut_b$boot[[1]]$optimal_cutpoint)
[...]
Length Class Mode
[...]
[70,] 1 -none- numeric
[71,] 1 -none- numeric
[72,] 2 -none- numeric
[73,] 1 -none- numeric
[74,] 1 -none- numeric
[75,] 1 -none- numeric
[...]
At least there are no misleading numeric results here.
The question is, if we rather want type stability of all columns or mathematically optimal cutpoints in all scenarios. After thinking about this, the only easily implementable option for the former that I can imagine would be to change the default to break_ties = median
(thus still returning the optimal metric value if there is an odd number of optimal cutpoints) and to print a message if multiple optimal cutpoints are found and the function in break_ties
is applied.
I can see the same thing as you after upgrading the following packages:
1: assertthat (0.2.0 -> 0.2.1) [CRAN] 2: cli (1.0.1 -> 1.1.0) [CRAN] 3: colorspace (1.4-0 -> 1.4-1) [CRAN] 4: ggplot2 (3.1.0 -> 3.1.1) [CRAN] 5: glue (1.3.0 -> 1.3.1) [CRAN] 6: gtable (0.2.0 -> 0.3.0) [CRAN] 7: lazyeval (0.2.1 -> 0.2.2) [CRAN] 8: purrr (0.3.0 -> 0.3.2) [CRAN] 9: Rcpp (1.0.0 -> 1.0.1) [CRAN] 10: rlang (0.3.1 -> 0.3.4) [CRAN] 11: stringi (1.3.1 -> 1.4.3) [CRAN] 12: tibble (2.0.1 -> 2.1.1) [CRAN] 13: tidyr (0.8.2 -> 0.8.3) [CRAN]
One of them must have changed something.
I agree the mean or median isn't a good option, it would give you values that don't actually exist. A warning could be nice but will only get you so far. So I guess that's just the intended behavior, nothing to worry about just something weird. Thanks for the help!
I tried to check which package might have caused the different output of summary
but could not figure it out. I don't think it has to do with any of the above packages. If we look at methods(summary)
there's no summary.list
that might cause a different output (class(opt_cut$boot[[1]]$optimal_cutpoint)
is simply list
). The output I have shown comes from summary.default
.
Maybe you had loaded a different package that has a summary.list
function?
Hum... I can see a few packages providing summary.list, but none that would've been loaded (or even installed).
OK, thanks for looking into this again. I checked and it's not the function from plink
. Anyway, there's probably not much we could do if a summary.list
function was loaded by the user.
Consider the following code:
This outputs a tibble with columns as expected:
Now at times with certain seeds:
Notice how the columns are now holding
<list>
s of<dbl [1]>
This does not seem to occur with
allowParallel = FALSE
.