Thie1e / cutpointr

Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
https://cran.r-project.org/package=cutpointr

Review / Change bootstrapping routine #12

Closed: Thie1e closed this issue 5 years ago

Thie1e commented 5 years ago

Especially with imbalanced data sets that contain a low absolute number of observations of one of the two classes, some bootstrap samples will not contain observations of both classes and the cutpoint optimization cannot be run. There are several ways to deal with that. Currently, cutpointr uses option 1:

  1. If a bootstrap sample contains only one class, redraw until a sample is drawn that contains both classes.
  2. If a bootstrap sample contains only one class, return NA for all results of that bootstrap repetition. We did that in an older version, but it raises the question of how to deal with the results later, e.g. for plotting. Since many results may be missing, the plots of distributions may be misleading (based on a very low number of repetitions). We issued warnings in that case, but the constant warnings are confusing.
  3. Sample with replacement separately from the positive and negative observations (stratified bootstrapping). This has the advantage that every resample contains both classes. However, the prevalence (= fraction of positive observations) is then constant across resamples.
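The three options above can be sketched in a few lines. This is a minimal illustration in plain Python, not cutpointr's or pROC's code: the function name `draw_bootstrap`, the `scheme` labels and the `max_tries` cap are all hypothetical, and the original data is assumed to contain both classes.

```python
import random

def draw_bootstrap(labels, rng, scheme="redraw", max_tries=1000):
    """Return resampled indices, or None if the sample is unusable.

    Hypothetical helper; labels are 0/1 and must contain both classes.
    scheme="redraw"     -> option 1: redraw until both classes are present
    scheme="na"         -> option 2: one attempt, None means "record NA"
    scheme="stratified" -> option 3: resample each class separately
    """
    if scheme not in ("redraw", "na", "stratified"):
        raise ValueError(f"unknown scheme: {scheme}")
    n = len(labels)
    if scheme == "stratified":
        pos = [i for i, y in enumerate(labels) if y == 1]
        neg = [i for i, y in enumerate(labels) if y == 0]
        # Class counts (and hence prevalence) are identical in every resample.
        return rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
    tries = max_tries if scheme == "redraw" else 1
    for _ in range(tries):
        idx = rng.choices(range(n), k=n)
        if len({labels[i] for i in idx}) == 2:  # both classes drawn
            return idx
    return None  # caller stores NA for this repetition

# Usage: 2 positives among 30 observations, 200 repetitions per scheme.
labels = [1, 1] + [0] * 28
rng = random.Random(1)
for scheme in ("redraw", "na", "stratified"):
    samples = [draw_bootstrap(labels, rng, scheme) for _ in range(200)]
    n_missing = sum(s is None for s in samples)
    print(scheme, "repetitions without a usable sample:", n_missing)
```

With option 2 the caller has to carry the `None` results through to summaries and plots, which is the bookkeeping problem described in the list.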

My impression is that option 3 leads to worse confidence intervals than option 1. The cutpointr:::simple_boot function supports both schemes. To switch to option 3, the argument in simple_boot needs to be set and the code that runs before it is called needs to be edited (some lines that option 3 requires are currently commented out). A simulation study (different distributions of predictor values, different metrics) to check the coverage probabilities of confidence intervals from options 1 and 3 would be helpful here.
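A rough outline of what such a coverage check could look like, again in plain Python and under strong simplifying assumptions: normally distributed predictor values, a fixed cutpoint instead of an optimized one, sensitivity as the only metric, and percentile intervals. All names and parameter values are hypothetical, and a real study would vary the distributions and metrics and include the cutpoint optimization itself.

```python
import random
from statistics import NormalDist

CUT = 0.5                                   # fixed, hypothetical cutpoint
TRUE_SENS = 1 - NormalDist(1, 1).cdf(CUT)   # positives ~ N(1, 1), negatives ~ N(0, 1)

def simulate(n, prev, rng):
    """Draw predictor values x and labels y until both classes are present."""
    while True:
        y = [1 if rng.random() < prev else 0 for _ in range(n)]
        if 0 < sum(y) < n:
            x = [rng.gauss(1.0 if yi else 0.0, 1.0) for yi in y]
            return x, y

def sensitivity(x, y, idx):
    """Fraction of resampled positives with a value above the cutpoint."""
    pos = [x[i] for i in idx if y[i] == 1]
    return sum(v > CUT for v in pos) / len(pos)

def resample(y, rng, stratified):
    n = len(y)
    if stratified:  # option 3: resample each class separately
        pos = [i for i in range(n) if y[i] == 1]
        neg = [i for i in range(n) if y[i] == 0]
        return rng.choices(pos, k=len(pos)) + rng.choices(neg, k=len(neg))
    while True:     # option 1: redraw one-class samples
        idx = rng.choices(range(n), k=n)
        if 0 < sum(y[i] for i in idx) < n:
            return idx

def percentile_ci(values, level=0.95):
    """Simple percentile interval from the sorted bootstrap statistics."""
    s = sorted(values)
    lo = s[int((1 - level) / 2 * (len(s) - 1))]
    hi = s[int((1 + level) / 2 * (len(s) - 1))]
    return lo, hi

def coverage(stratified, n_sim=100, n_boot=200, n=50, prev=0.2, seed=0):
    """Share of simulated data sets whose interval contains TRUE_SENS."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x, y = simulate(n, prev, rng)
        stats = [sensitivity(x, y, resample(y, rng, stratified))
                 for _ in range(n_boot)]
        lo, hi = percentile_ci(stats)
        hits += lo <= TRUE_SENS <= hi
    return hits / n_sim

print("option 1 (redraw):    ", coverage(stratified=False))
print("option 3 (stratified):", coverage(stratified=True))
```

Comparing the two printed proportions against the nominal 0.95 over many settings would show whether one scheme systematically under- or over-covers.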

xrobin commented 5 years ago

Just seeing this issue now. If it can help, what we do in pROC by default is option 3. I remember looking into it a long time ago, and the constant balance wasn't a big deal.

We also offer the option to do non-stratified bootstrap and in this case we go with option 2. As it's not the default I assume users will know how to deal with possibly fewer repetitions.

The only disadvantage I can see with option 1 is the speed, as you may have to restart more bootstrap runs later.

Hope it helps!

Thie1e commented 5 years ago

Hi, thanks for the input and sorry for the late answer. We'll keep the default of non-stratified bootstrapping and report the number of missing values (if any) in summary and after the bootstrapping (option 2). This speeds up the bootstrapping when the data set is small and imbalanced, because we don't redraw samples. I also preferred to keep non-stratified bootstrapping as the default, because changes here are rather invisible to the user. We have an additional boot_stratify argument to switch to option 3.

I'd still be interested in differences between stratified and non-stratified bootstrapping for confidence intervals of metric values, optimal cutpoints, AUC and so on, but probably won't look into it any more. In a quick search I only found this assessment regarding the point estimates from Carolyn M. Rutter, Bootstrap estimation of diagnostic accuracy with patient-clustered data, Academic Radiology, Volume 7, Issue 6 (2000):

"Bootstrap samples were constructed by stratifying patients on overall disease state (any or none) and then drawing patients (ie, the independent units) with replacement from these strata. Resampling patient-level data incorporates all sources of within-patient variability. Stratifying the bootstrap samples by patient-level disease state corresponds to conditioning on true disease state. Disease-state stratification was used to ensure that all the accuracy statistics examined (ie, sensitivity, specificity, AUC) were estimable. Because each of these statistics conditions on disease state, stratified sampling does not bias the point estimates."
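For sensitivity at a fixed cutpoint, the quoted argument can be written out as follows. This is a sketch in my own notation, not taken from the paper: $x_1, \dots, x_{n_1}$ are the predictor values of the $n_1$ positive observations, $c$ is the cutpoint, and starred quantities refer to a stratified bootstrap sample, which always draws exactly $n_1$ positives with replacement.

```latex
\widehat{Se} = \frac{1}{n_1} \sum_{i=1}^{n_1} \mathbf{1}\{x_i > c\},
\qquad
\mathrm{E}^{*}\!\left[\widehat{Se}^{*}\right]
  = \frac{1}{n_1} \sum_{j=1}^{n_1} \mathrm{E}^{*}\!\left[\mathbf{1}\{x_j^{*} > c\}\right]
  = \widehat{Se}
```

Each resampled positive $x_j^{*}$ is drawn uniformly from the observed positives, so its indicator has expectation $\widehat{Se}$. The same reasoning applies to specificity with the negatives. It does not by itself say anything about an optimized cutpoint or about interval coverage.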

So, anyway, we'll keep the non-stratified bootstrap because a change here is invisible to the user and because stratification can be switched on with boot_stratify now.