Thie1e / cutpointr

Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
https://cran.r-project.org/package=cutpointr

multiple cutpoints #2

Closed jwijffels closed 4 years ago

jwijffels commented 7 years ago

Hi, I would like to know if you have any plans to make a function to return multiple cutpoints instead of one cutpoint? What would your approach be in selecting multiple cutpoints?

jwijffels commented 7 years ago

What I meant is that if you have a predictor y, that it finds several cutpoints alongside y to make a good classification of the outcome. Instead of having only one cutpoint.

Thie1e commented 7 years ago

I see. We should distinguish multiple "optimal" and multiple "good" cutpoints.

In the case of multiple optimal cutpoints, cutpointr currently issues a warning. That warning is actually issued by the method function; specifically, only the minimize_metric and maximize_metric functions emit it. The oc_OptimalCutpoints wrapper has a break-ties argument so that, for example, the mean of the optimal cutpoints is returned (the default). oc_youden_kernel and oc_youden_normal don't lead to multiple optimal cutpoints.

Possible ways of handling multiple optimal cutpoints are:

When breaking ties using the mean or median, note that the returned "optimal" cutpoint may not actually be optimal: it may lead to a metric value below the optimal one. That is why maximize_metric and minimize_metric return the maximum or minimum of the tied optimal cutpoints instead. In practice we don't regard this issue as particularly important, but I'd like to hear opinions.
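To illustrate that caveat, here is a base-R sketch (a stand-in metric function, not cutpointr internals) using the same toy data that appears later in this thread: two cutpoints tie at the maximal sum of sensitivity and specificity, and their mean scores strictly worse.

```r
# Toy data (also used in an example later in this thread)
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
x <- 1:8

# Sum of sensitivity and specificity for the rule "positive if x >= cut"
sum_sens_spec <- function(cut, x, y) {
  pred <- x >= cut
  sens <- sum(pred & y == 1) / sum(y == 1)
  spec <- sum(!pred & y == 0) / sum(y == 0)
  sens + spec
}

metrics <- sapply(x, sum_sens_spec, x = x, y = y)
optimal <- x[metrics == max(metrics)]
optimal                             # cutpoints 4 and 6 are tied at 1.75
sum_sens_spec(mean(optimal), x, y)  # their mean (5) only achieves 1.5
```

Breaking ties with the minimum or maximum instead guarantees that the returned cutpoint actually attains the optimal metric value.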

Concerning the handling of other "good" cutpoints: we currently have no plans to return these. Is there much demand for that? I assume the idea is to search for an optimal cutpoint using maximize_metric or minimize_metric and then get the, say, 5 next best cutpoints in an additional tibble column along with the corresponding metric values.
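As a rough sketch of what that could look like (purely hypothetical, not an implemented cutpointr feature), one could rank all candidate cutpoints by the metric and keep the runners-up:

```r
# Hypothetical "next best cutpoints" sketch -- not a cutpointr feature
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
x <- 1:8

# Accuracy of the rule "positive if x >= cut" at every candidate cutpoint
acc <- sapply(x, function(cut) mean((x >= cut) == (y == 1)))

ranked <- data.frame(cutpoint = x, accuracy = acc)
ranked <- ranked[order(ranked$accuracy, decreasing = TRUE), ]
next_best <- head(ranked[-1, ], 5)  # the 5 runners-up after the best cutpoint
next_best
```

Such a ranking could then be stored as a list column next to optimal_cutpoint.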

We'd probably need at least one additional argument to do that:

I'll leave this issue open for a while and I'm interested in other opinions. Thanks everyone.

Thie1e commented 6 years ago

With version 0.6.0, cutpointr gained the option to return multiple optimal cutpoints. The new break_ties argument specifies whether all optimal cutpoints should be returned or whether they should be summarized, e.g. using mean or median. With break_ties = c, all optimal cutpoints are returned and the optimal_cutpoint column becomes a list column.

> dat <- data.frame(y = c(0,0,0,1,0,1,1,1), x = 1:8)
> cutpointr(dat, x = x, class = y, break_ties = c, pos_class = 1, direction = ">=")
Multiple optimal cutpoints found
# A tibble: 1 x 15
  direction optimal_cutpoint method          sum_sens_spec acc      
  <chr>     <list>           <chr>                   <dbl> <list>   
1 >=        <dbl [2]>        maximize_metric          1.75 <dbl [2]>
  sensitivity specificity   AUC pos_class neg_class prevalence outcome
  <list>      <list>      <dbl>     <dbl>     <dbl>      <dbl> <chr>  
1 <dbl [2]>   <dbl [2]>   0.938      1.00         0      0.500 y      
  predictor data             roc_curve            
  <chr>     <list>           <list>               
1 x         <tibble [8 × 2]> <data.frame [9 × 10]> 
jwijffels commented 6 years ago

Thank you. I'll try it out!

jgarces02 commented 4 years ago

Hi @Thie1e,

Sorry for reviving this issue, but I'm not sure whether break_ties = c should always return multiple cutpoints... in my case only one appears.

cutpointr(data = dff2, x = var, class = c_PFS, metric = accuracy,
          method = maximize_boot_metric, summary_func = median,
          boot_cut = 100, boot_stratify = T, boot_runs = 100,
          break_ties = c)

# A tibble: 1 x 16
  direction optimal_cutpoint method               accuracy      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>       <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>  
1 >=                 0.85742 maximize_boot_metric 0.683735 0.683735    0.138889    0.946429 0.585627 1         0           0.325301 c_PFS  
  predictor data               roc_curve          boot 
  <chr>     <list>             <list>             <lgl>
1 var       <tibble [332 x 2]> <tibble [208 x 9]> NA   

Thanks for your help!

Thie1e commented 4 years ago

Hi,

with method = maximize_boot_metric you won't get multiple optimal cutpoints, because the returned optimal cutpoint is (in the above example) the median of all optimal cutpoints that were calculated in the 100 (= boot_cut) bootstrap samples.

There may have been multiple optimal cutpoints in some of the bootstrap samples, but these just contribute to the median.
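The mechanics can be sketched in base R (a conceptual stand-in with simulated data, not cutpointr's actual code): each bootstrap sample yields exactly one cutpoint (ties are resolved within the sample), and only the summary of those cutpoints is returned.

```r
set.seed(1)
# Simulated data standing in for the real predictor / class
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
y <- rep(0:1, each = 50)

# Accuracy of the rule "positive if x >= cut"
acc <- function(cut, x, y) mean((x >= cut) == (y == 1))

# Best cutpoint within one sample; ties are summarized inside the sample
best_cut <- function(x, y) {
  cand <- sort(unique(x))
  a <- sapply(cand, acc, x = x, y = y)
  mean(cand[a == max(a)])
}

# boot_cut = 100: one cutpoint per bootstrap sample...
cuts <- replicate(100, {
  i <- sample(length(x), replace = TRUE)
  best_cut(x[i], y[i])
})
# ...and only their summary (cf. summary_func = median) is returned
median(cuts)
```

So even if individual bootstrap samples contain ties, they can only shift the single summarized cutpoint, never produce multiple ones.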

I'm rather wondering why the boot column is NA in the output, because it should be a tibble with 100 rows as specified by boot_runs. Does the above call really return an NA there? If so, can you post the data somewhere? Running a similar call on my machine returns the bootstrap data correctly.

jgarces02 commented 4 years ago

Yes, you're right, it's a bit illogical to get all cutpoints from a bootstrap, totally agree (sorry for such a silly question).

Regarding your question... no, this time no NA was returned. I don't know what happened, but I ran the code again and the boot column was formed correctly (:dizzy_face:):

> cutpointr(data = dff2, x = var, class = c_PFS, metric = accuracy,
+           method = maximize_boot_metric, summary_func = median,
+           boot_cut = 100, boot_stratify = T, boot_runs = 100,
+           break_ties = c)
Assuming the positive class is 1
Assuming the positive class has higher x values
Running bootstrap...
# A tibble: 1 x 16
  direction optimal_cutpoint method               accuracy      acc sensitivity specificity      AUC pos_class neg_class prevalence outcome
  <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>       <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>  
1 >=                    0.78 maximize_boot_metric 0.704735 0.704735    0.155963       0.944 0.623431 1         0           0.303621 c_PFS  
  predictor data               roc_curve          boot               
  <chr>     <list>             <list>             <list>             
1 pctCTCs   <tibble [359 x 2]> <tibble [209 x 9]> <tibble [100 x 23]>

So, thanks anyway for your help!

Thie1e commented 4 years ago

OK, glad to hear that. And don't worry, it's not a stupid question.

It does make a subtle difference, because break_ties still applies to the individual cutpoints of every bootstrap repetition, so the bootstrapped cutpoint may differ depending on break_ties, even if a seed was set. I just don't think it makes a substantial difference, especially if boot_cut is large enough.