Closed jwijffels closed 4 years ago
What I meant is that, given a predictor y, the function would find several cutpoints along y that together classify the outcome well, instead of only one cutpoint.
I see. We should distinguish multiple "optimal" and multiple "good" cutpoints.
In the case of multiple optimal cutpoints, cutpointr currently issues a warning. Issuing that warning is actually handled by the method function. Specifically, only the minimize_metric and maximize_metric functions issue such warnings.
The oc_OptimalCutpoints wrapper has a break ties argument so that, for example, the mean of the optimal cutpoints is returned (default). oc_youden_kernel and oc_normal don't lead to multiple optimal cutpoints.
Possible ways of handling multiple optimal cutpoints are:
- Change cutpointr so that method functions can return multiple optimal cutpoints. Then cutpointr would need an additional break_ties argument or similar to automatically select one of those cutpoints. Also, I'm not sure how best to return multiple cutpoints. If cutpointr selects one optimal cutpoint among the optimal ones, the output would need to be augmented by one column that includes a vector of optimal cutpoints in its elements. In any case, I'd like cutpointr to always return a single number in the optimal_cutpoint column.
- Have cutpointr handle multiple returned cutpoints but "throw away" alternative optimal cutpoints and only break the ties.
- Extend maximize_metric and minimize_metric by a break_ties argument to select a function for handling multiple optimal cutpoints (as in oc_OptimalCutpoints).
- Let the minimize_metric and maximize_metric functions handle multiple optimal cutpoints. The idea is that the method functions could be used separately without the cutpointr function.
When breaking ties using mean or median, note that the returned "optimal" cutpoint may not actually be optimal. In other words, that cutpoint may lead to a metric value that is below the optimal one. That is why maximize_metric and minimize_metric return the minimum or maximum of the optimal cutpoints. In general, we don't regard this issue as particularly important in the real world. I'd like to hear opinions, though.
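To make the point about mean or median tie-breaking concrete, here is a minimal base R sketch on a small toy data set. The helper sum_sens_spec_at is hypothetical and written only for this illustration; it is not part of cutpointr and does not reflect how the package computes its metrics internally.

```r
# Toy data: binary outcome y and numeric predictor x (illustrative values).
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
x <- 1:8

# Hypothetical helper, not part of cutpointr: sensitivity + specificity
# when observations with x >= cutpoint are classified as positive (y == 1).
sum_sens_spec_at <- function(cutpoint, x, y) {
  pred_pos <- x >= cutpoint
  sens <- sum(pred_pos & y == 1) / sum(y == 1)
  spec <- sum(!pred_pos & y == 0) / sum(y == 0)
  sens + spec
}

# Evaluate the metric at every observed predictor value.
candidates <- sort(unique(x))
metric <- vapply(candidates, sum_sens_spec_at, numeric(1), x = x, y = y)

# All cutpoints that reach the maximum metric value (the ties).
optimal <- candidates[metric == max(metric)]
print(optimal)                              # 4 6

# Breaking ties with mean() gives a value between the tied optima ...
tie_broken <- mean(optimal)                 # 5
# ... whose metric value is lower than the optimum in this example.
print(sum_sens_spec_at(tie_broken, x, y))   # 1.5
print(max(metric))                          # 1.75
```

Here the two tied cutpoints 4 and 6 both reach 1.75, while their mean, 5, only reaches 1.5. Taking min(optimal) or max(optimal) instead always returns one of the truly optimal cutpoints.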
Concerning the handling of other "good" cutpoints: Currently we don't have plans to return these. Is there much demand for that? I assume the idea is to search for an optimal cutpoint using maximize_metric or minimize_metric and then get the additional, say, 5 next best cutpoints in an additional tibble column along with the corresponding metric value.
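As a rough illustration of what "next best" cutpoints could look like, the following base R sketch ranks all candidate cutpoints by their metric value and keeps the top few, including the optimal ones. The helper names and the n_best argument are assumptions made up for this example; nothing like this exists in cutpointr as far as this thread states.

```r
# Toy data: binary outcome y and numeric predictor x (illustrative values).
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
x <- 1:8

# Hypothetical helper: sensitivity + specificity for the rule x >= cutpoint.
sum_sens_spec_at <- function(cutpoint, x, y) {
  pred_pos <- x >= cutpoint
  sens <- sum(pred_pos & y == 1) / sum(y == 1)
  spec <- sum(!pred_pos & y == 0) / sum(y == 0)
  sens + spec
}

# Hypothetical helper: the n_best candidate cutpoints, ranked by metric value
# (best first), together with their metric values.
best_cutpoints <- function(x, y, n_best = 5) {
  candidates <- sort(unique(x))
  metric <- vapply(candidates, sum_sens_spec_at, numeric(1), x = x, y = y)
  ranked <- order(metric, decreasing = TRUE)
  keep <- ranked[seq_len(min(n_best, length(ranked)))]
  data.frame(cutpoint = candidates[keep], metric = metric[keep])
}

top <- best_cutpoints(x, y, n_best = 5)
print(top)
```

In a tibble-based output, such a data frame could be stored as one element of a list column next to the optimal cutpoint.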
We'd probably need at least one additional argument to do that:
I'll leave this issue open for a while and I'm interested in other opinions. Thanks everyone.
With version 0.6.0, cutpointr was enhanced by the option to return multiple optimal cutpoints. The new break_ties argument specifies whether all optimal cutpoints should be returned or whether they should be summarized, e.g. using mean or median. If break_ties = c is used, all optimal cutpoints will be returned and the optimal_cutpoint column becomes a list.
> dat <- data.frame(y = c(0,0,0,1,0,1,1,1), x = 1:8)
> cutpointr(dat, x = x, class = y, break_ties = c, pos_class = 1, direction = ">=")
Multiple optimal cutpoints found
# A tibble: 1 x 15
direction optimal_cutpoint method sum_sens_spec acc
<chr> <list> <chr> <dbl> <list>
1 >= <dbl [2]> maximize_metric 1.75 <dbl [2]>
sensitivity specificity AUC pos_class neg_class prevalence outcome
<list> <list> <dbl> <dbl> <dbl> <dbl> <chr>
1 <dbl [2]> <dbl [2]> 0.938 1.00 0 0.500 y
predictor data roc_curve
<chr> <list> <list>
1 x <tibble [8 × 2]> <data.frame [9 × 10]>
Thank you. I'll try it out!
Hi @Thie1e,
Sorry for continuing this issue, but I'm not sure whether break_ties = c should always return multiple cutpoints. In my case only one appears.
cutpointr(data = dff2, x = var, class = c_PFS, metric = accuracy,
method = maximize_boot_metric, summary_func = median,
boot_cut = 100, boot_stratify = T, boot_runs = 100,
break_ties = c)
# A tibble: 1 x 16
direction optimal_cutpoint method accuracy acc sensitivity specificity AUC pos_class neg_class prevalence outcome
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <chr>
1 >= 0.85742 maximize_boot_metric 0.683735 0.683735 0.138889 0.946429 0.585627 1 0 0.325301 c_PFS
predictor data roc_curve boot
<chr> <list> <list> <lgl>
1 var <tibble [332 x 2]> <tibble [208 x 9]> NA
Thanks for your help!
Hi,
with method = maximize_boot_metric you won't get multiple optimal cutpoints, because the returned optimal cutpoint is (in the above example) the median of all optimal cutpoints that were calculated in the 100 (= boot_cut) bootstrap samples.
There may have been multiple optimal cutpoints in some of the bootstrap samples, but these just contribute to the median.
I'm rather wondering why the boot column is NA in the output, because it should be a tibble with 100 rows as specified by boot_runs. Does the above call really return an NA there? If so, can you post the data somewhere? Running a similar call on my machine returns the bootstrap data correctly.
Yes, you're right, it would be a bit illogical to get all cutpoints from a bootstrap, I totally agree (sorry for such a stupid question).
Regarding your question: no, no NA was returned. I don't know what happened, but I ran the code again and the boot column was formed correctly (:dizzy_face:):
> cutpointr(data = dff2, x = var, class = c_PFS, metric = accuracy,
+ method = maximize_boot_metric, summary_func = median,
+ boot_cut = 100, boot_stratify = T, boot_runs = 100,
+ break_ties = c)
Assuming the positive class is 1
Assuming the positive class has higher x values
Running bootstrap...
# A tibble: 1 x 16
direction optimal_cutpoint method accuracy acc sensitivity specificity AUC pos_class neg_class prevalence outcome
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <fct> <dbl> <chr>
1 >= 0.78 maximize_boot_metric 0.704735 0.704735 0.155963 0.944 0.623431 1 0 0.303621 c_PFS
predictor data roc_curve boot
<chr> <list> <list> <list>
1 pctCTCs <tibble [359 x 2]> <tibble [209 x 9]> <tibble [100 x 23]>
So, thanks anyway for your help!
OK, glad to hear that. And don't worry, it's not a stupid question.
It makes a subtle difference, because break_ties still applies to the individual cutpoints of every bootstrap repetition, so the bootstrapped cutpoint may differ depending on break_ties, even if a seed was set. I just don't think that it makes a substantial difference, especially if boot_cut is large enough.
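The interplay between tie-breaking and bootstrapping can be sketched in base R as follows. This is only an illustration of the idea described above, not cutpointr's implementation of maximize_boot_metric: all helper names are hypothetical, and the fixed list of resampling indices stands in for setting a seed.

```r
# Hypothetical helper: sensitivity + specificity for the rule x >= cutpoint.
sum_sens_spec_at <- function(cutpoint, x, y) {
  pred_pos <- x >= cutpoint
  sens <- sum(pred_pos & y == 1) / sum(y == 1)
  spec <- sum(!pred_pos & y == 0) / sum(y == 0)
  sens + spec
}

# Hypothetical helper: all cutpoints that maximize the metric (may be several).
optimal_cutpoints <- function(x, y) {
  candidates <- sort(unique(x))
  metric <- vapply(candidates, sum_sens_spec_at, numeric(1), x = x, y = y)
  candidates[metric == max(metric)]
}

# Break ties within every bootstrap repetition, then summarise across
# repetitions (median by default).
boot_cutpoint <- function(x, y, boot_idx, break_ties, summary_func = median) {
  per_rep <- vapply(boot_idx, function(idx) {
    as.numeric(break_ties(optimal_cutpoints(x[idx], y[idx])))
  }, numeric(1))
  summary_func(per_rep)
}

set.seed(1)
y <- c(0, 0, 0, 1, 0, 1, 1, 1)
x <- 1:8

# Stratified resampling so that both classes occur in every repetition.
# The same index sets are reused for every tie-breaking function.
boot_idx <- replicate(
  100,
  c(sample(which(y == 0), replace = TRUE), sample(which(y == 1), replace = TRUE)),
  simplify = FALSE
)

cut_min  <- boot_cutpoint(x, y, boot_idx, break_ties = min)
cut_mean <- boot_cutpoint(x, y, boot_idx, break_ties = mean)
cut_max  <- boot_cutpoint(x, y, boot_idx, break_ties = max)
print(c(min = cut_min, mean = cut_mean, max = cut_max))
```

Because the resamples are identical across the three calls, any difference between the results comes from the tie-breaking function alone.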
Hi, I would like to know if you have any plans to make a function to return multiple cutpoints instead of one cutpoint? What would your approach be in selecting multiple cutpoints?