Thie1e / cutpointr

Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
https://cran.r-project.org/package=cutpointr

[BUG]: cutpointr() not recognizing character vectors for "x" argument input via lapply() #31

Closed AngelCampos closed 4 years ago

AngelCampos commented 4 years ago

I'm having many issues when trying to create cutpointr models through lapply().

Here is a simple example that I think should work but does not.

In this case, when I pass a character vector to the x argument of cutpointr() via lapply(), the function does not recognize that the object exists.


My workaround is to pass the predictions and the class as vectors instead of as columns of a data.frame, but I would prefer to do it the way I suppose is intended. Is there something that could be done? Maybe it has something to do with tidy evaluation (eval_tidy())?
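A minimal sketch of that vector-based workaround, assuming the suicide data set that ships with cutpointr stands in for the real data (the names vars and fits are made up for illustration):

```r
library(cutpointr)

# Hypothetical variable list; both are columns of the built-in suicide data
vars <- c("dsi", "age")

# Pass plain vectors for x and class instead of column names,
# so no tidy evaluation of the column name is involved
fits <- lapply(vars, function(v) {
    cutpointr(x = suicide[[v]], class = suicide$suicide)
})
names(fits) <- vars
```

Each element of fits is then a separate cutpointr object, one per variable.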

Best regards

Thie1e commented 4 years ago

Hi,

I guess you're right about the source of the error. You probably just have to use !! to unquote iVar. Here's an example that should work:

library(cutpointr)

varlist <- c("dsi", "age")
lapply(varlist, function(var) {
    cutpointr(data = suicide, x = !!var, class = suicide)
})
#> Assuming the positive class is yes
#> Assuming the positive class has higher x values
#> Assuming the positive class is no
#> Assuming the positive class has higher x values
#> [[1]]
#> # A tibble: 1 x 16
#>   direction optimal_cutpoint method          sum_sens_spec      acc sensitivity
#>   <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>
#> 1 >=                       2 maximize_metric       1.75179 0.864662    0.888889
#>   specificity      AUC pos_class neg_class prevalence outcome predictor
#>         <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>   <chr>    
#> 1    0.862903 0.923779 yes       no         0.0676692 suicide dsi      
#>   data               roc_curve          boot 
#>   <list>             <list>             <lgl>
#> 1 <tibble [532 x 2]> <tibble [13 x 10]> NA   
#> 
#> [[2]]
#> # A tibble: 1 x 16
#>   direction optimal_cutpoint method          sum_sens_spec      acc sensitivity
#>   <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>
#> 1 >=                      56 maximize_metric       1.11537 0.199248    0.143145
#>   specificity      AUC pos_class neg_class prevalence outcome predictor
#>         <dbl>    <dbl> <fct>     <fct>          <dbl> <chr>   <chr>    
#> 1    0.972222 0.525678 no        yes         0.932331 suicide age      
#>   data               roc_curve          boot 
#>   <list>             <list>             <lgl>
#> 1 <tibble [532 x 2]> <tibble [61 x 10]> NA

Created on 2020-10-27 by the reprex package (v0.3.0)

By the way, maybe multi_cutpointr can be an alternative here. varList are just columns from tmpData, right? Then, if you would rather have a data.frame instead of a list, you can do

library(cutpointr)

multi_cutpointr(suicide, x = c("age", "dsi"), class = suicide,
                pos_class = "yes")
#> age:
#> Assuming the positive class has lower x values
#> dsi:
#> Assuming the positive class has higher x values
#> # A tibble: 2 x 16
#>   direction optimal_cutpoint method          sum_sens_spec      acc sensitivity
#>   <chr>                <dbl> <chr>                   <dbl>    <dbl>       <dbl>
#> 1 <=                      55 maximize_metric       1.11537 0.199248    0.972222
#> 2 >=                       2 maximize_metric       1.75179 0.864662    0.888889
#>   specificity      AUC pos_class neg_class prevalence outcome predictor
#>         <dbl>    <dbl> <chr>     <fct>          <dbl> <chr>   <chr>    
#> 1    0.143145 0.525678 yes       no         0.0676692 suicide age      
#> 2    0.862903 0.923779 yes       no         0.0676692 suicide dsi      
#>   data               roc_curve          boot 
#>   <list>             <list>             <lgl>
#> 1 <tibble [532 x 2]> <tibble [61 x 10]> NA   
#> 2 <tibble [532 x 2]> <tibble [13 x 10]> NA

Created on 2020-10-27 by the reprex package (v0.3.0)

AngelCampos commented 4 years ago

Yes, I tried multi_cutpointr() first, but the problem I encountered is that when I used the multi-variable models created with multi_cutpointr() to predict on new data with the predict() function, I got the error "C stack usage 19923892 is too close to the limit". I even tried subsetting the new data, as suggested for similar problems with this kind of error, and that didn't work either.

In the end, I worked around it by creating new data frames, changing the $predictor variable name, and so on until it worked. Sadly, I didn't know about "!!" for unquoting. I will try it later on; it would be a more succinct solution.

Is the "C stack usage XXXXX is too close to the limit" error something your users have experienced before, or have you never heard of it? I will document it if I encounter it again. I will try with some built-in data to see if I can reproduce the error, just not today 😜 .

Thanks

AngelCampos commented 4 years ago

Just to close the issue. Yes, the problem was solved using !!.

Any reference you could point me to, to better understand the behavior? I have never used this operator before.

Thie1e commented 4 years ago

That "C stack usage" error is a bit weird. Predicting with multi_cutpointr objects is simply not supported (and probably also won't be supported in the future) and should throw the error no applicable method for 'predict' applied to an object of class "c('multi_cutpointr', 'tbl_df', 'tbl', 'data.frame')". Maybe we can print a more helpful error message there, so thanks for the pointer.
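Since predict() is not available for multi_cutpointr objects, one possible route is to fit one cutpointr object per variable and predict with each of them. A rough sketch, assuming the built-in suicide data and using its first rows as stand-in "new" data (vars, fits, new_data and preds are illustrative names only):

```r
library(cutpointr)

vars <- c("dsi", "age")

# One ordinary cutpointr object per predictor, using !! to unquote the name
fits <- lapply(vars, function(v) {
    cutpointr(data = suicide, x = !!v, class = suicide, pos_class = "yes")
})
names(fits) <- vars

# Stand-in for new data; it has to contain the predictor columns
new_data <- suicide[1:10, ]

# predict() works on the individual cutpointr objects
preds <- lapply(fits, function(fit) predict(fit, newdata = new_data))
```

This yields a list with one vector of predicted classes per variable rather than a single data.frame.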

That !! operator ("bang bang") is a very standard way of unquoting variables in functions that use tidy evaluation. I think the plan was or is to replace it by {{ ("curly-curly"), but both will work in the above example. There's for example a blog post on that (https://www.brodrigues.co/blog/2019-06-20-tidy_eval_saga/), a maybe already superseded vignette on programming with dplyr where !! is mentioned (http://rstudio-pubs-static.s3.amazonaws.com/328769_e8a0152e155b4163b4a54473adcea229.html) and a more technical explanation in Advanced R (https://adv-r.hadley.nz/quasiquotation.html).
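To see what !! does in isolation, here is a small sketch using rlang directly. The function capture() is made up for illustration and only mimics, in a simplified way, how a tidy-evaluation function sees its argument; it is not cutpointr's actual code:

```r
library(rlang)

var <- "dsi"

# Hypothetical helper: returns the expression the caller supplied for x
capture <- function(x) enexpr(x)

quoted   <- capture(var)    # the function sees the symbol `var`
unquoted <- capture(!!var)  # !! evaluates var first, so it sees "dsi"
```

Without !!, the function receives the name var itself and would look for a column called var; with !!, it receives the string stored in var.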

Anyway, glad the function works now.

AngelCampos commented 4 years ago

Thanks for the references. Best