Thie1e / cutpointr

Optimal cutpoints in R: determining and validating optimal cutpoints in binary classification
https://cran.r-project.org/package=cutpointr

Added Jaccard and Log-Loss #24

Closed muschellij2 closed 4 years ago

muschellij2 commented 4 years ago

I added Jaccard Index (https://en.wikipedia.org/wiki/Jaccard_index), which can be calculated from F1, but I figured it'd be good to have.
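The relationship mentioned above can be sketched as follows (an illustrative check with made-up counts, not the PR's code):

```r
# Illustrative check: with confusion counts tp, fp, fn, the Jaccard index
# can be recovered from F1 via Jaccard = F1 / (2 - F1).
tp <- 40; fp <- 10; fn <- 10
f1 <- 2 * tp / (2 * tp + fp + fn)   # 0.8
jaccard <- tp / (tp + fp + fn)      # 0.666...
all.equal(jaccard, f1 / (2 - f1))   # TRUE
```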

I also added log-loss, but it's a little odd given that the values are thresholded.

Thie1e commented 4 years ago

Hi, thanks a lot! I don't have a lot of time to check the additions now, but I'll get back to you. Best, Christian.

Thie1e commented 4 years ago

Hi, I've now checked the additions.

I think adding the Jaccard index is a good idea, so I'm going to merge that and write a test. I'm also going to make the function name lowercase, because the other metric functions are lowercase, too.

Regarding the log-loss, did the supplied function work for you? cutpointr should have thrown an error, because the metric function has to return a matrix with multiple rows for vector inputs. For example, log_loss(1:2, 4:5, 3:4, 7:8) should return something like

     log_loss
[1,] 379.9265
[2,] 449.0041

I changed res = cbind(-1 * sum(res)) to res = cbind(-1 * res) in the function to make it work.
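For illustration, a hedged sketch of such a vectorized metric (the name log_loss_sketch and the eps-thresholding are assumptions, not the PR's exact code): all arithmetic stays elementwise, and the result is a one-column matrix with one row per input element, which is why the sum() had to go.

```r
# Hedged sketch of a cutpointr-style metric function. Correct predictions
# contribute log(1 - eps) ~ 0, misclassified ones contribute log(eps).
log_loss_sketch <- function(tp, fp, tn, fn, eps = 1e-15, ...) {
  res <- (fp + fn) * log(eps) + (tp + tn) * log(1 - eps)
  res <- cbind(-1 * res)   # elementwise; cbind(-1 * sum(res)) would collapse the rows
  colnames(res) <- "log_loss"
  res
}
log_loss_sketch(1:2, 4:5, 3:4, 7:8)
#      log_loss
# [1,] 379.9265
# [2,] 449.0041
```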

Anyway, I'm still trying to wrap my head around the meaning of that metric for a binary predictor, such as in this case.

For a small eps the formula can be simplified to -1 * (fp + fn) [some negative number]. The plot of cutpoint values vs. metric values (via plot_metric) has the same shape as that of misclassification_cost for equal costs, and the latter is easier to interpret. misclassification_cost of course also finds the same optimal cutpoints, at the minimal sum of misclassifications.
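A quick numeric illustration of that simplification (assuming the eps-thresholded formula; the values match the example output above up to the sign convention):

```r
# With small eps, log(1 - eps) ~ 0, so the log-loss reduces to a constant
# multiple of the misclassification count fp + fn. Both metrics therefore
# rank cutpoints identically.
eps <- 1e-15
fp <- c(4, 5); fn <- c(7, 8)
loss <- -((fp + fn) * log(eps))   # 379.9265, 449.0041
loss / (fp + fn)                  # constant -log(eps) ~ 34.54 for every row
```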

Was there a certain application for the log loss?

muschellij2 commented 4 years ago

These seem like great enhancements! Many Kaggle competitions use log loss as the metric: https://www.kaggle.com/dansbecker/what-is-log-loss Dice/Jaccard is what I use for most imaging applications.

Best, John


Thie1e commented 4 years ago

Sure, on Kaggle we often develop a model for predicting probabilities, and these are then judged by log-loss. cutpointr just dichotomizes the probabilities, as you've noted, so we could just as well use misclassification_cost, which by default equals the number of misclassifications: it is easier to interpret than the log-loss, and the "loss function" has the same shape.

Did you actually use this in a Kaggle competition or somewhere else? Did it lead to an improvement? I'm just not aware of log-loss for purely binary predictions. Usually, we'd optimize the underlying model directly for log-loss.

muschellij2 commented 4 years ago

I haven't used it in Kaggle directly. I've been using it for other smaller projects, but I think the addition of Jaccard/Dice and the proposed use of the misclassification loss you've described is sufficient.

Best, John


Thie1e commented 4 years ago

Oh, maybe this was a misunderstanding - misclassification_cost is already in the package, so I was wondering why you were using log loss instead.

This is a bit tricky. I'm mainly worried that log loss with binary predictions may confuse some users; on the other hand, it is apparently not obvious that misclassification_cost with equal costs will find the same cutpoint.

I'm either going to add log loss anyway, or I'm going to add a remark about it in the help for misclassification_cost.