Clarification on metrics

rbiswas4 commented 6 years ago

I went through your notebook outline, which has compiled a large number of metrics, many of which I was totally unaware of. Thanks Alex for setting this up to bring other alternatives into the discussion!

I had the following questions:

ROC : Is there any literature about how to consider multiple thresholds?
Standard Score (zScore) : Is the Wikipedia article linked correctly?
Logloss : If my quick reading was correct, this seems to suggest using an objective function (which is good but hard in general). Why could one not have a multi-class objective function though?
Brier Score : Sounds interesting, except that from the definition it seems to make sense only for time-dependent phenomena. Did you have this in mind as a metric for classification over time (early and late), which you suggested earlier, or is there a way to use it for the complete classification problem too (without it becoming a simpler score)?
Precision Recall AUC : Why is this supposed to be better than (ROC) AUC? Is it some kind of smoothness for very few objects, where statistics do not work anyway?
I assume that by probabilistic vs deterministic metrics, you mean metrics that act on results of classification with a 0,1 score or on probabilities. Is there a good example of the latter? I would have guessed something like Entropy or a loss function based on results in cosmology.
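To make the log-loss and Brier score questions concrete, here is a minimal sketch of both as multi-class metrics acting directly on probabilities. It assumes each object has a vector of class probabilities and a single true class index; the function names and the `class_weights` argument are hypothetical and not taken from the notebook. The weighting shown is only one possible way of treating classes unequally.

```python
import math

def multiclass_brier(probs, labels):
    """Mean over objects of the squared distance between the predicted
    probability vector and the one-hot vector of the true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pk - (1.0 if k == y else 0.0)) ** 2
                     for k, pk in enumerate(p))
    return total / len(probs)

def weighted_log_loss(probs, labels, class_weights=None, eps=1e-15):
    """Log-loss averaged within each true class, then combined across
    classes with (hypothetical) per-class weights."""
    n_classes = len(probs[0])
    if class_weights is None:
        class_weights = [1.0] * n_classes  # equal weight per class
    loss_sum = [0.0] * n_classes
    counts = [0] * n_classes
    for p, y in zip(probs, labels):
        p_true = min(max(p[y], eps), 1.0 - eps)  # clip to avoid log(0)
        loss_sum[y] -= math.log(p_true)
        counts[y] += 1
    num = 0.0
    den = 0.0
    for k in range(n_classes):
        if counts[k] > 0:  # skip classes with no true members
            num += class_weights[k] * loss_sum[k] / counts[k]
            den += class_weights[k]
    return num / den
```

Neither function refers to time: both only compare a probability vector with the true class, so they apply to a single, complete classification as well as to repeated classifications at different epochs.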

aimalz commented 6 years ago

I haven't done the deepest literature search (I'd give you links, but it's really just everything on the first page of Google search results for "criticism of auc roc"), but I think the idea is that all thresholds are considered to get a final result. Or do you mean different thresholds for different classes? I've been wondering about that myself, as it's related to how to treat classes differently, with weights or something similar.
The link is where one is directed from the ROC page here.
The implementations I saw on Kaggle for the multi-class log-loss used what was more or less an average with equal weight to each object that doesn't really account for hierarchy or covariance, but you're right that it doesn't necessarily have to be that way. I think your idea of a customized objective function is spot-on and very much the right direction to be thinking in.
Maybe I'm being dense, but in what way is the Brier score specific to time-dependent data?
Yes, the issue is effective extrapolation when there are too few objects. (Some attention is given to this matter in the "literature search" results mentioned above, in the context of medicine.)
I actually meant something closer to the classification problem and farther from the science. Deterministic metrics would include those derived from accuracy, precision, recall, or a confusion matrix (like the Matthews Correlation Coefficient). (And then there's this one that I'm having trouble categorizing.) Varying a threshold over probabilistic classifications (as in the ROC/AUC) could be applied to other deterministic metrics to obtain a novel probabilistic metric. I'm hoping to get a feel for how such a thing would behave. As for science metrics, I think it would be hard to use any that didn't prioritize one science goal over others or make too many assumptions, so we might want to stay away from that.
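As a rough sketch of the threshold-varying idea above, the following applies the ROC-style sweep to the Matthews Correlation Coefficient for a single class treated as one-vs-rest. The helper names, the threshold grid, and the choice to summarize the curve by its trapezoidal area are all assumptions made for illustration; how best to aggregate over thresholds is exactly the open question.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.
    Returns 0.0 when the denominator vanishes (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def mcc_curve(scores, truth, thresholds):
    """MCC of the deterministic classification obtained by cutting the
    class probabilities `scores` at each threshold."""
    curve = []
    for t in thresholds:
        tp = fp = tn = fn = 0
        for s, y in zip(scores, truth):
            predicted = s >= t
            if predicted and y:
                tp += 1
            elif predicted and not y:
                fp += 1
            elif not predicted and y:
                fn += 1
            else:
                tn += 1
        curve.append(mcc(tp, fp, tn, fn))
    return curve

def area_under_curve(xs, ys):
    """Trapezoidal area under the points (xs, ys), with xs increasing."""
    return sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2.0
               for i in range(len(xs) - 1))

# Toy example: four objects with probabilities for one class.
scores = [0.9, 0.8, 0.2, 0.1]
truth = [1, 1, 0, 0]
thresholds = [0.0, 0.5, 1.0]
curve = mcc_curve(scores, truth, thresholds)
summary = area_under_curve(thresholds, curve)
```

A finer threshold grid would trace out the full curve; the three-point grid here is only meant to keep the example easy to follow.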

aimalz commented 6 years ago

I'm going to close this since a lot of the questions were settled with #3.

aimalz / proclam

Clarification on metrics #1