IBM / unitxt

🦄 Unitxt: a python library for getting data fired up and set for training and evaluation
https://www.unitxt.ai
Apache License 2.0

matthews_correlation returning 0 on perfect correlation #439

Open yoavkatz opened 9 months ago

yoavkatz commented 9 months ago

Why is this the accepted behavior (strict=False was set a long time ago)?


The results of running the main metric used in the card (matthews_correlation) over simulated predictions that are equal to the references return a different score than expected. One would expect a perfect score of 1.0 in this case, but the returned metric score was 0.0. This is flagged only as a warning because strict=False was set in the call to test_card(). The predictions passed to the metrics were: ['acceptable', 'acceptable', 'acceptable']


dafnapension commented 8 months ago

reproducible via prepare.card.cola.py

dafnapension commented 8 months ago

For the case in question, where predictions = references = ['acceptable', 'acceptable', 'acceptable'], going by the book for MCC: we only have TP here (or only TN), and all three other components are 0, so both the numerator and the denominator of the formula are 0, and the end result is 0.0.

An elaborated proof "by hand" is in the attached screenshot (mcc2).
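For reference, a minimal by-hand check of that computation (a sketch added here, not part of the original comment; the formula is the standard binary MCC):

```python
import math

# predictions == references == ["acceptable"] * 3: only TP is populated.
tp, tn, fp, fn = 3, 0, 0, 0

numerator = tp * tn - fp * fn  # 0
denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # 0

# MCC = numerator / denominator; with 0/0 the conventional value is 0.
mcc = numerator / denominator if denominator else 0.0
print(mcc)  # 0.0
```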

yoavkatz commented 8 months ago

The HF metric calls scikit-learn:

https://huggingface.co/spaces/evaluate-metric/matthews_correlation/blame/0da51560adeb410656ba31b4cd1807c990898398/matthews_correlation.py

```python
from sklearn.metrics import matthews_corrcoef


def _compute(self, predictions, references, sample_weight=None):
    return {
        "matthews_correlation": float(
            matthews_corrcoef(references, predictions, sample_weight=sample_weight)
        ),
    }
```

[sklearn.metrics.matthews_corrcoef documentation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html)
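For completeness, the reported case can be reproduced directly with sklearn (a sketch, not from the original comment):

```python
from sklearn.metrics import matthews_corrcoef

references = ["acceptable", "acceptable", "acceptable"]
predictions = ["acceptable", "acceptable", "acceptable"]

# Only a single class appears, so the MCC denominator is 0 and sklearn returns 0.0.
print(matthews_corrcoef(references, predictions))  # 0.0
```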

yoavkatz commented 8 months ago

I think the issue is that in v=[0,0,0] or v=[1,1,1] there is only a single class. This is a special case that is not handled in the implementation.

yoavkatz commented 8 months ago

This seems to be a known issue that has a PR, but it was not fixed.

https://github.com/scikit-learn/scikit-learn/issues/25258

dafnapension commented 8 months ago

scikit-learn's implementation faithfully follows the definition (since there is only TN or only TP, and the other three components are all 0, the result, by the definition of matthews_corrcoef, is 0). The question is whether, in our case, where we 'fake' a full hit or a full miss to test a metric, we should tweak the fake.

yoavkatz commented 8 months ago

Right. The metric is ill-defined in this case (0/0). They suggest in the above issue to add a special flag for this, but they have not fixed it yet.

Can you repeat the above code with ref and pred each enumerating over (0,0), (0,1), (1,0), and (1,1) independently? I want to see all the corner cases.

| pred | ref | expected result |
| --- | --- | --- |
| (0,0) | (1,1) | 0 |
| (1,1) | (1,1) | 1 |
| (0,0) | (0,0) | 1 |
| (1,1) | (0,0) | 0 |

dafnapension commented 8 months ago

Gladly. I think that in all of your cases there is only a single confusion-matrix term that is 2, and the three others are 0, so the numerator is 0 in all of your cases:

(attached screenshot: mcc_yoavs_corners)
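A small sketch (not the original screenshot) that checks the four constant-vector cases from the table above with sklearn directly:

```python
from itertools import product

from sklearn.metrics import matthews_corrcoef

# In each combination three of the four confusion-matrix cells are 0,
# so the numerator is 0 and sklearn returns 0.0.
for pred, ref in product([(0, 0), (1, 1)], repeat=2):
    print(pred, ref, matthews_corrcoef(list(ref), list(pred)))
```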

yoavkatz commented 8 months ago

Ok. So we should add a check: if all the predictions are the same value (p) and all the references are the same value (r), we return 0 if p != r and 1 if p == r.
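A minimal sketch of that check as proposed here (a hypothetical helper wrapping sklearn, not unitxt's actual metric code; the discussion below refines the suggestion):

```python
from sklearn.metrics import matthews_corrcoef


def matthews_with_constant_check(references, predictions):
    # Hypothetical pre-check as suggested above: if both vectors are constant,
    # return 1.0 on a full match and 0.0 otherwise, instead of MCC's 0/0 convention.
    if len(set(references)) == 1 and len(set(predictions)) == 1:
        return 1.0 if references[0] == predictions[0] else 0.0
    return float(matthews_corrcoef(references, predictions))
```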

Can you also check that all these are between 0 and 1?

(1,0) (1,1)
(0,1) (1,1)
(1,0) (0,0)
(0,1) (0,0)

dafnapension commented 8 months ago

total loss for Matthews is -1, not 0:

(attached screenshot: mcc_yoavs_corners)
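A sketch of those additional corner cases (and the total-loss case) with sklearn directly, for reference:

```python
from sklearn.metrics import matthews_corrcoef

# The mixed single-class pairs above all have a zero numerator, so MCC is 0.0.
for pred, ref in [((1, 0), (1, 1)), ((0, 1), (1, 1)), ((1, 0), (0, 0)), ((0, 1), (0, 0))]:
    print(pred, ref, matthews_corrcoef(list(ref), list(pred)))  # 0.0 each

# Fully anti-correlated predictions give the minimum value, -1.
print(matthews_corrcoef([0, 1], [1, 0]))  # -1.0
```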

I think that since Matthews returns 0, by definition, in any case where the numerator in the formula is 0 (namely: (either TP or TN is 0) and (either FP or FN is 0)), no matter how nice the predictions are, I suggest adding a warning message in such a case, rather than overriding Matthews.
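A minimal standalone sketch of that suggestion (a hypothetical wrapper, not unitxt's actual metric code): keep sklearn's value and warn in the single-class degenerate case reported in this issue.

```python
import warnings

from sklearn.metrics import matthews_corrcoef


def matthews_with_degenerate_warning(references, predictions):
    score = float(matthews_corrcoef(references, predictions))
    if len(set(references) | set(predictions)) == 1:
        warnings.warn(
            "Only a single class appears in references and predictions; "
            "MCC is 0 by definition in this case."
        )
    return score
```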

yoavkatz commented 8 months ago

Yes, you are right: since this is a correlation, [1,0] and [0,1] are indeed anti-correlated (-1).

You can see what they did in f1 (and what they plan to do for Matthews) here:

https://github.com/scikit-learn/scikit-learn/pull/25531/files

zero_division : {"warn", 0.0, 1.0, np.nan}, default="warn" — sets the value to return when there is a zero division, i.e. when all predictions and labels are negative. If set to "warn", this acts as 0, but warnings are also raised.
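For illustration, this is how the zero_division knob behaves for f1_score (a sketch; np.nan support is assumed to require a recent scikit-learn release):

```python
import numpy as np
from sklearn.metrics import f1_score

# All labels are negative, so precision and recall both divide by zero.
y_true = [0, 0, 0]
y_pred = [0, 0, 0]

print(f1_score(y_true, y_pred))                        # 0.0, plus an UndefinedMetricWarning
print(f1_score(y_true, y_pred, zero_division=1.0))     # 1.0
print(f1_score(y_true, y_pred, zero_division=np.nan))  # nan
```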

However, we have no use for warnings. No one sees them, as the results are stored and viewed in a report. So we could return np.nan, but it would be odd for a perfect prediction to return a correlation of nan.