Lightning-AI / pytorch-lightning


Wrong metrics in version >= 1.2.0 #6396

Closed: kapsner closed this issue 3 years ago

kapsner commented 3 years ago

🐛 Bug

Since everything works as expected in pl v1.1.8, I am turning this discussion into a bug report.

I am unable to reproduce correct metrics using pytorch-lightning >= 1.2.0.

I want to report "classic" metrics (accuracy, precision, recall, stat scores) for a binary use case, assuming class label 1 is the positive class (see also e.g. https://en.wikipedia.org/wiki/Precision_and_recall).

Using the pytorch-lightning >= 1.2.0 metrics API gives numbers that cannot be reproduced from the values in the confusion matrix (all examples below assume there is no issue with metrics.classification.ConfusionMatrix).
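For reference, the "expected" numbers are derived from the confusion matrix as in the minimal sketch below. It assumes the usual 2x2 layout with rows = true class and columns = predicted class (as reported by metrics.classification.ConfusionMatrix(num_classes=2)) and class 1 as the positive class; the helper name is illustrative, not from the notebooks.

    import torch

    def metrics_from_confmat(confmat: torch.Tensor):
        # assumed layout: rows = true class, columns = predicted class
        confmat = confmat.float()
        tn, fp = confmat[0, 0], confmat[0, 1]
        fn, tp = confmat[1, 0], confmat[1, 1]
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return accuracy, precision, recall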

Please reproduce using the BoringModel

The boring models are all hosted in Kaggle notebooks. Metrics from pytorch lightning are compared with those calculated from the confusion matrix. I have tried several combinations of arguments but was not able to find a way to reproduce the correct/expected numbers.

To Reproduce (colors in screenshots indicate corresponding metrics)

Expected behavior

I expect to get the same values when using pytorch lightning's metrics API compared to calculating them with the numbers from the confusion matrix.

As one can see, this worked in pytorch lightning version 1.1.8 (https://www.kaggle.com/nonserial/pl-1-8-1-metrics-correct): the corresponding metrics are exactly the same, as expected.

Additional context

I did not yet find time to check whether the F1 score, FBeta, and AUC are also affected.

SkafteNicki commented 3 years ago

Hi @kapsner, nothing is wrong with the metrics. I can get it working in v1.2 with the following changes to your code:

  1. Initialize the metrics as follows:
    self.valid_acc = metrics.classification.Accuracy()
    self.valid_precision = metrics.classification.Precision(num_classes=1, is_multiclass=False)
    self.valid_recall = metrics.classification.Recall(num_classes=1, is_multiclass=False)
    self.valid_statscores = metrics.classification.StatScores(num_classes=1, is_multiclass=False)
  2. Remember to call reset() after calling compute(). Your code does this for the StatScores metric but not for the others, which is why you saw the metrics diverge more and more (an eval-free variant of this loop is sketched after the list):
    # compute metrics
    for _metname in self._pl_metrics:
        self.log(
            name="pl/" + _metname,
            value=eval("self." + _metname + ".compute()"),
            prog_bar=False,
            logger=True,
            on_step=False,
            on_epoch=True
        )
        eval("self." + _metname + ".reset()")  # this is missing
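The same loop can also be written without eval via getattr; this is a minimal sketch, assuming self._pl_metrics holds the attribute names of the metric objects initialized in step 1:

    # compute, log and reset each metric (equivalent to the eval-based loop)
    for _metname in self._pl_metrics:
        metric = getattr(self, _metname)
        self.log(
            name="pl/" + _metname,
            value=metric.compute(),
            prog_bar=False,
            logger=True,
            on_step=False,
            on_epoch=True,
        )
        metric.reset()  # required from v1.2 onward when logging computed values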
kapsner commented 3 years ago

@SkafteNicki thanks a lot for your quick answer. Indeed, this was the trick.

Maybe the disclaimer in the documentation is a bit misleading here:

From v1.2 onward compute() will no longer automatically call reset(), and it is up to the user to reset metrics between epochs, except in the case where the metric is directly passed to the LightningModule's self.log

(https://pytorch-lightning.readthedocs.io/en/stable/extensions/metrics.html#metric-arithmetics)

I thought I was already passing the precision/recall/accuracy metrics directly to self.log and thus did not need to call reset() explicitly.
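For what it's worth, the documented exception applies when the metric object itself is passed to self.log, not the tensor returned by compute(); the loop above logs the computed value, so the automatic reset does not kick in. A minimal sketch of the difference (assuming the valid_acc attribute from the comment above and (x, y) validation batches; names are illustrative, not the notebooks' exact code):

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)  # assumed to return probabilities for class 1
        # update the metric state with this batch
        self.valid_acc(preds, y)

        # Case 1: pass the metric OBJECT -> Lightning calls compute() and
        # reset() for you at epoch end (the documented exception).
        self.log("pl/valid_acc", self.valid_acc, on_step=False, on_epoch=True)

        # Case 2: pass only the VALUE from compute() -> it is a plain tensor,
        # so the exception does not apply and reset() must be called manually
        # (this is what the eval-based loop above does).
        # self.log("pl/valid_acc_manual", self.valid_acc.compute(), on_epoch=True)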