Lightning-AI / pytorch-lightning


Wrong metrics in version >= 1.2.0 #6396

Closed: kapsner closed this issue 3 years ago

kapsner commented 3 years ago

🐛 Bug

Since everything works as expected in pl v1.1.8, I am turning this discussion into a bug report.

I am unable to reproduce correct metrics using pytorch-lightning >= 1.2.0.

I want to report "classic" metrics (accuracy, precision, recall, stat scores) for a binary use case, assuming class label 1 is the positive class (see also e.g. https://en.wikipedia.org/wiki/Precision_and_recall).

Using the pytorch-lightning >= 1.2.0 metrics API gives numbers that cannot be reproduced from the values in the confusion matrix (all examples below assume there is no issue with metrics.classification.ConfusionMatrix).
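For reference, the "expected" numbers are derived from the confusion matrix as in the minimal sketch below. It assumes the usual 2x2 layout with rows = true class and columns = predicted class (as reported by metrics.classification.ConfusionMatrix(num_classes=2)) and class 1 as the positive class; the helper name is illustrative, not from the notebooks.

    import torch

    def metrics_from_confmat(confmat: torch.Tensor):
        # assumed layout: rows = true class, columns = predicted class
        confmat = confmat.float()
        tn, fp = confmat[0, 0], confmat[0, 1]
        fn, tp = confmat[1, 0], confmat[1, 1]
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return accuracy, precision, recall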

Please reproduce using the BoringModel

The boring models are all hosted in Kaggle notebooks. Metrics from pytorch lightning are compared with those calculated from the confusion matrix. I have tried several combinations of arguments but was not able to find a way to reproduce the correct/expected numbers.

To Reproduce (colors in screenshots indicate corresponding metrics)

Expected behavior

I expect to get the same values when using pytorch lightning's metrics API compared to calculating them with the numbers from the confusion matrix.

As one can see, this worked in pytorch lightning version 1.1.8 (https://www.kaggle.com/nonserial/pl-1-8-1-metrics-correct): the corresponding metrics are exactly the same, as expected.

Additional context

I did not yet find time to check whether the F1 score, FBeta, and AUC are also affected.

SkafteNicki commented 3 years ago

Hi @kapsner, nothing is wrong with the metrics. I can get it working in v1.2 with the following changes to your code:

  1. Initialize the metrics as follows:
    self.valid_acc = metrics.classification.Accuracy()
    self.valid_precision = metrics.classification.Precision(num_classes=1, is_multiclass=False)
    self.valid_recall = metrics.classification.Recall(num_classes=1, is_multiclass=False)
    self.valid_statscores = metrics.classification.StatScores(num_classes=1, is_multiclass=False)
  2. Remember to call reset() after calling compute(). Your code does this for the StatScores metric but not for the others, which is why you saw the metrics diverge more and more (an eval-free variant of this loop is sketched after the list):
    # compute metrics
    for _metname in self._pl_metrics:
        self.log(
            name="pl/" + _metname,
            value=eval("self." + _metname + ".compute()"),
            prog_bar=False,
            logger=True,
            on_step=False,
            on_epoch=True
        )
        eval("self." + _metname + ".reset()")  # this is missing
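The same loop can also be written without eval via getattr; this is a minimal sketch, assuming self._pl_metrics holds the attribute names of the metric objects initialized in step 1:

    # compute, log and reset each metric (equivalent to the eval-based loop)
    for _metname in self._pl_metrics:
        metric = getattr(self, _metname)
        self.log(
            name="pl/" + _metname,
            value=metric.compute(),
            prog_bar=False,
            logger=True,
            on_step=False,
            on_epoch=True,
        )
        metric.reset()  # required from v1.2 onward when logging computed values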
kapsner commented 3 years ago

@SkafteNicki thanks a lot for your quick answer. Indeed, this was the trick.

Maybe the disclaimer in the documentation is a bit misleading here:

From v1.2 onward compute() will no longer automatically call reset(), and it is up to the user to reset metrics between epochs, except in the case where the metric is directly passed to the LightningModule's self.log

(https://pytorch-lightning.readthedocs.io/en/stable/extensions/metrics.html#metric-arithmetics)

I thought I was already passing the precision/recall/accuracy metrics directly to self.log and thus did not need to call reset() explicitly.
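For what it's worth, the documented exception applies when the metric object itself is passed to self.log, not the tensor returned by compute(); the loop above logs the computed value, so the automatic reset does not kick in. A minimal sketch of the difference (assuming the valid_acc attribute from the comment above and (x, y) validation batches; names are illustrative, not the notebooks' exact code):

    def validation_step(self, batch, batch_idx):
        x, y = batch
        preds = self(x)  # assumed to return probabilities for class 1
        # update the metric state with this batch
        self.valid_acc(preds, y)

        # Case 1: pass the metric OBJECT -> Lightning calls compute() and
        # reset() for you at epoch end (the documented exception).
        self.log("pl/valid_acc", self.valid_acc, on_step=False, on_epoch=True)

        # Case 2: pass only the VALUE from compute() -> it is a plain tensor,
        # so the exception does not apply and reset() must be called manually
        # (this is what the eval-based loop above does).
        # self.log("pl/valid_acc_manual", self.valid_acc.compute(), on_epoch=True)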