Hi @kapsner, nothing is wrong with the metrics. I can get it working in v1.2 with the following changes to your code:
```python
self.valid_acc = metrics.classification.Accuracy()
self.valid_precision = metrics.classification.Precision(num_classes=1, is_multiclass=False)
self.valid_recall = metrics.classification.Recall(num_classes=1, is_multiclass=False)
self.valid_statscores = metrics.classification.StatScores(num_classes=1, is_multiclass=False)
```
You need to call reset() after calling compute(). Your code does this for the statscore metric but not for the others, which is why you saw the metrics diverging more and more:
```python
# compute and log the epoch-level metrics
for _metname in self._pl_metrics:
    metric = getattr(self, _metname)  # look up the metric attribute by name
    self.log(
        name="pl/" + _metname,
        value=metric.compute(),
        prog_bar=False,
        logger=True,
        on_step=False,
        on_epoch=True
    )
    metric.reset()  # this is the missing step: reset the state after compute()
```
@SkafteNicki thx a lot for your quick answer. Indeed, this was the trick.
Maybe the disclaimer in the documentation is a bit misleading here:
> From v1.2 onward compute() will no longer automatically call reset(), and it is up to the user to reset metrics between epochs, except in the case where the metric is directly passed to LightningModule's self.log

(https://pytorch-lightning.readthedocs.io/en/stable/extensions/metrics.html#metric-arithmetics)
I thought I was already passing the precision/recall/accuracy metrics directly to self.log and thus did not need to call reset() explicitly.
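If I now read the docs correctly, the exception only applies when the Metric object itself is passed to self.log, not when the tensor returned by compute() is passed, as in my code. A minimal sketch of the difference (using the valid_acc attribute from above):

```python
# Variant 1: log the Metric object itself -- Lightning then takes care of
# calling compute() and reset() at the end of the epoch
self.log("pl/valid_acc", self.valid_acc, on_step=False, on_epoch=True)

# Variant 2: log the tensor returned by compute() -- the metric state keeps
# accumulating across epochs unless reset() is called manually
self.log("pl/valid_acc", self.valid_acc.compute(), on_step=False, on_epoch=True)
self.valid_acc.reset()
```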
🐛 Bug
Since it is working as expected in pl v1.8.1, I am transferring this discussion into a bug report.
I am unable to reproduce correct metrics using pytorch-lightning >=1.2.0.
I want to report the "classic" metrics for a binary use case, assuming class label 1 is the positive class (see also e.g. https://en.wikipedia.org/wiki/Precision_and_recall).
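For reference, a minimal sketch of the standard definitions I am comparing against, where tp, fp, tn, fn denote the entries of the binary confusion matrix:

```python
# standard binary classification metrics derived from the confusion matrix
# counts, with class label 1 as the positive class
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)
```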
Using the pytorch lightning >= 1.2.0 metrics API gives numbers that are not reproducible from the values of the confusion matrix (all provided examples below assume that there is no issue with metrics.classification.ConfusionMatrix).
Please reproduce using the BoringModel
The BoringModels are all hosted in Kaggle notebooks. Metrics from pytorch lightning are compared with those calculated from the confusion matrix. I have tried several combinations of arguments but was not able to find a way to reproduce the correct/expected numbers.
To Reproduce (colors in screenshots indicate corresponding metrics)
pl 1.2.2, num_classes=None, is_multiclass=False: https://www.kaggle.com/nonserial/pl-1-2-2-error-num-cls-none-is-multiclass-false
pl 1.2.2, setting num_classes=1 (as suggested by @SkafteNicki in the discussion), is_multiclass=False: https://www.kaggle.com/nonserial/pl-1-2-2-error-num-cls-1-is-multiclass-false
pl 1.2.2, setting num_classes=1, is_multiclass=None: https://www.kaggle.com/nonserial/pl-1-2-2-error-num-cls-1-is-multiclass-none/output --> results in an error message
Expected behavior
I expect to get the same values from pytorch lightning's metrics API as when calculating them from the numbers in the confusion matrix.
As one can see, this worked in pytorch lightning version 1.8.1: https://www.kaggle.com/nonserial/pl-1-8-1-metrics-correct --> the corresponding metrics are exactly the same, as expected.
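For illustration, a minimal sketch of the consistency check the notebooks perform (the toy preds/target tensors are made up; on pl >= 1.2.0 the two printed values disagree, while on the working version they match):

```python
import torch
from pytorch_lightning import metrics

preds = torch.tensor([1, 0, 1, 1, 0, 1])
target = torch.tensor([1, 0, 0, 1, 1, 1])

# confusion matrix: rows are true classes, columns are predicted classes
cm = metrics.classification.ConfusionMatrix(num_classes=2)(preds, target)
tn, fp, fn, tp = cm[0, 0], cm[0, 1], cm[1, 0], cm[1, 1]

pl_precision = metrics.classification.Precision(num_classes=1, is_multiclass=False)
# expected: the API value matches tp / (tp + fp) computed by hand
print(pl_precision(preds, target), tp / (tp + fp))
```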
Additional context
I have not yet found time to check whether F1 score, FBeta, and AUC are also affected.