Closed · jacanchaplais closed this issue 3 years ago
Hi! Thanks for your contribution! Great first issue!
Actually, thinking about it, there may well be a difference between the average of the harmonic means over a validation set and the harmonic mean of the averages over that set. I think this may explain it; apologies. I will do some maths and close the issue if it turns out I am being a moron.
I am a moron confirmed. Apologies for wasting your time.
import numpy as np
from statistics import harmonic_mean  # missing from the original snippet

# 100 simulated (precision, recall) pairs over a validation set
prec_recall_set = np.random.rand(100, 2)

# interpretation 1: F1 for each pair, then the arithmetic mean of the F1s
all_f1s = np.array([harmonic_mean(prec_recall) for prec_recall in prec_recall_set])
mean_f1 = all_f1s.mean()

# interpretation 2: average precision and recall first, then take F1
mean_prec_recall = np.mean(prec_recall_set, axis=0)
f1_of_means = harmonic_mean(mean_prec_recall)

# the harmonic mean is concave, so by Jensen's inequality
# f1_of_means >= mean_f1 in general
print(mean_f1)      # 0.4254433011584456
print(f1_of_means)  # 0.5107380180397513
@jacanchaplais no need, thanks for your concern about the implementation's correctness :rabbit:
I think this point is actually well worth clarifying in the documentation, and it might even be worth adding a macro-pre (or, alternatively, pre-harmonic-macro or early-macro) averaging method that corresponds to the OP's original interpretation.
From this note and my own experience/investigations, both interpretations of macro-averaged F1 exist in the literature. And although it seems that torchmetrics, sklearn, and tensorflow are all consistent in their interpretation, having both available in torchmetrics could ease comparison with prior work and highlight the distinction, so that whichever is chosen is properly described.
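To make the two interpretations concrete, here is a minimal sketch using sklearn (whose macro averaging, like torchmetrics', takes the mean of per-class F1 scores); the label arrays are made-up toy data:

import numpy as np
from statistics import harmonic_mean
from sklearn.metrics import f1_score, precision_score, recall_score

# toy multi-class labels, purely illustrative
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 0, 2, 2])

# current sklearn/torchmetrics interpretation:
# per-class F1 first, then the arithmetic mean over classes
macro_f1 = f1_score(y_true, y_pred, average="macro")

# the OP's interpretation: macro-average precision and recall first,
# then take their harmonic mean
macro_p = precision_score(y_true, y_pred, average="macro")
macro_r = recall_score(y_true, y_pred, average="macro")
pre_macro_f1 = harmonic_mean([macro_p, macro_r])

print(macro_f1, pre_macro_f1)  # ~0.624 vs ~0.638: the two disagree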
As to the source of the OP's original interpretation (and my own!): the book commonly credited with introducing the F-measure discusses micro- and macro-averaging schemes in the context of precision-recall curves before introducing the equivalent of F1 (see this article for a derivation connecting the two; the original definition in this article is mostly reprinted verbatim in the book, but does not mention either type of averaging). After describing micro- and macro-averaged precision and recall, the book then defines the "effectiveness" measure, F, in terms of precision and recall (without reference to any type of averaging), so that might be the source of that interpretation. (It would be interesting to ask Dr. van Rijsbergen what his recommendation would have been.)
🐛 Bug
Hi there. I am using the modular forms of BinnedPrecisionRecallCurve and F1 to calculate precision, recall, and F1. I noticed that F1 went down while both precision and recall went up, so I manually checked that F1 was consistent with the harmonic mean, and it seemed to be mostly too high. E.g. for the last data points on my precision (0.1296), recall (0.77197), and F1 (0.3508) graphs, the harmonic mean works out to 2 × 0.1296 × 0.77197 / (0.1296 + 0.77197) ≈ 0.2219, which is significantly smaller than the value reported by the metric.
Am I missing something? I am running a tuning algorithm to optimise the F1 score, so it is crucial that it is correct.
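For reference, the manual check above boils down to the following (a minimal sketch; the two numbers are read off the graphs mentioned above):

precision, recall = 0.1296, 0.77197

# F1 is defined as the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ~0.2219, well below the reported 0.3508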
Code sample
You can see my implementation of the metrics here (from line 70 to EOF); it is relatively simple and (I think) follows the recommendations in the documentation.
https://github.com/jacanchaplais/cluster_gnn/blob/f-gnn/src/cluster_gnn/models/gnn.py#L70
Environment
Installed via conda with the following environment.yml
Automatically determined version numbers:
PyTorch Lightning 1.3.1
TorchMetrics 0.3.2
TensorBoard 2.4.1
Ray[tune] 1.1.0
I am also using PyTorch Geometric installed via pip, although this shouldn't be relevant, as the metrics are never exposed to it.