aimfeld opened this issue 3 days ago
Here's a test case for `new_metrics.py` which fails for the example discussed below:
```python
# Case 8: https://github.com/ChadiHelwe/MAFALDA/issues/2
text81 = 'Two of my best friends are really introverted, shy people, and they both have cats.'
text82 = 'That leads to me believe that most cat lovers are really shy.'
gd8 = AnnotatedText(
    [
        GroundTruthSpan(text82, {13}, [84, 145]),
        GroundTruthSpan(text81, {None}, [0, 83]),
    ]
)
pd81 = AnnotatedText(
    [
        PredictionSpan(text82, 13, [84, 145]),
        PredictionSpan(text81, 0, [0, 83]),
    ]
)
p, r, f1 = text_full_task_p_r_f1(pd81, gd8)
assert p == 1, f"Expected precision is 1 but got {p}"
assert r == 1, f"Expected recall is 1 but got {r}"
assert f1 == 1, f"Expected F1 is 1 but got {f1}"
```
My current workaround is to measure performance without adding uncovered spans to the gold standard (`{None}`) and predictions (`0`).
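In code, the workaround looks roughly like this (a sketch; I'm assuming the classes and `text_full_task_p_r_f1` can be imported from `src/new_metrics.py` as in the test case above):

```python
from new_metrics import (  # assuming src/new_metrics.py is on the path
    AnnotatedText,
    GroundTruthSpan,
    PredictionSpan,
    text_full_task_p_r_f1,
)

# Same example as above, but without padding the uncovered text:
# no {None} ground truth span and no 'nothing' (0) prediction span.
text82 = 'That leads to me believe that most cat lovers are really shy.'

gold = AnnotatedText([GroundTruthSpan(text82, {13}, [84, 145])])
pred = AnnotatedText([PredictionSpan(text82, 13, [84, 145])])

p, r, f1 = text_full_task_p_r_f1(pred, gold)
print(p, r, f1)  # for me this gives 1.0, 1.0, 1.0 (see the numbers further down)
```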
Btw, I really like your paper, the disjunctive annotations, and the taxonomy. I also work with the FALLACIES dataset and benchmark by Hong et al. (2024); it's a good benchmark as well, but 232 fallacy types are too many.
I'm replicating your interesting work as a part of my data science thesis, using a different prompt method and larger LLMs (GPT-4o, GPT-4o Mini). I'd love to get some feedback regarding a potential issue I found in your evaluation metrics.
For sentences where the model does not predict a fallacy, you add a `'nothing'` (`0`) label here: https://github.com/ChadiHelwe/MAFALDA/blob/0df434477b914a20f55c0592ba05a53fe924c65b/src/evaluate.py#L236

However, in your ground truth spans, you add `None` labels for uncovered text spans rather than `0` labels: https://github.com/ChadiHelwe/MAFALDA/blob/0df434477b914a20f55c0592ba05a53fe924c65b/src/evaluate.py#L363

I might be wrong, but I think you might be underestimating your performance metrics. I used your code to evaluate my model's predictions, calling this function: https://github.com/ChadiHelwe/MAFALDA/blob/0df434477b914a20f55c0592ba05a53fe924c65b/src/new_metrics.py#L189
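My understanding of the effect (a simplified illustration of the mismatch, not your actual matching logic):

```python
# Simplified illustration (my assumption about the effect, not the actual code):
# the uncovered ground truth span carries the label set {None} (evaluate.py#L363),
# while the padded prediction span carries the label 0 (evaluate.py#L236),
# so the two padding spans can never count as a match.
gold_label_set = {None}
pred_label = 0

print(pred_label in gold_label_set)  # False -> the uncovered span is scored as a miss
```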
For example, the following prediction, which is perfect according to your gold standard, results in only precision = 0.5, recall = 0.5, f1 = 0.5:

> Two of my best friends are really introverted, shy people, and they both have cats. That leads to me believe that most cat lovers are really shy.

Ground truth spans from `build_ground_truth_spans()`:

```
That leads to me believe that most cat lovers are really shy. - [84, 145] - {13}
Two of my best friends are really introverted, shy people, and they both have cats. - [0, 83] - {None}
```

Prediction spans (`pred_annotations`):

```
That leads to me believe that most cat lovers are really shy. - [84, 145] - 13
Two of my best friends are really introverted, shy people, and they both have cats. - [0, 83] - 0
```
If I remove the nothing (`0`) prediction span from `pred_annotations`, I get precision = 1.0, recall = 0.5, f1 = 0.666. If I also remove the `None` span from the `gold_annotations`, I get precision = 1.0, recall = 1.0, f1 = 1.0.

In your test cases you don't add `None`-label `GroundTruthSpan`s, nor do you add `0`-label `PredictionSpan`s for all uncovered text. Therefore you don't see this issue in your test cases.
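In case it helps, one possible direction (just a sketch of what I have in mind, not a claim about your intended semantics) would be to make the two paddings comparable, e.g. by normalizing `None` to `0` on the ground truth side before matching:

```python
# Hypothetical normalization step (name and placement are my own suggestion):
# map the uncovered-text label None to 0 so that the gold padding ({None})
# and the prediction padding (0) can match each other.
def normalize_uncovered_labels(labels: set) -> set:
    return {0 if label is None else label for label in labels}

print(normalize_uncovered_labels({None}))  # {0}
print(normalize_uncovered_labels({13}))    # {13}
```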