ChadiHelwe / MAFALDA

Potential problem in your evaluation metrics #2

Open aimfeld opened 3 days ago

aimfeld commented 3 days ago

I'm replicating your interesting work as part of my data science thesis, using a different prompting method and larger LLMs (GPT-4o, GPT-4o Mini). I'd love to get some feedback on a potential issue I found in your evaluation metrics.

For sentences where the model does not predict a fallacy, you add a 'nothing' (0) label here: https://github.com/ChadiHelwe/MAFALDA/blob/0df434477b914a20f55c0592ba05a53fe924c65b/src/evaluate.py#L236

However, in your ground truth spans, you add None labels for uncovered text spans, rather than 0 labels: https://github.com/ChadiHelwe/MAFALDA/blob/0df434477b914a20f55c0592ba05a53fe924c65b/src/evaluate.py#L363

I might be wrong, but I think this leads you to underestimate your performance metrics. I used your code to evaluate my model's predictions, calling this function: https://github.com/ChadiHelwe/MAFALDA/blob/0df434477b914a20f55c0592ba05a53fe924c65b/src/new_metrics.py#L189

For example, a perfect prediction according to your gold standard results in only precision = 0.5, recall = 0.5, f1 = 0.5.
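Here are the gold and prediction annotations for this example (the same spans as in the test case below; the variable names are just for illustration):

gold_annotations = AnnotatedText(
    [
        GroundTruthSpan('That leads to me believe that most cat lovers are really shy.', {13}, [84, 145]),
        GroundTruthSpan('Two of my best friends are really introverted, shy people, and they both have cats.', {None}, [0, 83])
    ]
)

pred_annotations = AnnotatedText(
    [
        PredictionSpan('That leads to me believe that most cat lovers are really shy.', 13, [84, 145]),
        PredictionSpan('Two of my best friends are really introverted, shy people, and they both have cats.', 0, [0, 83])
    ]
)

# The prediction matches the gold fallacy span exactly, yet the metric returns 0.5 across the board
p, r, f1 = text_full_task_p_r_f1(pred_annotations, gold_annotations)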

If I remove the nothing (0) prediction span from pred_annotations, I get precision = 1.0, recall = 0.5, f1 = 0.666. If I also remove the None span from gold_annotations, I get precision = 1.0, recall = 1.0, f1 = 1.0.
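Concretely, the two variants look like this (again with illustrative variable names, reusing gold_annotations and pred_annotations from above):

# Variant 1: drop only the 'nothing' (0) prediction span
pred_without_nothing = AnnotatedText(
    [
        PredictionSpan('That leads to me believe that most cat lovers are really shy.', 13, [84, 145])
    ]
)
p, r, f1 = text_full_task_p_r_f1(pred_without_nothing, gold_annotations)
# gives precision = 1.0, recall = 0.5, f1 = 0.666

# Variant 2: additionally drop the {None} span from the gold annotations
gold_without_none = AnnotatedText(
    [
        GroundTruthSpan('That leads to me believe that most cat lovers are really shy.', {13}, [84, 145])
    ]
)
p, r, f1 = text_full_task_p_r_f1(pred_without_nothing, gold_without_none)
# gives precision = 1.0, recall = 1.0, f1 = 1.0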

In your test cases, you don't add None-label GroundTruthSpans, nor do you add 0-label PredictionSpans for uncovered text, which is why the issue doesn't show up there.

aimfeld commented 1 day ago

Here's a test case for new_metrics.py which fails for the example above:

# Case 8: https://github.com/ChadiHelwe/MAFALDA/issues/2
# (import shown for completeness; adjust the path to wherever AnnotatedText,
# GroundTruthSpan, PredictionSpan and text_full_task_p_r_f1 are defined)
from new_metrics import (AnnotatedText, GroundTruthSpan, PredictionSpan,
                         text_full_task_p_r_f1)
text81 = 'Two of my best friends are really introverted, shy people, and they both have cats.'
text82 = 'That leads to me believe that most cat lovers are really shy.'

gd8 = AnnotatedText(
    [
        GroundTruthSpan(text82, {13}, [84, 145]),
        GroundTruthSpan(text81, {None}, [0, 83])
    ]
)

pd81 = AnnotatedText(
    [
        PredictionSpan(text82, 13, [84, 145]),
        PredictionSpan(text81, 0, [0, 83])
    ]
)

p, r, f1 = text_full_task_p_r_f1(pd81, gd8)
assert p == 1, f"Expected precision is 1 but got {p}"
assert r == 1, f"Expected recall is 1 but got {r}"
assert f1 == 1, f"Expected F1 is 1 but got {f1}"

My current workaround is to measure performance without adding the uncovered spans to the gold standard ({None}) and to the predictions (0); a sketch of how I do this is below.

Btw, I really like your paper and the disjunctive annotations, and also the taxonomy. I also work with the FALLACIES dataset and benchmark by Hong et al. (2024). It's a good benchmark as well, but 232 fallacy types are too many.
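For reference, here's a minimal sketch of the workaround in my own evaluation script. The helper functions and the (text, label, indices) tuple format are my own, not part of MAFALDA; only the AnnotatedText, PredictionSpan and GroundTruthSpan constructors come from your code:

def build_pred_annotations(spans):
    # spans: list of (text, label, [start, end]) tuples from my own parsing code;
    # skip uncovered / 'nothing' spans instead of labeling them 0
    return AnnotatedText([
        PredictionSpan(text, label, indices)
        for text, label, indices in spans
        if label != 0
    ])

def build_gold_annotations(spans):
    # spans: list of (text, labels, [start, end]) tuples, where labels is a set;
    # skip uncovered spans instead of labeling them {None}
    return AnnotatedText([
        GroundTruthSpan(text, labels, indices)
        for text, labels, indices in spans
        if labels != {None}
    ])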