Lightning-AI / torchmetrics

Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

`rouge_score` with `accumulate='best'` gives mixed results #2148

Closed: volksen closed this issue 2 days ago

volksen commented 1 year ago

🐛 Bug

Hi,

When using rouge_score with accumulate="best", the results depend on the order of the references. As I understand it, accumulate="best" should return the best F-score over all references.

Minimal example:

from torchmetrics.functional.text import rouge_score

preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]  # same references, reversed order

print(rouge_score(preds, references, accumulate='best'))
print(rouge_score(preds, references_rev, accumulate='best'))

gives different results:

{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(1.), 'rouge2_precision': tensor(1.), 'rouge2_recall': tensor(1.), 'rougeL_fmeasure': tensor(1.), 'rougeL_precision': tensor(1.), 'rougeL_recall': tensor(1.), 'rougeLsum_fmeasure': tensor(1.), 'rougeLsum_precision': tensor(1.), 'rougeLsum_recall': tensor(1.)}
{'rouge1_fmeasure': tensor(1.), 'rouge1_precision': tensor(1.), 'rouge1_recall': tensor(1.), 'rouge2_fmeasure': tensor(0.), 'rouge2_precision': tensor(0.), 'rouge2_recall': tensor(0.), 'rougeL_fmeasure': tensor(0.3333), 'rougeL_precision': tensor(0.3333), 'rougeL_recall': tensor(0.3333), 'rougeLsum_fmeasure': tensor(0.3333), 'rougeLsum_precision': tensor(0.3333), 'rougeLsum_recall': tensor(0.3333)}

Did I misread the documentation, or is this a bug? accumulate='avg' works as expected. Maybe the bug is in https://github.com/Lightning-AI/torchmetrics/blob/v1.1.0/src/torchmetrics/functional/text/rouge.py#L378, where there is a TODO comment.
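For illustration, here is a rough sketch of the order-independent behaviour I would expect from accumulate='best', built on top of the functional API (the rouge_best helper below is only my own illustration, not torchmetrics code): score each reference separately and, per ROUGE key, keep the scores of the reference with the highest F-measure. Notably, in the reversed case above the rouge2/rougeL values (0 and 0.3333) match the scores against the first reference "c b a", so it looks like the current selection sticks with the first reference for those keys.

from torchmetrics.functional.text import rouge_score

# Hypothetical helper (not part of torchmetrics): score each reference on its
# own and, per ROUGE key, keep the scores of the reference whose F-measure is
# highest, so the result cannot depend on the order of the references.
def rouge_best(pred, refs, rouge_keys=("rouge1", "rouge2", "rougeL", "rougeLsum")):
    per_ref = [rouge_score(pred, ref, rouge_keys=rouge_keys) for ref in refs]
    best = {}
    for key in rouge_keys:
        # reference with the highest F-measure for this key
        winner = max(per_ref, key=lambda scores: scores[f"{key}_fmeasure"].item())
        for suffix in ("fmeasure", "precision", "recall"):
            best[f"{key}_{suffix}"] = winner[f"{key}_{suffix}"]
    return best

print(rouge_best("a b c", ["a b c", "c b a"]))
print(rouge_best("a b c", ["c b a", "a b c"]))  # identical output either way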

I compared the results to the rouge-score package:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=False)
preds = "a b c"
references = ["a b c", "c b a"]
references_rev = ["c b a", "a b c"]
print(scorer.score_multi(references, preds))
print(scorer.score_multi(references_rev, preds))

which gives the same results in both cases:

{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}
{'rouge1': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rouge2': Score(precision=1.0, recall=1.0, fmeasure=1.0), 'rougeL': Score(precision=1.0, recall=1.0, fmeasure=1.0)}

Environment

github-actions[bot] commented 1 year ago

Hi! Thanks for your contribution, great first issue!

stancld commented 1 year ago

Thanks for the report! Gonna check this weekend.

Borda commented 9 months ago

> Thanks for the report! Gonna check this weekend.

@stancld, did you have a chance to look at it? 🐇

rittik9 commented 2 months ago

@Borda pls assign it to me