Lightning-AI / torchmetrics

Torchmetrics - Machine learning metrics for distributed, scalable PyTorch applications.
https://lightning.ai/docs/torchmetrics/
Apache License 2.0

BERT score: maximum at self-comparison, symmetry, invariance to additional items #2728

Open GPPassos opened 1 week ago

GPPassos commented 1 week ago

🐛 Bug

I would expect the following properties of BERTScore:

1. Given a single list of sentences, comparing all pairs as preds and targets, BERTScore should be maximal when the same sentence is given as `pred[i]` and `target[i]`.
2. The F1 score should be unchanged when preds and targets are swapped.
3. With `idf=False`, extending the list of preds and the list of targets should not affect the scores for the inputs that were already there.

There are counterexamples for all of the properties above.
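For reference, property (1) can also be reproduced outside the test suite with a short standalone script. This is a minimal sketch only; it assumes torchmetrics' functional `bert_score` and downloads the default HuggingFace model on first use:

```python
# Minimal standalone sketch of property (1): self-comparisons should score highest.
# Assumes torchmetrics' functional bert_score; the default model is downloaded on first use.
from itertools import product

from torchmetrics.functional.text import bert_score

sentences = ["hello there", "master kenobi", "general kenobi"]
preds, targets = zip(*product(sentences, sentences))

score = bert_score(list(preds), list(targets), lang="en", idf=False, rescale_with_baseline=False)

for p, t, f1 in zip(preds, targets, score["f1"]):
    print(f"{p!r:18} vs {t!r:18} -> f1={float(f1):.4f}")

# Expected: rows with p == t have the highest f1 for their sentence;
# observed instead: e.g. ('master kenobi', 'general kenobi') scores higher
# than ('master kenobi', 'master kenobi').
```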

To Reproduce

Steps to reproduce the behavior: run the test suite with the following tests added to `test_bertscore.py`.

Proposed test suite:

```python
@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1), (False, 9), (True, 1), (True, 9)],
)
def test_bertscore_most_similar(idf: bool, batch_size: int):
    """Tests that BERTScore actually gives the highest score to self-similarity."""
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences, sentences))))

    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)

    for i in range(len(preds)):
        max_pred = i % (len(sentences)) * (1 + len(sentences))
        max_target = int(i / (len(sentences))) * (1 + len(sentences))
        assert score["f1"][i] <= score["f1"][max_pred], \
            f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_pred], targets[max_pred]}\n{i=}{max_pred=}"
        assert score["f1"][i] <= score["f1"][max_target], \
            f"pair: {preds[i], targets[i]} does not have a lower score than {preds[max_target], targets[max_target]}\n{i=}{max_target=}"


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1), (False, 9), (True, 1), (True, 9)],
)
def test_bertscore_symmetry(idf: bool, batch_size: int):
    """Tests that the BERTScore F1 score is symmetric between reference and prediction.

    As F1 is symmetric in precision and recall, swapping preds and targets should not change it.
    """
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    sentences = [short, long, longer]
    preds, targets = list(zip(*list(product(sentences, sentences))))

    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)

    for i in range(len(preds)):
        for j in range(len(targets)):
            if preds[i] == targets[j] and preds[j] == targets[i]:
                assert score['f1'][i] == pytest.approx(score['f1'][j]), \
                    f"f1 score for {(preds[i], targets[i])} is not the same as {(preds[j], targets[j])}."


@skip_on_connection_issues()
@pytest.mark.skipif(not _TRANSFORMERS_GREATER_EQUAL_4_4, reason="test requires transformers>4.4")
@pytest.mark.parametrize(
    ["idf", "batch_size"],
    [(False, 1), (False, 3)],
)
def test_bertscore_additional_sentence(idf: bool, batch_size: int):
    """Tests that BERTScore keeps the same scores for previous inputs when additional elements
    are added to the input lists. This should hold for idf=False.
    """
    short = "hello there"
    long = "master kenobi"
    longer = "general kenobi"
    preds = [long, long]
    targets = [long, short]

    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)
    longlong = score["f1"][0]
    longshort = score["f1"][1]
    # First index should be the self-comparison - sorting by length should not shuffle this
    assert longlong > longshort

    preds = preds + [short, longer]
    targets = targets + [longer, long]
    score = bert_score(preds, targets, idf=idf, lang="en", rescale_with_baseline=False, batch_size=batch_size)

    # First two indices should be exactly as in the previous call to the metric
    assert score["f1"][0] == pytest.approx(longlong)
    assert score["f1"][1] == pytest.approx(longshort)
    # Indices 1 and 2 should also be smaller than the self-comparison.
    assert score["f1"][0] > score["f1"][1]
    assert score["f1"][0] > score["f1"][2]
```
Test results (tracebacks condensed; the full output repeats the test bodies shown above):

```console
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] FAILED          [ 10%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] FAILED          [ 20%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] FAILED           [ 30%]
unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] FAILED           [ 40%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] FAILED              [ 50%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] FAILED              [ 60%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] FAILED               [ 70%]
unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] FAILED               [ 80%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] FAILED   [ 90%]
unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] FAILED   [100%]

==================================== FAILURES ====================================
_____________________ test_bertscore_most_similar[False-1] _____________________
idf = False, batch_size = 1
>       assert score["f1"][i] <= score["f1"][max_target], \
E       AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E         i=5max_target=4
E       assert tensor(0.9961) <= tensor(0.9664)
unittests/text/test_bertscore.py:220: AssertionError
_____________________ test_bertscore_most_similar[False-9] _____________________
idf = False, batch_size = 9
>       assert score["f1"][i] <= score["f1"][max_target], \
E       AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E         i=5max_target=4
E       assert tensor(0.9961) <= tensor(0.9664)
unittests/text/test_bertscore.py:220: AssertionError
_____________________ test_bertscore_most_similar[True-1] ______________________
idf = True, batch_size = 1
>       assert score["f1"][i] <= score["f1"][max_target], \
E       AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E         i=5max_target=4
E       assert tensor(0.9942) <= tensor(0.9674)
unittests/text/test_bertscore.py:220: AssertionError
_____________________ test_bertscore_most_similar[True-9] ______________________
idf = True, batch_size = 9
>       assert score["f1"][i] <= score["f1"][max_target], \
E       AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
E         i=5max_target=4
E       assert tensor(0.9942) <= tensor(0.9674)
unittests/text/test_bertscore.py:220: AssertionError
_______________________ test_bertscore_symmetry[False-1] _______________________
idf = False, batch_size = 1
>       assert score['f1'][i] == pytest.approx(score['f1'][j]), \
E       AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E       assert tensor(0.9663) == approx(0.9960...482 ± 1.0e-06)
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.02979564666748047
E         Max relative difference: 0.03083609460462107
E         Index | Obtained   | Expected
E         ()    | 0.96625876 | 0.9960544109344482 ± 1.0e-06
unittests/text/test_bertscore.py:250: AssertionError
_______________________ test_bertscore_symmetry[False-9] _______________________
idf = False, batch_size = 9
>       assert score['f1'][i] == pytest.approx(score['f1'][j]), \
E       AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E       assert tensor(0.9663) == approx(0.9960...482 ± 1.0e-06)
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.02979564666748047
E         Max relative difference: 0.03083609460462107
E         Index | Obtained   | Expected
E         ()    | 0.96625876 | 0.9960544109344482 ± 1.0e-06
unittests/text/test_bertscore.py:250: AssertionError
_______________________ test_bertscore_symmetry[True-1] ________________________
idf = True, batch_size = 1
>       assert score['f1'][i] == pytest.approx(score['f1'][j]), \
E       AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E       assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.027047932147979736
E         Max relative difference: 0.02796625947389248
E         Index | Obtained | Expected
E         ()    | 0.967163 | 0.994210958480835 ± 9.9e-07
unittests/text/test_bertscore.py:250: AssertionError
_______________________ test_bertscore_symmetry[True-9] ________________________
idf = True, batch_size = 9
>       assert score['f1'][i] == pytest.approx(score['f1'][j]), \
E       AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
E       assert tensor(0.9672) == approx(0.9942...835 ± 9.9e-07)
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.027047932147979736
E         Max relative difference: 0.02796625947389248
E         Index | Obtained | Expected
E         ()    | 0.967163 | 0.994210958480835 ± 9.9e-07
unittests/text/test_bertscore.py:250: AssertionError
__________________ test_bertscore_additional_sentence[False-1] _________________
idf = False, batch_size = 1
>       assert score["f1"][1] == pytest.approx(longshort)
E       assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.03361696004867554
E         Max relative difference: 0.0336169660598575
E         Index | Obtained  | Expected
E         ()    | 0.9999998 | 0.9663828611373901 ± 9.7e-07
unittests/text/test_bertscore.py:289: AssertionError
__________________ test_bertscore_additional_sentence[False-3] _________________
idf = False, batch_size = 3
>       assert score["f1"][1] == pytest.approx(longshort)
E       assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
E         comparison failed. Mismatched elements: 1 / 1:
E         Max absolute difference: 0.03361707925796509
E         Max relative difference: 0.033617085269168366
E         Index | Obtained  | Expected
E         ()    | 0.9999998 | 0.9663827419281006 ± 9.7e-07
unittests/text/test_bertscore.py:289: AssertionError

============================= short test summary info ============================
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[False-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-1] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_most_similar[True-9] - AssertionError: pair: ('master kenobi', 'general kenobi') does not have a lower score than ('master kenobi', 'master kenobi')
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[False-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-1] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_symmetry[True-9] - AssertionError: f1 score for ('hello there', 'general kenobi') is not the same as ('general kenobi', 'hello there').
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-1] - assert tensor(1.0000) == approx(0.9663...901 ± 9.7e-07)
FAILED unittests/text/test_bertscore.py::test_bertscore_additional_sentence[False-3] - assert tensor(1.0000) == approx(0.9663...006 ± 9.7e-07)
================= 10 failed, 33571 deselected, 71 warnings in 14.56s =================
```

Expected behavior

All tests above should pass.

Environment

Additional context

Maybe this is somehow related to tokenisation or encoding, but I have not confirmed that. Against this hypothesis is the fact that the failures still occur with `batch_size=1`.
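One way to probe the tokenisation hypothesis is to look at how the example sentences are tokenised. This is only a sketch; `roberta-large` is an assumption about the default English model, not something I have verified in the torchmetrics source:

```python
# Sketch: inspect how the example sentences are tokenised.
# "roberta-large" is an assumption about the default English model.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-large")
for sentence in ["hello there", "master kenobi", "general kenobi"]:
    print(sentence, "->", tokenizer.tokenize(sentence))
```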

This seems related to PR #2347. Perhaps the sorting is still done incorrectly?
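If that is the cause, the failure mode would look roughly like the following. This is a hypothetical illustration of mis-restored sort order, not actual torchmetrics code:

```python
# Hypothetical illustration (not torchmetrics code): if pairs are sorted by
# length for batching but the per-pair scores are never mapped back to the
# caller's order, the returned scores end up attached to the wrong pairs.
import torch

scores_sorted = torch.tensor([0.99, 0.97, 0.93])  # scores computed in sorted order
orig_index = torch.tensor([2, 0, 1])              # original position of each sorted item

buggy = scores_sorted                              # returned as-is: misaligned with the inputs

restored = torch.empty_like(scores_sorted)         # correct: scatter back to original positions
restored[orig_index] = scores_sorted
print(buggy.tolist(), restored.tolist())
```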

I have also checked that some of these tests fail with the original BERTScore implementation as well. I have considered whether these properties are simply not expected to hold, but I have found nothing in the paper or the documentation suggesting that they should not, at least when idf=False and no baseline rescaling is applied.
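For anyone who wants to repeat that cross-check, here is a sketch against the reference `bert-score` package (assumes `pip install bert-score`; see https://github.com/Tiiiger/bert_score):

```python
# Sketch of a cross-check against the reference bert-score package
# (assumes `pip install bert-score`).
from itertools import product

from bert_score import score

sentences = ["hello there", "master kenobi", "general kenobi"]
cands, refs = map(list, zip(*product(sentences, sentences)))

P, R, F1 = score(cands, refs, lang="en", idf=False, rescale_with_baseline=False)
for c, r, f1 in zip(cands, refs, F1.tolist()):
    print(f"{c!r:18} vs {r!r:18} -> f1={f1:.4f}")
```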

I am happy to submit a PR with the above tests, which currently all fail.

github-actions[bot] commented 1 week ago

Hi! Thanks for your contribution, great first issue!

SkafteNicki commented 1 week ago

cc: @stancld opinions on this?