explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Scorer.score_tokenization token_acc calculation not as documented #12033

Closed toastynews closed 1 year ago

toastynews commented 1 year ago

The docs describe token_acc as the number of correct tokens divided by the number of gold tokens, but the scorer is actually returning a different value: the result works out to an F-score-like number rather than that ratio (see the reproduction below). I don't fully understand what the formula in the code is doing.

How to reproduce the behaviour

import spacy
from spacy.tokens import Doc
from spacy.training import Example
from spacy.scorer import Scorer

nlp = spacy.load("en_core_web_sm")

# Gold tokenization: five single-character tokens, no trailing spaces.
reference_words = ['a', 'b', 'c', 'd', 'e']
reference_spaces = [False, False, False, False, False]
# Predicted tokenization: only 'a' matches a gold token.
predicted_words = ['a', 'bc', 'de']
predicted_spaces = [False, False, False]

reference = Doc(nlp.vocab, words=reference_words, spaces=reference_spaces)
predicted = Doc(nlp.vocab, words=predicted_words, spaces=predicted_spaces)
example = Example(predicted, reference)

scorer = Scorer(nlp)
scores = scorer.score_tokenization([example])

print(scores)

The result is {'token_acc': 0.5, 'token_p': 0.3333333333333333, 'token_r': 0.2, 'token_f': 0.25}. The _p, _r and _f values are all correct. It is a bit strange to get an accuracy of 0.5 when only 1 predicted token ('a') is correct out of 5 gold tokens.
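For what it's worth, the 0.5 can be reproduced by hand if token_acc is an F-score in which false negatives are never counted, so that recall is pinned at 1.0. This is my own arithmetic from the numbers above, not the scorer source:

tp = 1  # 'a' is the only predicted token that matches a gold token
fp = 2  # 'bc' and 'de' match no gold token
fn = 4  # 'b', 'c', 'd', 'e' are gold tokens with no matching prediction

p = tp / (tp + fp)       # 0.333... -> matches token_p
r = tp / (tp + fn)       # 0.2      -> matches token_r
f = 2 * p * r / (p + r)  # 0.25     -> matches token_f

# With fn dropped, recall is pinned at 1.0 and the F-score becomes
# 2 * p / (p + 1) = 0.5, which matches the reported token_acc.
f_without_fn = 2 * p * 1.0 / (p + 1.0)
print(p, r, f, f_without_fn)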

Your Environment

adrianeboyd commented 1 year ago

Thanks for the report! This looks like a bug in the scorer starting in v3.0 and an incorrect description in the docs, which should have "predicted tokens" instead of "gold tokens".

If you count all correct tokens as true positives and all incorrect predicted tokens as false positives, the intended token_acc score is the precision (and this was correct in v2), but v3 is reporting the f-score instead of the precision.

In general, I'd recommend using token_p/r/f instead.
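In the meantime, the intended token_acc can be computed directly. Below is a minimal sketch that scores tokenization as precision over character spans; token_precision is a hypothetical helper written for this thread, not a spaCy API, and unlike the built-in scorer it does not skip whitespace tokens:

from typing import Iterable

from spacy.training import Example

def token_precision(examples: Iterable[Example]) -> float:
    """Fraction of predicted tokens whose character span exactly
    matches a gold token span, i.e. the intended token_acc."""
    tp = fp = 0
    for example in examples:
        gold_spans = {(t.idx, t.idx + len(t)) for t in example.reference}
        for t in example.predicted:
            span = (t.idx, t.idx + len(t))
            if span in gold_spans:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if (tp + fp) else 0.0

For the reproduction above, token_precision([example]) returns 0.3333..., matching token_p and the v2 behaviour.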

toastynews commented 1 year ago

Cool. As you suggested, I am using token_p/r/f, which give me deeper insight than the single accuracy number. There is no urgency to merge if config versioning is a problem.
