explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Scorer.score_tokenization token_acc calculation not as documented #12033

Closed toastynews closed 1 year ago

toastynews commented 1 year ago

The docs describe token_acc as the number of correct tokens divided by the number of gold tokens, but the scorer is actually returning a different value: the result works out to an F-score-like number rather than that ratio (see the reproduction below). I don't fully understand what the formula in the code is doing.

How to reproduce the behaviour

import spacy
from spacy.tokens import Doc
from spacy.training import Example
from spacy.scorer import Scorer

nlp = spacy.load("en_core_web_sm")

# Gold tokenization: five single-character tokens, no trailing spaces.
reference_words = ['a', 'b', 'c', 'd', 'e']
reference_spaces = [False, False, False, False, False]
# Predicted tokenization: only 'a' matches a gold token.
predicted_words = ['a', 'bc', 'de']
predicted_spaces = [False, False, False]

reference = Doc(nlp.vocab, words=reference_words, spaces=reference_spaces)
predicted = Doc(nlp.vocab, words=predicted_words, spaces=predicted_spaces)
example = Example(predicted, reference)

scorer = Scorer(nlp)
scores = scorer.score_tokenization([example])

print(scores)

The result is {'token_acc': 0.5, 'token_p': 0.3333333333333333, 'token_r': 0.2, 'token_f': 0.25}. The _p, _r and _f values are all correct. It is a bit strange to get an accuracy of 0.5 when only 1 predicted token ('a') is correct out of 5 gold tokens.
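For what it's worth, the 0.5 can be reproduced by hand if token_acc is an F-score in which false negatives are never counted, so that recall is pinned at 1.0. This is my own arithmetic from the numbers above, not the scorer source:

tp = 1  # 'a' is the only predicted token that matches a gold token
fp = 2  # 'bc' and 'de' match no gold token
fn = 4  # 'b', 'c', 'd', 'e' are gold tokens with no matching prediction

p = tp / (tp + fp)       # 0.333... -> matches token_p
r = tp / (tp + fn)       # 0.2      -> matches token_r
f = 2 * p * r / (p + r)  # 0.25     -> matches token_f

# With fn dropped, recall is pinned at 1.0 and the F-score becomes
# 2 * p / (p + 1) = 0.5, which matches the reported token_acc.
f_without_fn = 2 * p * 1.0 / (p + 1.0)
print(p, r, f, f_without_fn)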

Your Environment

adrianeboyd commented 1 year ago

Thanks for the report! This looks like a bug in the scorer starting in v3.0 and an incorrect description in the docs, which should have "predicted tokens" instead of "gold tokens".

If you count all correct tokens as true positives and all incorrect predicted tokens as false positives, the intended token_acc score is the precision (and this was correct in v2), but v3 is reporting the f-score instead of the precision.

In general, I'd recommend using token_p/r/f instead.
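In the meantime, the intended token_acc can be computed directly. Below is a minimal sketch that scores tokenization as precision over character spans; token_precision is a hypothetical helper written for this thread, not a spaCy API, and unlike the built-in scorer it does not skip whitespace tokens:

from typing import Iterable

from spacy.training import Example

def token_precision(examples: Iterable[Example]) -> float:
    """Fraction of predicted tokens whose character span exactly
    matches a gold token span, i.e. the intended token_acc."""
    tp = fp = 0
    for example in examples:
        gold_spans = {(t.idx, t.idx + len(t)) for t in example.reference}
        for t in example.predicted:
            span = (t.idx, t.idx + len(t))
            if span in gold_spans:
                tp += 1
            else:
                fp += 1
    return tp / (tp + fp) if (tp + fp) else 0.0

For the reproduction above, token_precision([example]) returns 0.3333..., matching token_p and the v2 behaviour.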

toastynews commented 1 year ago

Cool. As you suggested, I am using token_p/r/f, which give me deeper insight than the single accuracy number. There is no urgency to merge if config versioning is a problem.
