mmmaat closed this issue 6 years ago
Token evaluation is positional:
>>> from wordseg.evaluate import evaluate
>>> gold = ['ice ice cream is icecream']
>>> text = ['ice icecream is ice cream']
>>> evaluate(text, gold)
OrderedDict([('token_precision', 0.4),
             ('token_recall', 0.4),
             ('token_fscore', 0.4),
             ('type_precision', 1.0),
             ('type_recall', 1.0),
             ('type_fscore', 1.0),
             ...
Some clarification is needed regarding the “token” evaluation metric. Specifically, does it only check the expected token counts for an utterance, or does it also check token positions? For example, consider the gold segmentation “ice ice cream is icecream” and the system output “ice icecream is ice cream”. If evaluation is only over expected token counts, the output is 100% correct (ice: 2, cream: 1, icecream: 1, is: 1 in both). However, if the scorer also checks position (e.g., the final word must be “icecream”), the system output is not treated as fully correct. I would expect the “token” metric to compute the latter, but the documentation should be more explicit about how type and token performance are computed.
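For reference, positional token scoring can be sketched as follows. This is an illustrative reimplementation, not wordseg's actual code: a token counts as correct only if both its start and end character offsets (ignoring spaces) coincide with a gold token's offsets.

```python
# Illustrative sketch (assumption: not wordseg's internal implementation).
# A token is a hit only if its exact character span matches a gold span.

def spans(utterance):
    """Map each token to its (start, end) character span, ignoring spaces."""
    result, pos = set(), 0
    for token in utterance.split():
        result.add((pos, pos + len(token)))
        pos += len(token)
    return result

def token_scores(text, gold):
    """Positional token precision, recall and f-score via span intersection."""
    t, g = spans(text), spans(gold)
    hits = len(t & g)
    precision = hits / len(t)
    recall = hits / len(g)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# gold 'ice ice cream is icecream' -> spans {(0,3),(3,6),(6,11),(11,13),(13,21)}
# text 'ice icecream is ice cream' -> spans {(0,3),(3,11),(11,13),(13,16),(16,21)}
# only 'ice' (0,3) and 'is' (11,13) align, so precision = recall = 2/5 = 0.4
print(token_scores('ice icecream is ice cream', 'ice ice cream is icecream'))
# -> (0.4, 0.4, 0.4)
```

This reproduces the 0.4 token scores shown above, while the type scores stay at 1.0 because both segmentations use the same vocabulary {ice, cream, is, icecream}.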