mmmaat closed this issue 6 years ago
Token evaluation is positional:
>>> from wordseg.evaluate import evaluate
>>> gold = ['ice ice cream is icecream']
>>> text = ['ice icecream is ice cream']
>>> evaluate(text, gold)
OrderedDict([('token_precision', 0.4),
             ('token_recall', 0.4),
             ('token_fscore', 0.4),
             ('type_precision', 1.0),
             ('type_recall', 1.0),
             ('type_fscore', 1.0),
             ...
Some clarification is needed regarding the “token” evaluation metric. Specifically, does it only check the expected token counts for an utterance, or does it also check token positions? For example, consider the gold segmentation “ice ice cream is icecream” and the system output “ice icecream is ice cream”. If evaluation is only over expected token counts, the output is 100% correct (ice: 2, cream: 1, icecream: 1, is: 1 in both). However, if the scorer also checks position (e.g., the final word must be “icecream”), the system output is not treated as fully correct. I would expect the “token” metric to compute the latter, but the documentation should be more explicit about how type and token performance are computed.
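For reference, positional token scoring can be sketched as follows. This is an illustrative reimplementation, not wordseg's actual code: a token counts as correct only if both its start and end character offsets (ignoring spaces) coincide with a gold token's offsets.

```python
# Illustrative sketch (assumption: not wordseg's internal implementation).
# A token is a hit only if its exact character span matches a gold span.

def spans(utterance):
    """Map each token to its (start, end) character span, ignoring spaces."""
    result, pos = set(), 0
    for token in utterance.split():
        result.add((pos, pos + len(token)))
        pos += len(token)
    return result

def token_scores(text, gold):
    """Positional token precision, recall and f-score via span intersection."""
    t, g = spans(text), spans(gold)
    hits = len(t & g)
    precision = hits / len(t)
    recall = hits / len(g)
    fscore = 2 * precision * recall / (precision + recall)
    return precision, recall, fscore

# gold 'ice ice cream is icecream' -> spans {(0,3),(3,6),(6,11),(11,13),(13,21)}
# text 'ice icecream is ice cream' -> spans {(0,3),(3,11),(11,13),(13,16),(16,21)}
# only 'ice' (0,3) and 'is' (11,13) align, so precision = recall = 2/5 = 0.4
print(token_scores('ice icecream is ice cream', 'ice ice cream is icecream'))
# -> (0.4, 0.4, 0.4)
```

This reproduces the 0.4 token scores shown above, while the type scores stay at 1.0 because both segmentations use the same vocabulary {ice, cream, is, icecream}.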