Hi,
Thanks for pointing this out. Yes, there was double tokenization in EASSE in the version that we used: the test sets were tokenized and lowercased, then tokenized again by EASSE, so we matched our system outputs to this.
It should be fixed in later versions if you use turkcorpus_test (instead of turkcorpus_test_legacy) and feed untokenized predictions.
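To make the mismatch concrete, here is a minimal sketch (the sentences are invented, and the 13a import path is assumed from sacrebleu; this is an illustration, not code from this repo):

```python
# Illustration only: raw vs. legacy (pre-tokenized + lowercased) references.
# The 13a import path below is assumed from sacrebleu; adjust if your version differs.
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

tok_13a = Tokenizer13a()

raw_ref = "The iPhone, released in 2007, was Apple's first phone."
legacy_ref = "the iphone , released in 2007 , was apple 's first phone ."  # hypothetical legacy variant

# Applying 13a does not undo the pre-tokenization/lowercasing, so the two
# variants still differ after EASSE's internal tokenization. That is why the
# legacy setup required system outputs that were tokenized/lowercased the same way.
print(tok_13a(raw_ref))
print(tok_13a(legacy_ref))
```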
The untokenized version of our predictions can be found in EASSE; evaluate with `easse evaluate -t turkcorpus_test < easse/resources/data/system_outputs/turkcorpus/test/ACCESS`.
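For anyone who prefers the Python API over the CLI, this is roughly the equivalent call; the `corpus_sari` name and keyword arguments are from my reading of EASSE, so double-check them against the version you install:

```python
# Minimal sketch of scoring raw (untokenized) outputs with EASSE's Python API.
from easse.sari import corpus_sari

orig_sents = ["About 95 species are currently accepted."]       # source sentences
sys_sents = ["About 95 species are accepted."]                   # raw system output
refs_sents = [["About 95 species are currently known."],         # one inner list per reference set
              ["About 95 species are now accepted."]]

# EASSE handles tokenization internally (13a by default, per this thread),
# so raw text goes in directly.
print(corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents))
```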
So do you mean it's better to use the untokenized Turk Corpus and tokenize it later with 13a to calculate SARI on WikiLarge?
By the way, using your command I got 41.381, not 41.87 🤔
Yes, going forward it is better to use raw/untokenized system outputs and raw/untokenized references from turkcorpus, and feed them to the latest version of EASSE.
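As a sketch of that flow (the file paths below are placeholders rather than paths from this repo, and `corpus_sari` is again assumed from EASSE's Python API):

```python
# Hedged sketch: score a file of raw predictions against raw TurkCorpus references.
from easse.sari import corpus_sari

def read_lines(path):
    # One sentence per line, no extra processing: EASSE does the tokenization.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

orig_sents = read_lines("turkcorpus/test.orig")    # placeholder path: raw source sentences
sys_sents = read_lines("my_model/test.pred")       # placeholder path: raw, untokenized predictions
refs_sents = [read_lines(f"turkcorpus/test.ref.{i}") for i in range(8)]  # TurkCorpus has 8 references

print(corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents))
```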
The 41.87 can be obtained by running the https://github.com/facebookresearch/access/blob/master/scripts/evaluate.py script, or by using the https://github.com/facebookresearch/access/blob/master/system_output/test.pred system outputs, but with the older version of EASSE: `pip uninstall easse && pip install easse@git+git://github.com/feralvam/easse.git@090855e73dee5e26ea0cda01d4aa4f51044d9af9`. This was done to match the other tokenized system outputs for the models that we reported in the paper. Since then we have decided that it is better to use untokenized system outputs and let EASSE do the tokenization (plus some fixes to SARI in EASSE).
The 41.38 is the score that you get when using raw system outputs from ACCESS with the latest version of EASSE. It is the score that we reported in our latest paper on multilingual simplification: https://arxiv.org/pdf/2005.00352.pdf (Table 3).
Thanks for your answers
Thanks a lot for your paper. I have a question regarding your evaluation script: why do you tokenize all the data while calculating the SARI score if it's already tokenized? https://github.com/facebookresearch/access/blob/ede11a0fc4bd3e8542c6e6ff49b677a4e0927cbc/access/evaluation/general.py#L28-L31 If the tokenizer is not specified, easse uses 13a underneath, but it seems the data (including the one in turkcorpus_test_legacy) is already tokenized.
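For context, my understanding of the situation being asked about (a hedged paraphrase, not the actual code at those lines) is that the legacy, already-tokenized data is still passed through EASSE's default 13a tokenizer, roughly like this; the `tokenizer` keyword is assumed from this thread, so check the signature in your EASSE version:

```python
# Hedged sketch: already-tokenized, lowercased legacy data scored with the
# default 13a tokenizer applied on top of it.
from easse.sari import corpus_sari

orig_sents = ["about 95 species are currently accepted ."]   # already tokenized + lowercased
sys_sents = ["about 95 species are accepted ."]
refs_sents = [["about 95 species are currently known ."]]

# `tokenizer` defaults to 13a per this thread; passing it explicitly here
# just makes the double-tokenization question visible.
score = corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents,
                    refs_sents=refs_sents, tokenizer="13a")
print(score)
```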