Hi,
Thanks for pointing this out. Yes, there was double tokenization in EASSE in the version that we used: the test sets were tokenized and lowercased, then tokenized again by EASSE, so we matched our system outputs to this.
It should be fixed in later versions if you use turkcorpus_test (instead of turkcorpus_test_legacy) and feed untokenized predictions.
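To make the mismatch concrete, here is a minimal sketch (the sentences are invented, and the 13a import path is assumed from sacrebleu; this is an illustration, not code from this repo):

```python
# Illustration only: raw vs. legacy (pre-tokenized + lowercased) references.
# The 13a import path below is assumed from sacrebleu; adjust if your version differs.
from sacrebleu.tokenizers.tokenizer_13a import Tokenizer13a

tok_13a = Tokenizer13a()

raw_ref = "The iPhone, released in 2007, was Apple's first phone."
legacy_ref = "the iphone , released in 2007 , was apple 's first phone ."  # hypothetical legacy variant

# Applying 13a does not undo the pre-tokenization/lowercasing, so the two
# variants still differ after EASSE's internal tokenization. That is why the
# legacy setup required system outputs that were tokenized/lowercased the same way.
print(tok_13a(raw_ref))
print(tok_13a(legacy_ref))
```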
The untokenized version of our predictions can be found in EASSE; evaluate with `easse evaluate -t turkcorpus_test < easse/resources/data/system_outputs/turkcorpus/test/ACCESS`.
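For anyone who prefers the Python API over the CLI, this is roughly the equivalent call; the `corpus_sari` name and keyword arguments are from my reading of EASSE, so double-check them against the version you install:

```python
# Minimal sketch of scoring raw (untokenized) outputs with EASSE's Python API.
from easse.sari import corpus_sari

orig_sents = ["About 95 species are currently accepted."]       # source sentences
sys_sents = ["About 95 species are accepted."]                   # raw system output
refs_sents = [["About 95 species are currently known."],         # one inner list per reference set
              ["About 95 species are now accepted."]]

# EASSE handles tokenization internally (13a by default, per this thread),
# so raw text goes in directly.
print(corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents))
```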
So do you mean it's better to use the untokenized Turk Corpus and tokenize it later with 13a to calculate SARI on WikiLarge?
By the way, using your command I got 41.381, not 41.87 🤔
Yes, going forward it is better to use raw/untokenized system outputs and raw/untokenized references from turkcorpus, and feed them to the latest version of EASSE.
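As a sketch of that flow (the file paths below are placeholders rather than paths from this repo, and `corpus_sari` is again assumed from EASSE's Python API):

```python
# Hedged sketch: score a file of raw predictions against raw TurkCorpus references.
from easse.sari import corpus_sari

def read_lines(path):
    # One sentence per line, no extra processing: EASSE does the tokenization.
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

orig_sents = read_lines("turkcorpus/test.orig")    # placeholder path: raw source sentences
sys_sents = read_lines("my_model/test.pred")       # placeholder path: raw, untokenized predictions
refs_sents = [read_lines(f"turkcorpus/test.ref.{i}") for i in range(8)]  # TurkCorpus has 8 references

print(corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents))
```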
The 41.87 can be obtained by running the https://github.com/facebookresearch/access/blob/master/scripts/evaluate.py script, or by using the https://github.com/facebookresearch/access/blob/master/system_output/test.pred system outputs, but with the older version of EASSE: `pip uninstall easse && pip install easse@git+git://github.com/feralvam/easse.git@090855e73dee5e26ea0cda01d4aa4f51044d9af9`. This was done to match the other tokenized system outputs for the models that we reported in the paper. Since then we have decided that it is better to use untokenized system outputs and let EASSE do the tokenization (plus some fixes to SARI in EASSE).
The 41.38 is the score that you get when using raw system outputs from ACCESS with the latest version of EASSE. It is the score that we reported in our latest paper on multilingual simplification: https://arxiv.org/pdf/2005.00352.pdf (Table 3).
Thanks for your answers
Thanks a lot for your paper. I have a question regarding your evaluation script: why do you tokenize all the data while calculating the SARI score if it's already tokenized? https://github.com/facebookresearch/access/blob/ede11a0fc4bd3e8542c6e6ff49b677a4e0927cbc/access/evaluation/general.py#L28-L31 If the tokenizer is not specified, easse uses 13a underneath, but it seems the data (including the one in turkcorpus_test_legacy) is already tokenized.
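For context, my understanding of the situation being asked about (a hedged paraphrase, not the actual code at those lines) is that the legacy, already-tokenized data is still passed through EASSE's default 13a tokenizer, roughly like this; the `tokenizer` keyword is assumed from this thread, so check the signature in your EASSE version:

```python
# Hedged sketch: already-tokenized, lowercased legacy data scored with the
# default 13a tokenizer applied on top of it.
from easse.sari import corpus_sari

orig_sents = ["about 95 species are currently accepted ."]   # already tokenized + lowercased
sys_sents = ["about 95 species are accepted ."]
refs_sents = [["about 95 species are currently known ."]]

# `tokenizer` defaults to 13a per this thread; passing it explicitly here
# just makes the double-tokenization question visible.
score = corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents,
                    refs_sents=refs_sents, tokenizer="13a")
print(score)
```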