Helsinki-NLP / MuCoW

Automatically harvested multilingual contrastive word sense disambiguation test sets for machine translation
Creative Commons Attribution 4.0 International

Tokenization process in LREC version? #1

Open simtony opened 3 years ago

simtony commented 3 years ago

The test set and training set are pre-tokenized, and no description of the tokenization process is provided. Tokenization affects both the performance of off-the-shelf parsers and BLEU computation. For rigorous research, it would be helpful to supply the tokenization script, a detokenization script, or an untokenized version of the train and test sets.

raganato commented 3 years ago

To detokenize the data, you can use the detokenizer script from the Moses project. Here is the link: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/detokenizer.perl
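
For reference, the linked perl script is typically invoked as `perl detokenizer.perl -l en < test.tok.en > test.detok.en`. Below is a minimal sketch of the same step in Python, using the sacremoses package (a Python port of the Moses tokenizer/detokenizer scripts); the file names and the English language code are placeholders, not part of the MuCoW release:

```python
# Minimal detokenization sketch using sacremoses, a Python port of the
# Moses detokenizer. File names and language code are placeholders.
from sacremoses import MosesDetokenizer

md = MosesDetokenizer(lang="en")  # match the language of the data

# Read the pre-tokenized file and write a detokenized version.
with open("test.tok.en") as fin, open("test.detok.en", "w") as fout:
    for line in fin:
        tokens = line.split()  # Moses tokens are whitespace-separated
        fout.write(md.detokenize(tokens) + "\n")
```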