All of the data we used is pre-processed only by tokenization and BPE. For tokenization, we used the tokenizer script in Moses. For BPE, we used the released subword-nmt toolkit. We did no other pre-processing.
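For concreteness, here is a minimal sketch of that tokenization step. The checkout paths and the language codes (English–German) are assumptions for illustration, not stated in the thread:

```bash
# Hypothetical local checkout of Moses (https://github.com/moses-smt/mosesdecoder);
# adjust the path and language codes to your setup.
MOSES=~/mosesdecoder

# Tokenize each side of the parallel corpus with the Moses tokenizer.
perl $MOSES/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
perl $MOSES/scripts/tokenizer/tokenizer.perl -l de < train.de > train.tok.de
```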
Thanks!
Hi, another question regarding the BPE: did you use the learn_bpe.py script or learn_joint_bpe_and_vocab.py? And did you use the default number of merge operations (10000), or did you change it?
Thanks! Zohar
We used learn_bpe.py, and the number of merge operations was set to 30000.
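A minimal sketch of that BPE step with subword-nmt, continuing from the tokenized files above. Whether the codes were learned per language or on a concatenated corpus isn't stated in the thread; this sketch learns them per side, and the checkout path is an assumption:

```bash
# Hypothetical local checkout of subword-nmt (https://github.com/rsennrich/subword-nmt).
SUBWORD=~/subword-nmt

# Learn 30000 merge operations (learn_bpe.py, not learn_joint_bpe_and_vocab.py).
python $SUBWORD/learn_bpe.py -s 30000 < train.tok.en > bpe.codes.en

# Apply the learned codes to the tokenized corpus.
python $SUBWORD/apply_bpe.py -c bpe.codes.en < train.tok.en > train.bpe.en
```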
Hi, can you upload your pre-processing script or add information about which scripts you used? For example, did you use true-casing, tokenization, removal of unusual characters, or sentence-alignment checks?
Thanks!