All of the data we used is pre-processed only by tokenization and BPE. For tokenization, we used the tokenizer script in Moses. For BPE, we used the released subword-nmt toolkit. We did no other pre-processing.
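For concreteness, here is a minimal sketch of that tokenization step. The checkout paths and the language codes (English–German) are assumptions for illustration, not stated in the thread:

```bash
# Hypothetical local checkout of Moses (https://github.com/moses-smt/mosesdecoder);
# adjust the path and language codes to your setup.
MOSES=~/mosesdecoder

# Tokenize each side of the parallel corpus with the Moses tokenizer.
perl $MOSES/scripts/tokenizer/tokenizer.perl -l en < train.en > train.tok.en
perl $MOSES/scripts/tokenizer/tokenizer.perl -l de < train.de > train.tok.de
```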
Thanks!
Hi, another question regarding the BPE: did you use the learn_bpe.py script or learn_joint_bpe_and_vocab.py? And did you use the default number of merge operations (10000), or did you change it?
Thanks! Zohar
We used learn_bpe.py, and the number of merge operations was set to 30000.
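A minimal sketch of that BPE step with subword-nmt, continuing from the tokenized files above. Whether the codes were learned per language or on a concatenated corpus isn't stated in the thread; this sketch learns them per side, and the checkout path is an assumption:

```bash
# Hypothetical local checkout of subword-nmt (https://github.com/rsennrich/subword-nmt).
SUBWORD=~/subword-nmt

# Learn 30000 merge operations (learn_bpe.py, not learn_joint_bpe_and_vocab.py).
python $SUBWORD/learn_bpe.py -s 30000 < train.tok.en > bpe.codes.en

# Apply the learned codes to the tokenized corpus.
python $SUBWORD/apply_bpe.py -c bpe.codes.en < train.tok.en > train.bpe.en
```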
Hi, can you upload your pre-processing script or add information about which scripts you used? For example, did you use true-casing, tokenization, removal of unusual characters, or sentence-alignment checks?
Thanks!