JunjieHu / dali

Domain Adaptation of Neural Machine Translation by Lexicon Induction

Explaining the dataset files #2

Closed jayelm closed 4 years ago

jayelm commented 4 years ago

Hi, can you explain what the dataset files mean? In particular, I'm not sure about the difference between bpe.clean.en and bpe.en, and why the clean files exist only for the training splits.

JunjieHu commented 4 years ago

Hi @jayelm, the bpe.clean.{de,en} files have been further filtered by mosesdecoder's clean script, and the bpe.{de,en} files are the original files before filtering. It is conventional to apply this filtering only to the training set, to avoid training on very long sentences or on sentence pairs with an unusual length ratio.
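
For readers unfamiliar with this kind of cleaning, the idea can be sketched in a few lines of Python. This is only an illustration of length and length-ratio filtering on a parallel corpus, not the Moses script itself; the function names and the thresholds (`min_len`, `max_len`, `max_ratio`) are hypothetical placeholders, not the values used to build this dataset.

```python
def keep_pair(src, tgt, min_len=1, max_len=80, max_ratio=9.0):
    """Return True if a sentence pair passes the length and ratio checks.

    All thresholds are illustrative assumptions, not the dataset's settings.
    """
    n_src = len(src.split())  # token count on the source side
    n_tgt = len(tgt.split())  # token count on the target side
    # Drop pairs where either side is too short or too long.
    if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
        return False
    # An empty side makes the ratio undefined, so drop the pair.
    if min(n_src, n_tgt) == 0:
        return False
    # Drop pairs whose lengths differ by more than the allowed ratio.
    return max(n_src, n_tgt) / min(n_src, n_tgt) <= max_ratio


def clean_corpus(src_lines, tgt_lines, **kwargs):
    """Filter two aligned lists of sentences, keeping them aligned."""
    kept = [(s, t) for s, t in zip(src_lines, tgt_lines) if keep_pair(s, t, **kwargs)]
    return [s for s, _ in kept], [t for _, t in kept]
```

Both sides are filtered together so that line i of the source file still corresponds to line i of the target file. Dev and test sets are normally left unfiltered so that evaluation covers every sentence.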

jayelm commented 4 years ago

Thanks!