facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Inconsistent results of IWSLT data preprocessing between fairseq 0.6 and 0.9 #2787

Closed: haolibai closed this issue 4 years ago

haolibai commented 4 years ago

❓ Questions and Help

What is your question?

I am running two versions of fairseq for neural machine translation, 0.6 and 0.9, and I find that the IWSLT data preprocessing results are inconsistent between them. I prepared the dataset following the guide in ./examples/translation/README.md, which includes:

1) ./examples/translation/prepare-iwslt14.sh (for IWSLT-14)
2) data binarization (roughly the command sketched below).
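For reference, the binarization step I ran is essentially the command from the README (the paths assume the default output of prepare-iwslt14.sh; on fairseq 0.6 the entry point may be `python preprocess.py` rather than `fairseq-preprocess`):

```bash
# Binarize the tokenized IWSLT14 de-en data, as in examples/translation/README.md
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 20
```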

However, the results of the two fairseq versions are different. For example, fairseq 0.6 produces 42MB of binarized IWSLT data, while fairseq 0.9 produces only 21MB.

IWSLT processed by fairseq 0.6: 42MB (screenshot of the binarized files)

IWSLT processed by fairseq 0.9: 21MB (screenshot of the binarized files)

myleott commented 4 years ago

This is expected. Newer versions of fairseq use a more efficient mmap data format that requires less disk space. See this PR: https://github.com/pytorch/fairseq/pull/816
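For anyone comparing the two outputs side by side, here is a minimal sketch of how the older on-disk layout can still be produced with fairseq 0.9, assuming the --dataset-impl option introduced around that PR is available ('mmap' is the newer default; 'cached' or 'lazy' should correspond to the legacy IndexedDataset format that 0.6 wrote). The token content is the same either way; only the on-disk encoding, and hence the file size, differs.

```bash
# Sketch: binarize with the legacy (non-mmap) format under fairseq 0.9.
# --dataset-impl cached is assumed to select the old IndexedDataset layout;
# the 0.9 default is mmap, which is what produces the smaller files.
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.de-en.legacy \
    --dataset-impl cached \
    --workers 20
```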