facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Inconsistent results of IWSLT data preprocessing between fairseq 0.6 and 0.9 #2787

Closed: haolibai closed this issue 4 years ago

haolibai commented 4 years ago

❓ Questions and Help

What is your question?

I am running two versions of fairseq for neural machine translation, 0.6 and 0.9, and I find that the IWSLT data preprocessing results are inconsistent between them. I prepared the dataset following the guide in ./examples/translation/README.md, which includes:

1) ./examples/translation/prepare-iwslt14.sh (for IWSLT-14)
2) data binarization (roughly the command sketched below).
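For reference, the binarization step I ran is essentially the command from the README (the paths assume the default output of prepare-iwslt14.sh; on fairseq 0.6 the entry point may be `python preprocess.py` rather than `fairseq-preprocess`):

```bash
# Binarize the tokenized IWSLT14 de-en data, as in examples/translation/README.md
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.tokenized.de-en \
    --workers 20
```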

However, the results of the two fairseq versions are different. For example, fairseq 0.6 produces 42MB of binarized IWSLT data, while fairseq 0.9 produces only 21MB.

IWSLT processed by fairseq 0.6: 42MB (screenshot of the binarized files)

IWSLT processed by fairseq 0.9: 21MB (screenshot of the binarized files)

myleott commented 4 years ago

This is expected. Newer versions of fairseq use a more efficient mmap data format that requires less disk space. See this PR: https://github.com/pytorch/fairseq/pull/816
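For anyone comparing the two outputs side by side, here is a minimal sketch of how the older on-disk layout can still be produced with fairseq 0.9, assuming the --dataset-impl option introduced around that PR is available ('mmap' is the newer default; 'cached' or 'lazy' should correspond to the legacy IndexedDataset format that 0.6 wrote). The token content is the same either way; only the on-disk encoding, and hence the file size, differs.

```bash
# Sketch: binarize with the legacy (non-mmap) format under fairseq 0.9.
# --dataset-impl cached is assumed to select the old IndexedDataset layout;
# the 0.9 default is mmap, which is what produces the smaller files.
TEXT=examples/translation/iwslt14.tokenized.de-en
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir data-bin/iwslt14.de-en.legacy \
    --dataset-impl cached \
    --workers 20
```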