facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Incorrect encoding detected in models/gbw_fconv_lm/dict.txt, please rebuild the dataset #437

Closed OanaMariaCamburu closed 5 years ago

OanaMariaCamburu commented 5 years ago

Hi,

I'm trying to use the model trained on Google Billion Words, but I get the following error when preprocessing an input sentence with the dict.txt from the downloaded zip:

$ python preprocess.py --only-source --testpref data-bin/my_csk/my_csk.test.json --destdir data-bin/my_csk/my_csk_gwb --srcdict models/gbw_fconv_lm/dict.txt
Namespace(alignfile=None, destdir='data-bin/my_csk/my_csk_gwb', joined_dictionary=False, nwordssrc=-1, nwordstgt=-1, only_source=True, output_format='binary', padding_factor=8, source_lang=None, srcdict='models/gbw_fconv_lm/dict.txt', target_lang=None, testpref='data-bin/my_csk/my_csk.test.json', tgtdict=None, thresholdsrc=0, thresholdtgt=0, trainpref=None, validpref=None, workers=1)
Traceback (most recent call last):
  File "/raid/data/oanuru/my_fairseq/my_fairseq/fairseq/data/dictionary.py", line 169, in load
    return cls.load(fd)
  File "/raid/data/oanuru/my_fairseq/my_fairseq/fairseq/data/dictionary.py", line 180, in load
    for line in f.readlines():
  File "/data/dgx1/oanuru/anaconda3/envs/fairseq/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 10: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "preprocess.py", line 334, in <module>
    main(args)
  File "preprocess.py", line 98, in main
    src_dict = dictionary.Dictionary.load(args.srcdict)
  File "/raid/data/oanuru/my_fairseq/my_fairseq/fairseq/data/dictionary.py", line 177, in load
    "rebuild the dataset".format(f))
Exception: Incorrect encoding detected in models/gbw_fconv_lm/dict.txt, please rebuild the dataset
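For reference, the failure can be reproduced without fairseq by scanning for the first line of the file that is not valid UTF-8. A minimal sketch (the path is the one passed to --srcdict above):

# Minimal sketch: report the first line of the dict that fails to decode as UTF-8.
# The path is the --srcdict argument from the preprocess.py command above.
path = "models/gbw_fconv_lm/dict.txt"

with open(path, "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            print(f"line {lineno}: {err}")  # e.g. invalid start byte 0x8e
            break
    else:
        print("file decodes cleanly as UTF-8")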

I didn't have this problem when using the Wiki dictionary.

Thanks, Oana

alexeib commented 5 years ago

Looks like the file is corrupted; the dict in the archive is 4.0 GB! I'll update the archive. Meanwhile, here's the dict on its own: https://dl.fbaipublicfiles.com/fairseq/data/gbw/dict.txt
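If you want to sanity-check the standalone file before re-running preprocess.py, the sketch below downloads it and loads it with the same Dictionary class that appears in the traceback (assuming the fairseq version in this thread, where fairseq.data.dictionary.Dictionary.load accepts a filename):

# Sketch: fetch the standalone dict and confirm fairseq parses it without the
# encoding error. Assumes Dictionary.load accepts a path, as in the traceback.
import urllib.request

from fairseq.data.dictionary import Dictionary

url = "https://dl.fbaipublicfiles.com/fairseq/data/gbw/dict.txt"
urllib.request.urlretrieve(url, "models/gbw_fconv_lm/dict.txt")

d = Dictionary.load("models/gbw_fconv_lm/dict.txt")  # raises if still corrupted
print(f"loaded {len(d)} symbols")  # GBW vocabulary is on the order of 800k words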

OanaMariaCamburu commented 5 years ago

Thanks!