I'm trying to use the model trained on Google Billion Words but I get the following error when trying to preprocess an input sentence using the dict.txt from the downloaded zip:
```
$ python preprocess.py --only-source --testpref data-bin/my_csk/my_csk.test.json --destdir data-bin/my_csk/my_csk_gwb --srcdict models/gbw_fconv_lm/dict.txt
Namespace(alignfile=None, destdir='data-bin/my_csk/my_csk_gwb', joined_dictionary=False, nwordssrc=-1, nwordstgt=-1, only_source=True, output_format='binary', padding_factor=8, source_lang=None, srcdict='models/gbw_fconv_lm/dict.txt', target_lang=None, testpref='data-bin/my_csk/my_csk.test.json', tgtdict=None, thresholdsrc=0, thresholdtgt=0, trainpref=None, validpref=None, workers=1)
Traceback (most recent call last):
  File "/raid/data/oanuru/my_fairseq/my_fairseq/fairseq/data/dictionary.py", line 169, in load
    return cls.load(fd)
  File "/raid/data/oanuru/my_fairseq/my_fairseq/fairseq/data/dictionary.py", line 180, in load
    for line in f.readlines():
  File "/data/dgx1/oanuru/anaconda3/envs/fairseq/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 10: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "preprocess.py", line 334, in
    main(args)
  File "preprocess.py", line 98, in main
    src_dict = dictionary.Dictionary.load(args.srcdict)
  File "/raid/data/oanuru/my_fairseq/my_fairseq/fairseq/data/dictionary.py", line 177, in load
    "rebuild the dataset".format(f))
Exception: Incorrect encoding detected in models/gbw_fconv_lm/dict.txt, please rebuild the dataset
```
I didn't have this problem when using the Wiki dictionary.
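To confirm the problem is in the file itself rather than in fairseq, I scanned the dictionary for bytes that aren't valid UTF-8 with a small standalone script (not part of fairseq; the path is from my setup):

```python
def find_invalid_utf8(path):
    """Return (offset, byte) pairs where UTF-8 decoding of the file fails."""
    with open(path, "rb") as f:
        data = f.read()
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the file decodes cleanly
        except UnicodeDecodeError as e:
            # e.start is relative to the slice we tried to decode
            bad.append((pos + e.start, data[pos + e.start]))
            pos += e.start + 1  # skip past the offending byte and continue
    return bad

if __name__ == "__main__":
    for offset, byte in find_invalid_utf8("models/gbw_fconv_lm/dict.txt"):
        print(f"invalid byte 0x{byte:02x} at offset {offset}")
```

For the GBW dict.txt this reports the 0x8e byte from the traceback (and any others), whereas the Wiki dictionary comes back clean.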
Thanks, Oana