allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0
1.62k stars, 452 forks

Unicode Decode Error #150

Closed AlonAizescu closed 5 years ago

AlonAizescu commented 5 years ago

Hi, I tried to run the "1 Billion Word Benchmark" example and I got the following error message:

C:\Users\Alon\Anaconda3\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
  from ._conv import register_converters as _register_converters
Traceback (most recent call last):
  File "bilm-tf-master/bin/train_elmo.py", line 73, in <module>
    main(args)
  File "bilm-tf-master/bin/train_elmo.py", line 12, in main
    vocab = load_vocab(args.vocab_file, 50)
  File "C:\Users\Alon\Anaconda3\lib\site-packages\bilm-0.1.post5-py3.6.egg\bilm\training.py", line 1060, in load_vocab
    validate_file=True)
  File "C:\Users\Alon\Anaconda3\lib\site-packages\bilm-0.1.post5-py3.6.egg\bilm\data.py", line 117, in __init__
    super(UnicodeCharsVocabulary, self).__init__(filename, **kwargs)
  File "C:\Users\Alon\Anaconda3\lib\site-packages\bilm-0.1.post5-py3.6.egg\bilm\data.py", line 29, in __init__
    for line in f:
  File "C:\Users\Alon\Anaconda3\lib\encodings\cp1255.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 1: character maps to <undefined>

zxy951005 commented 5 years ago

my error message is: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 615: invalid start byte
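The two reports above fail under different codecs (cp1255 and utf-8), so it can help to see exactly which byte a given codec rejects. Below is a minimal sketch that works on raw bytes. `find_decode_error` is a hypothetical helper for illustration, not part of bilm.

```python
def find_decode_error(data, encoding):
    """Return (position, byte value) of the first byte that `encoding`
    cannot decode, or None if `data` decodes cleanly.

    Hypothetical diagnostic helper, not part of bilm.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte in exc.object.
        return exc.start, exc.object[exc.start]
    return None


if __name__ == "__main__":
    # 0xff is never valid in UTF-8, and the traceback above shows that
    # cp1255 has no mapping for it either.
    print(find_decode_error(b"ab\xffcd", "utf-8"))
    print(find_decode_error(b"a\xff", "cp1255"))
    print(find_decode_error("café".encode("utf-8"), "utf-8"))
```

To check a real file, read it with `open(path, "rb")` and pass the bytes in. If UTF-8 itself rejects a byte, as in the second report, the file is probably not UTF-8 text, and changing the default encoding alone may not be enough.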

TheresaSchmidt commented 4 years ago

It helped me to specify the encoding explicitly in the open() call (ln. 27) of the file, i.e. to use with open(filename, encoding="utf-8") as f: instead of with open(filename) as f:
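As a rough illustration of that change, here is a self-contained sketch that writes a small UTF-8 vocabulary file and reads it back with an explicit encoding. `read_vocab_lines` and the sample tokens are made up for this example and are not the actual bilm code.

```python
import os
import tempfile


def read_vocab_lines(filename, encoding="utf-8"):
    """Read one token per line, decoding with an explicit encoding.

    Hypothetical stand-in for the vocabulary loading loop; passing
    `encoding` means the result does not depend on the machine's locale.
    """
    with open(filename, encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Sample tokens are an assumption for the demo, including non-ASCII text.
tokens = ["<S>", "</S>", "café", "naïve", "日本語"]

fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
    f.write("\n".join(tokens) + "\n")

loaded = read_vocab_lines(path)
os.remove(path)
print(loaded == tokens)
```

Without the `encoding` argument, `open()` in text mode falls back to a platform-dependent default, which is how cp1255 ended up in the first traceback.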

ruleGreen commented 4 years ago

Is this solved? When I run the test script, I also get the Unicode decode error.

ruleGreen commented 4 years ago

export LC_ALL=en_US.UTF-8

This solved the problem in my case.

TheresaSchmidt commented 4 years ago

> export LC_ALL=en_US.UTF-8
> This can solve this problem in my case.

The country / language code would have to be different for each language, wouldn’t it?
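One way to look into this is to ask Python which encoding it derives from the current locale. This is a sketch, not a definitive answer: on the Python versions discussed in this thread, `open()` without an `encoding` argument uses `locale.getpreferredencoding(False)`, so what seems to matter is the codeset part of the locale (the `.UTF-8` suffix), not the language or country part. `default_text_encoding` and `describe_environment` are hypothetical helper names.

```python
import codecs
import locale
import sys


def default_text_encoding():
    """Normalized name of the encoding text-mode open() falls back to
    when no `encoding` argument is given (locale-derived)."""
    name = locale.getpreferredencoding(False)
    # codecs.lookup() normalizes aliases, e.g. "UTF-8" -> "utf-8".
    return codecs.lookup(name).name


def describe_environment():
    """Collect the settings that influence the default text encoding."""
    return {
        "open() default": default_text_encoding(),
        # sys.flags.utf8_mode only exists on Python 3.7+.
        "utf8_mode": bool(getattr(sys.flags, "utf8_mode", 0)),
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running this before and after changing `LC_ALL` shows whether the setting took effect. On Windows, where the first traceback comes from, `export` is not available; Python 3.7 and later offer the `PYTHONUTF8=1` environment variable as an alternative, though the traceback above is from Python 3.6.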