allenai / bilm-tf

Tensorflow implementation of contextualized word representations from bi-directional language models
Apache License 2.0
1.62k stars 452 forks source link

Get a Unicode Decode error from dump_bilm_embeddings! #152

Closed ahzz1207 closed 5 years ago

ahzz1207 commented 5 years ago

When I just copy the usage_cache.py file to test how to get the embedding vectors from Elmo, I get the error like "UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 0: illegal multibyte sequence".

Traceback (most recent call last): File "c:\Users\loading\Desktop\bilm-tf-master\elmotest.py", line 31, in vocab_file, dataset_file, options_file, weight_file, embedding_file File "c:\Users\loading\Desktop\bilm-tf-master\bilm\model.py", line 649, in dump_bilm_embeddings vocab = UnicodeCharsVocabulary(vocab_file, max_word_length) File "c:\Users\loading\Desktop\bilm-tf-master\bilm\data.py", line 117, in init super(UnicodeCharsVocabulary, self).init(filename, **kwargs) File "c:\Users\loading\Desktop\bilm-tf-master\bilm\data.py", line 29, in init for line in f: UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

bigbizze commented 5 years ago

Check that your input files are correct, I accidentally used one of the binary dump files from training the initial weights for one of my inputs and it threw me incessant decode errors.