facebookresearch / InferSent

InferSent sentence embeddings

Unicode Decode Error #95

Open DrappierTechnologies opened 5 years ago

DrappierTechnologies commented 5 years ago

I'm receiving an error when following the encoder demo notebook.

I'm using fastText instead of GloVe on a Windows 10 machine.


UnicodeDecodeError                        Traceback (most recent call last)

in ()
      1 # Load embeddings of K most frequent words
      2
----> 3 model.build_vocab_k_words(K=100000)

D:\...\...\...\...\project\InferSent\models.py in build_vocab_k_words(self, K)
    143     def build_vocab_k_words(self, K):
    144         assert hasattr(self, 'w2v_path'), 'w2v path not set'
--> 145         self.word_vec = self.get_w2v_k(K)
    146         print('Vocab size : %s' % (K))
    147

D:\...\...\...\...\project\InferSent\models.py in get_w2v_k(self, K)
    121         word_vec = {}
    122         with open(self.w2v_path) as f:
--> 123             for line in f:
    124                 word, vec = line.split(' ', 1)
    125                 if k <= K:

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7674: character maps to <undefined>

A quick search reveals a possible encoding issue at [this line](https://github.com/facebookresearch/InferSent/blob/8aaaf40d338286a4ed7ad4d36a7dc26369aef605/models.py#L90), but I'm not entirely certain.

DrappierTechnologies commented 5 years ago

Never mind, installing Visual Studio C++ Build Tools 2015 and pip installing fasttext fixed this problem.

DrappierTechnologies commented 5 years ago

Alright, I take that back. On Windows there seems to be an issue with calling open() without specifying an encoding. PR not incoming. Please fix.
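For anyone trying to confirm this diagnosis, here is a minimal sketch of why open() without an encoding can fail on Windows. It assumes the embeddings file is UTF-8; the sample character (a right double quotation mark) and the helper name `decodes_with` are illustrative assumptions, not something taken from the actual vector file or from models.py.

```python
import locale

# The encoding open() falls back to when no `encoding` argument is passed.
# On Windows this is typically a legacy code page such as cp1252 or gbk.
print(locale.getpreferredencoding(False))

# Assumed example: U+201D (right double quotation mark) encoded as UTF-8
# is the byte sequence e2 80 9d, which ends in the byte 0x9d.
utf8_bytes = "\u201d".encode("utf-8")


def decodes_with(data, encoding):
    """Hypothetical helper: return True if `data` decodes with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodes_with(utf8_bytes, "utf-8"))   # the bytes are valid UTF-8
print(decodes_with(utf8_bytes, "cp1252"))  # byte 0x9d has no mapping in cp1252
```

If the first print shows something other than a UTF-8 variant, reading a UTF-8 file with a bare open() can hit the same kind of error as in the traceback above.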

chuzhifeng commented 5 years ago

Hi, I have the same problem, but I'm using GloVe on Windows 7. Did you fix it? My error is below:

UnicodeDecodeError                        Traceback (most recent call last)

in ()
----> 1 infersent.build_vocab(sentences, tokenize=True)

D:\Code\jupyter\SQuAD-master\InferSent\models.py in build_vocab(self, sentences, tokenize)
    137         assert hasattr(self, 'w2v_path'), 'w2v path not set'
    138         word_dict = self.get_word_dict(sentences, tokenize)
--> 139         self.word_vec = self.get_w2v(word_dict)
    140         print('Vocab size : %s' % (len(self.word_vec)))
    141

D:\Code\jupyter\SQuAD-master\InferSent\models.py in get_w2v(self, word_dict)
    108         word_vec = {}
    109         with open(self.w2v_path) as f:
--> 110             for line in f:
    111                 word, vec = line.split(' ', 1)
    112                 if word in word_dict:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xa2 in position 1389: illegal multibyte sequence

DrappierTechnologies commented 5 years ago

@chuzhifeng Unfortunately, this repo isn't open for PRs, so all we can do is work around the issue. As a workaround, I modified models.py at lines 109 and 122: both originally read `with open(self.w2v_path) as f:`, and I changed them to `with open(self.w2v_path, encoding="utf-8") as f:`.
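A standalone sketch of the same change, for illustration only. The function name `load_word_vectors`, the demo file, and the float parsing are assumptions; only the `open(..., encoding="utf-8")` call and the `line.split(' ', 1)` / `if word in word_dict` lines mirror what the tracebacks show of models.py.

```python
import os
import tempfile


def load_word_vectors(w2v_path, word_dict):
    """Read 'word v1 v2 ...' lines, keeping only words found in word_dict."""
    word_vec = {}
    # Explicit encoding, so the result does not depend on the Windows code page.
    with open(w2v_path, encoding="utf-8") as f:
        for line in f:
            word, vec = line.split(" ", 1)
            if word in word_dict:
                # Parsing into a list of floats is an assumption for this demo.
                word_vec[word] = [float(x) for x in vec.split()]
    return word_vec


# Build a tiny demo file that contains a non-ASCII token.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".vec", delete=False
) as tmp:
    tmp.write("hello 0.1 0.2\n\u201d 0.3 0.4\n")
    demo_path = tmp.name

vectors = load_word_vectors(demo_path, {"hello": "", "\u201d": ""})
os.remove(demo_path)
print(sorted(vectors))
```

The same one-argument change applies to every place the vector file is opened, which is why both line 109 and line 122 needed it.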

chuzhifeng commented 5 years ago

Yeah, after I changed this code it runs. Thanks.

aconneau commented 5 years ago

Are you using Python 3? The solution proposed by Drappier only works for Python 3 users, but it is indeed the workaround.
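The Python 3 restriction comes from the built-in open() in Python 2, which has no `encoding` parameter. One possible way around that, sketched here under the assumption that the vector file is UTF-8, is `io.open`, which accepts `encoding` on both Python 2.6+ and Python 3. The helper `read_words` and the demo file are hypothetical and not part of models.py.

```python
import io
import os
import tempfile


def read_words(path):
    """Hypothetical helper: return the first token of every line in a UTF-8 file."""
    words = []
    # io.open takes an `encoding` argument on Python 2 and Python 3 alike.
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            word, _vec = line.split(u" ", 1)
            words.append(word)
    return words


# Write a small demo file, then read it back.
fd, demo_path = tempfile.mkstemp(suffix=".vec")
os.close(fd)
with io.open(demo_path, "w", encoding="utf-8") as out:
    out.write(u"hello 0.1 0.2\n\u201d 0.3 0.4\n")

words = read_words(demo_path)
os.remove(demo_path)
print(len(words))
```

On Python 3, `io.open` is the same function as the built-in open(), so this form behaves identically to the workaround above.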

wildauwil commented 5 years ago

I already changed the code but still get the error. Any idea how to fix it? Thanks in advance.