facebookresearch / InferSent

InferSent sentence embeddings

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 1387: ordinal not in range(128) #98

Open dougc333 opened 5 years ago

dougc333 commented 5 years ago

Modify line 42 of data.py, changing with open(glove_path) as f: to with open(glove_path, encoding="utf-8") as f:

dougc333 commented 5 years ago

PyTorch version: '1.0.0.dev20181114'

dougc333 commented 5 years ago

The above modification fixed the UnicodeDecodeError.

zhantong526 commented 5 years ago

Hi dougc333,

I am still getting a UnicodeDecodeError after modifying line 42 in data.py. Can you tell me how to switch to PyTorch version '1.0.0.dev20181114'?

Traceback (most recent call last):
  File ".\working_03_gui_QA.py", line 230, in <module>
    main()
  File ".\working_03_gui_QA.py", line 226, in main
    d=guiclass(root)
  File ".\working_03_gui_QA.py", line 51, in __init__
    self._c=InferSentClass()
  File "C:\Users\zhant\OneDrive\Desktop\2911\System7\Sent_embed.py", line 28, in __init__
    model.build_vocab_k_words(K=100000)
  File "C:\Users\zhant\OneDrive\Desktop\2911\InferSent\models.py", line 146, in build_vocab_k_words
    self.word_vec = self.get_w2v_k(K)
  File "C:\Users\zhant\OneDrive\Desktop\2911\InferSent\models.py", line 124, in get_w2v_k
    for line in f:
  File "C:\Users\zhant\Anaconda3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7674: character maps to <undefined>

Thanks in advance

maulikmadhavi commented 5 years ago

Use encoding="utf-8" in three places across two files:

  1. In models.py, line 110, inside get_w2v: change with open(self.w2v_path) as f: to with open(self.w2v_path, encoding="utf-8") as f:

  2. In models.py, line 123, inside get_w2v_k: change with open(self.w2v_path) as f: to with open(self.w2v_path, encoding="utf-8") as f:

  3. In data.py, line 42, inside get_glove: change with open(glove_path) as f: to with open(glove_path, encoding="utf-8") as f:
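To illustrate why passing encoding="utf-8" matters, here is a minimal sketch that is not InferSent's own code. Without an explicit encoding, open() falls back to the platform default, which is why the two tracebacks in this thread name different codecs ('ascii' and cp1252's 'charmap'). The helper load_vectors and the two-line sample file are hypothetical stand-ins for the word-vector loading loop. The sample uses U+201D, whose UTF-8 bytes are e2 80 9d; that this character is what appears in the real vector file is an assumption, though it would be consistent with the bytes 0xe2 and 0x9d reported above.

```python
import os
import tempfile


def load_vectors(path, encoding="utf-8"):
    """Hypothetical loader: one word followed by its vector values per line."""
    vectors = {}
    with open(path, encoding=encoding) as f:
        for line in f:
            word, vec = line.rstrip("\n").split(" ", 1)
            vectors[word] = [float(x) for x in vec.split()]
    return vectors


# Write a tiny UTF-8 file; the second "word" is U+201D (bytes e2 80 9d).
tmp = tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".txt", delete=False
)
tmp.write("hello 0.1 0.2\n\u201d 0.3 0.4\n")
tmp.close()

# Reading with the file's real encoding works.
vectors = load_vectors(tmp.name)

# Reading with the codecs seen in the tracebacks raises UnicodeDecodeError.
failures = {}
for codec in ("ascii", "cp1252"):
    try:
        load_vectors(tmp.name, encoding=codec)
    except UnicodeDecodeError as err:
        failures[codec] = err.reason

os.remove(tmp.name)
print(sorted(vectors), failures)
```

Every open() call that reads the vector file needs the argument. That would explain why changing only data.py was not enough for the traceback that goes through models.py.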

mohit6522 commented 4 years ago

For me (on Mac), the file was /Users/user1/Library/Python/2.7/lib/python/site-packages/backports/configparser/__init__.py.

I was using the pip version of apache-airflow on Python 2.7.

In that file, update the read function signature from def read(self, filenames, encoding=None): to def read(self, filenames, encoding="utf-8"):
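Where the call to read is in code you control, the encoding can be passed at the call site instead of editing the installed package. The sketch below uses the standard-library configparser on Python 3, whose read method accepts an encoding argument. The section name, key, and file contents are made up for illustration, and whether apache-airflow exposes a way to pass this argument is not something this thread confirms.

```python
import configparser
import os
import tempfile

# Write a small UTF-8 config file containing a non-ASCII value.
tmp = tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".cfg", delete=False
)
tmp.write("[core]\nname = caf\u00e9\n")
tmp.close()

parser = configparser.ConfigParser()
# Pass the encoding explicitly rather than relying on the platform default.
read_ok = parser.read(tmp.name, encoding="utf-8")
value = parser["core"]["name"]

os.remove(tmp.name)
print(value)
```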

Krtonia commented 2 years ago

Where can I find these files (models.py and data.py)?