aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Unicode Decode Error when getting key from embeddings #37

Closed tgalery closed 8 years ago

tgalery commented 8 years ago

I'm doing some word lookups for Portuguese and I got the following traceback:

  File "/home/intruder/source/tgalery/analytyca/analytyca/utils/context.py", line 9, in get_vector
    vector = embeddings[word_key]
  File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/embeddings.py", line 40, in __getitem__
    return self.vectors[self.vocabulary[k]]
  File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/expansion.py", line 29, in __getitem__
    return self.approximate_ids(key)
  File "/usr/local/lib/python2.7/dist-packages/polyglot/mapping/expansion.py", line 52, in approximate_ids
    raise KeyError("{} not found".format(key))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 5: ordinal not in range(128)

Because the format string is a byte string and the incoming key is unicode, building the KeyError message raises a second exception (the UnicodeEncodeError above), which masks the original KeyError.
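To illustrate, here is a minimal sketch of what seems to be going on. On Python 2, formatting a byte-string template with a unicode argument implicitly encodes the argument as ASCII, which fails for a character such as u'\xe1'. The helper names `ascii_encode_fails` and `missing_key_message` and the sample word are made up for this example and are not part of polyglot; the explicit `encode("ascii")` call stands in for the implicit step so the snippet behaves the same on Python 2 and 3.

```python
def ascii_encode_fails(key):
    """Return True if encoding `key` as ASCII raises UnicodeEncodeError.

    Hypothetical helper: it makes explicit the encoding step that Python 2
    performs implicitly when a byte-string template is formatted with a
    unicode argument.
    """
    try:
        key.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


def missing_key_message(key):
    """Build the error message from a unicode template.

    With a unicode template no implicit ASCII encoding takes place,
    so non-ASCII keys are formatted without error.
    """
    return u"{} not found".format(key)


# Sample key with u'\xe1' at position 5, as in the traceback.
word_key = u"estar\xe1"

fails = ascii_encode_fails(word_key)      # the failing step on Python 2
message = missing_key_message(word_key)   # the unicode template avoids it
```

The only change that matters is the `u` prefix on the template; the lookup itself is untouched.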

I'm happy to fix this, but I wonder whether the keys are meant to be in binary format for lookups. Let me know how best to proceed.

aboSamoor commented 8 years ago

The key should always be unicode. Can you please send me a PR?
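For callers that may hold byte strings, one way to guarantee a unicode key before the lookup is sketched below. `to_unicode` is a hypothetical helper, not a polyglot function, and UTF-8 is an assumed input encoding.

```python
def to_unicode(key, encoding="utf-8"):
    """Hypothetical helper: return `key` as a unicode string.

    Byte strings are decoded with `encoding` (UTF-8 is an assumption);
    unicode strings are returned unchanged.
    """
    if isinstance(key, bytes):
        return key.decode(encoding)
    return key


# A UTF-8 encoded byte string and its unicode equivalent.
decoded = to_unicode(b"estar\xc3\xa1")
unchanged = to_unicode(u"casa")
```

The result can then be used as the key, for example `embeddings[to_unicode(word_key)]`.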

tgalery commented 8 years ago

Looks like I had an old version of the repo; the line on master already has the message in unicode. Closing this issue.