artetxem / vecmap

A framework to learn cross-lingual word embedding mappings
GNU General Public License v3.0

Unicode error at line #31 in embeddings.py #23

Open sawan16 opened 5 years ago

sawan16 commented 5 years ago

UnicodeEncodeError: 'utf-8' codec can't encode character '\udcf6' in position 0: surrogates not allowed

artetxem commented 5 years ago

This obviously looks like an encoding problem, but I would need more details to know where it happens. Please report the full stack trace.

SouravDutta91 commented 5 years ago

The 'utf-8' codec sometimes fails when encoding or decoding certain symbols or letters. In those cases, you can either ignore such errors by passing errors='ignore' along with the encoding, or try a different encoding such as latin-1 (also known as ISO-8859-1). Hope this helps.
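
To illustrate the suggestion above, here is a minimal sketch of one way this exact error message can arise and how the two workarounds behave. It assumes the offending data is a latin-1 byte (0xF6, "ö") that was decoded as UTF-8 with errors='surrogateescape'; whether that is what happened in the reported run is only a guess, since no stack trace was posted.

```python
# A latin-1 encoded "ö" is the single byte 0xF6, which is not valid UTF-8.
raw = b"\xf6"

# Decoding with errors="surrogateescape" does not raise; the bad byte is
# carried through as the lone surrogate U+DCF6.
text = raw.decode("utf-8", errors="surrogateescape")

# Strict UTF-8 encoding then refuses to encode the surrogate.
try:
    text.encode("utf-8")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = str(exc)

# Workaround 1: silently drop characters that cannot be encoded.
ignored = text.encode("utf-8", errors="ignore")

# Workaround 2: decode with the encoding the data was actually written in
# (assumed here to be latin-1).
fixed = raw.decode("latin-1")

print(repr(text))
print(strict_error)
print(ignored, fixed)
```

Note that errors='ignore' loses data (the character simply disappears), so reading the file with its real encoding is usually the safer choice when that encoding is known.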

suman101112 commented 3 years ago

The input embedding model is not in the correct format. Use model.save_word2vec_format(filename) to save the fastText or word2vec model.
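
For reference, a rough sketch of what the word2vec text format looks like, assuming that is the format the tool expects: a header line with the vocabulary size and dimensionality, then one word and its vector per line. The helper write_word2vec_text is hypothetical and written only for illustration; in practice the save_word2vec_format call mentioned above (presumably from gensim) would produce the file.

```python
import io


def write_word2vec_text(vectors, file):
    """Write {word: [floats]} in word2vec text format (hypothetical helper)."""
    words = list(vectors)
    dim = len(vectors[words[0]]) if words else 0
    # Header line: vocabulary size and vector dimensionality.
    file.write(f"{len(words)} {dim}\n")
    # One line per word: the word followed by its space-separated values.
    for word in words:
        values = " ".join(f"{x:.6f}" for x in vectors[word])
        file.write(f"{word} {values}\n")


buffer = io.StringIO()
write_word2vec_text({"hello": [0.1, 0.2], "world": [0.3, 0.4]}, buffer)
output = buffer.getvalue()
print(output)
```

A file saved in a library's native binary or pickle format will not have this layout, which may be why reading it as text produces undecodable bytes.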