PrashantRanjan09 / WordEmbeddings-Elmo-Fasttext-Word2Vec

Using pre-trained word embeddings (FastText, Word2Vec)

How to handle out-of-vocabulary words while using pre-trained embeddings? #1

Closed: itsasha closed this issue 5 years ago

itsasha commented 5 years ago

Hi everyone,

I would like to use the German aligned vectors to train a cross-lingual classification model (https://fasttext.cc/docs/en/aligned-vectors.html). If I understand correctly, wiki.de.align.vec is the file I need? I tried to load it in word2vec format, but ran into a problem with out-of-vocabulary words.
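For reference, this is roughly what I tried (a minimal sketch assuming gensim's KeyedVectors; the file path and the example words are placeholders):

```python
from gensim.models import KeyedVectors

# wiki.de.align.vec is a plain-text file in word2vec format
# (first line: vocab_size dimension, then one word + vector per line).
vectors = KeyedVectors.load_word2vec_format("wiki.de.align.vec", binary=False)

print(vectors["haus"].shape)       # works for a word in the vocabulary
print(vectors["hausbootverleih"])  # an unseen word raises KeyError
```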

I know that fastText solves this problem: it builds vectors for out-of-vocabulary words from their character n-grams, which sounds great. However, to load a fastText model I need a .bin file, which is not provided for the aligned vectors. Do you have any ideas for solving this?
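In other words, with a .bin file something like the following would work (a sketch using gensim's fastText loader; wiki.de.bin stands in for the binary model that the aligned vectors do not ship with):

```python
from gensim.models.fasttext import load_facebook_vectors

# The .bin file also stores the subword (character n-gram) vectors,
# so a vector for an unseen word can be assembled from its n-grams.
wv = load_facebook_vectors("wiki.de.bin")
print(wv["hausbootverleih"])  # no KeyError: composed from character n-grams
```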

Thank you!

PrashantRanjan09 commented 5 years ago

@itsasha: here they mention that once you have your file (text properly preprocessed), the training script dumps two files (.bin and .vec). That should solve your problem. I believe your text file has some headers or is not in the right format.
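For example, training fastText on your own preprocessed corpus produces both files. A minimal sketch with the fasttext Python bindings, where data_de.txt is a placeholder for your corpus (the official CLI, ./fasttext skipgram -input data_de.txt -output model, writes model.bin and model.vec directly):

```python
import fasttext

# Train unsupervised fastText embeddings on a preprocessed text file
# (plain text, no header line).
model = fasttext.train_unsupervised("data_de.txt", model="skipgram")

# Save the full model (including subword vectors) as .bin ...
model.save_model("model.bin")

# ... and dump the word vectors in .vec (word2vec text) format.
with open("model.vec", "w", encoding="utf-8") as f:
    words = model.get_words()
    f.write(f"{len(words)} {model.get_dimension()}\n")
    for w in words:
        vec = " ".join(f"{x:.5f}" for x in model.get_word_vector(w))
        f.write(f"{w} {vec}\n")
```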

usmaann commented 5 years ago

Hi Prashant, nice job mate... I am looking for some support regarding your improved word embeddings repo. Is there an email or some other way I can connect with you?