Expand functionality to different word embedding files

bnosac / word2vec

Distributed Representations of Words using word2vec

Apache License 2.0

Expand functionality to different word embedding files #10

Open dafnevk opened 3 years ago

dafnevk commented 3 years ago

Although there is a read.wordvectors function that can read in a plain text file with vectors, the predict.word2vec function only works on 'model' objects, which cannot be created from these word vector files.

Would it be possible to have the predict.word2vec function work on just the embedding matrix? That way, it could be used with all types of word vector models, e.g. ones trained with fastText.

jwijffels commented 3 years ago

predict.word2vec does exactly the same as the function word2vec_similarity, which you can apply to two embedding matrices or vectors.

That will work on embeddings trained with this package, as training is optimised for that similarity, but it might not be what you want if you have embeddings trained in another framework.

That being said, apply word2vec_similarity and see if it works for your embeddings.
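To make the suggestion above concrete, here is a minimal base R sketch of a nearest-term lookup that needs only an embedding matrix and no 'model' object. The toy matrix stands in for whatever read.wordvectors returns (its values are made up), and `nearest_terms` is a hypothetical helper, not a function of the word2vec package. It uses cosine similarity and drops the query term from its own results; both are assumptions about what is wanted here.

```r
# Toy embedding matrix standing in for the result of read.wordvectors():
# one row per term, one column per dimension (values are made up).
emb <- rbind(
  king  = c(0.9, 0.1, 0.0),
  queen = c(0.8, 0.2, 0.1),
  apple = c(0.0, 0.1, 0.9)
)

# Hypothetical helper (not part of the word2vec package): cosine similarity
# of each query term against every row, keeping the top_n closest terms.
nearest_terms <- function(emb, terms, top_n = 2) {
  unit <- emb / sqrt(rowSums(emb^2))               # L2-normalise each row
  sims <- unit[terms, , drop = FALSE] %*% t(unit)  # cosine similarities
  out <- lapply(terms, function(term) {
    s <- sims[term, ]
    s <- s[names(s) != term]                       # drop the query term itself
    s <- head(sort(s, decreasing = TRUE), top_n)
    data.frame(term1 = term, term2 = names(s), similarity = unname(s),
               rank = seq_along(s), stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

nearest <- nearest_terms(emb, "king", top_n = 2)
print(nearest)
```

With the package loaded, a call along the lines of word2vec_similarity(emb["king", , drop = FALSE], emb, top_n = 3) should play the same role on a matrix read by read.wordvectors. Check the function's help page for the similarity type your installed version uses, since embeddings from another framework may need cosine rather than dot-product similarity.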

jwijffels commented 3 years ago

Note that if you need embedding models with subwords, you might as well use sentencepiece_download_model from the sentencepiece R package. This downloads sentencepiece tokenizers alongside the embedding model trained on Wikipedia. These are compatible with this R package.