bnosac / word2vec

Distributed Representations of Words using word2vec
Apache License 2.0

How to add words to dictionary #2

Closed ahmoreira closed 3 years ago

ahmoreira commented 3 years ago

Thank you for developing this package! I am trying to make predictions using the model built with the function word2vec. However, the keywords I am interested in are not part of the dictionary. So when trying the following line:

lookslike_in <- predict(model_in, c("invertebrates", "macrofauna", "meiofauna", "meiobentho"), type = "nearest", top_n = 5)

I get the error message: Error in w2v_nearest(object$model, x = x, top_n = top_n, ...) : Could not find the word in the dictionary: macrofauna

My question is: how can I add words to the dictionary?

Thank you in advance, Hadassa

jwijffels commented 3 years ago

You can't add words to the dictionary: the model can only learn the meaning of the word 'macrofauna' from text that contains it. Did you train it on texts in which this word occurs? As an alternative, you could first tokenize the text using e.g. byte pair encoding with either of the R packages tokenizers.bpe or sentencepiece, collapse the tokenized text back together, build the word2vec model on it, and get the nearest neighbours of the BPE-tokenized terms.
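A rough sketch of the BPE route, using tokenizers.bpe. It trains on the belgium_parliament example data shipped with that package purely so that it runs as-is; the corpus, the vocab_size, the word2vec settings and the query term are all illustrative assumptions, not recommended values. In practice the abstracts would take the place of txt.

```r
library(tokenizers.bpe)
library(word2vec)

# Example corpus shipped with tokenizers.bpe (assumption: stands in for your abstracts).
data(belgium_parliament, package = "tokenizers.bpe")
txt <- tolower(subset(belgium_parliament, language == "french")$text)

# Train a BPE model from a text file; paths are temporary files.
train_file <- tempfile(fileext = ".txt")
writeLines(txt, con = train_file)
bpe_model <- bpe(train_file, coverage = 0.999, vocab_size = 5000, threads = 1,
                 model_path = tempfile(fileext = ".bpe"))

# Split every document into subword pieces, then collapse the pieces
# back into one space-separated string per document.
pieces  <- bpe_encode(bpe_model, x = txt, type = "subwords")
txt_bpe <- sapply(pieces, paste, collapse = " ")

# Build the word2vec model on the subword-tokenized text.
set.seed(123)
w2v <- word2vec(x = txt_bpe, type = "cbow", dim = 15, iter = 5, threads = 1)

# Encode the query term the same way and keep only the pieces the model knows.
query       <- bpe_encode(bpe_model, x = "macrofauna", type = "subwords")[[1]]
vocab       <- summary(w2v, type = "vocabulary")
query_known <- intersect(query, vocab)

# Nearest neighbours are subword pieces, not whole words.
nearest <- if (length(query_known) > 0) {
  predict(w2v, query_known, type = "nearest", top_n = 5)
} else {
  NULL
}
```

Note that the neighbours returned this way are subword pieces, so they still need to be interpreted (or mapped back to full words) afterwards.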

ahmoreira commented 3 years ago

Thank you for your answer! I see what you mean! Indeed, the word 'macrofauna' was not included in the text that I used to train the model. I want to find which words are associated with the keywords that were used to search the literature. Now I understand that not all keywords appear in the abstracts. I think I first need to check which keywords appear in the abstracts and then use only those with the predict function. I will also have a look at your suggestion to tokenize the words! Thank you very much!!
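A minimal sketch of that check: compare the keywords against the model vocabulary from summary(model, type = "vocabulary") and pass only the known ones to predict. The abstracts vector below is a made-up toy corpus and the training settings (dim, iter, min_count) are arbitrary, chosen only so the example runs on its own.

```r
library(word2vec)

# Toy corpus for illustration only (assumption); replace with the real abstracts.
abstracts <- c(
  "benthic invertebrates dominate the sediment samples",
  "meiofauna and invertebrates respond to sediment disturbance",
  "invertebrates and meiofauna were counted in each sediment core"
)
set.seed(42)
model_in <- word2vec(x = abstracts, type = "cbow", dim = 10, iter = 5, min_count = 1)

keywords <- c("invertebrates", "macrofauna", "meiofauna", "meiobentho")

# Words the model has actually seen during training.
vocabulary <- summary(model_in, type = "vocabulary")

# Split the keywords into those in the vocabulary and those that are not.
known   <- keywords[keywords %in% vocabulary]
missing <- setdiff(keywords, known)

# Only query the keywords the model knows, which avoids the
# "Could not find the word in the dictionary" error.
lookslike_in <- predict(model_in, known, type = "nearest", top_n = 5)
```

The missing vector lists the keywords that never occur in the training text, which is useful to report alongside the results.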