bnosac / word2vec

Distributed Representations of Words using word2vec
Apache License 2.0

Using pre-trained vectors with word2vec #3

Closed: luciebaudoin closed this issue 3 years ago

luciebaudoin commented 3 years ago

Hi,

Although you mention it's a possibility, I can't find a clear code example showing how to use a downloaded pre-trained model on a local corpus of text with the R word2vec package. Can you help me with that? Thank you!

jwijffels commented 3 years ago

There are examples at http://www.bnosac.be/index.php/blog/100-word2vec-in-r

luciebaudoin commented 3 years ago

Thanks for your quick reply! I've been through this code before, but it doesn't solve my problem. I understand how to load pre-trained vectors, but when I ask for predictions, I only get the predictions from those pre-trained vectors, regardless of my local corpus.

i.e. doing:

    model <- read.word2vec(file = "/model.bin", normalize = TRUE)

and then:

    l1 <- predict(model, newdata = c("word"), type = "nearest", top_n = 10)

This doesn't make the pre-trained model work on my local corpus of text. It gives me the predictions pre-existing in that model.

I'm looking to do in R something similar to the chrono_train function used with word2vec in Python, and I can't figure out how.

Again, thanks for your invaluable help.

jwijffels commented 3 years ago

What do you mean by chrono_train? Where is that defined?

luciebaudoin commented 3 years ago

Sorry if I am not being clear. I want to model a corpus by initializing with the vectors of a previous model.
This would let me build chronologically trained vectors, as in the following article by Emma Rodman, but in R: https://static1.squarespace.com/static/5ca7d04ea09a7e68ba44e707/t/5cda219af4e1fc94236bc0cf/1557799325771/Diachronic_Word_Vectors___Political_Analysis_Final_Version.pdf

In Python, this "chrono_train" function looks like this:

    def chrono_train(n_iterations, current_corpus, previous_model, output_model):
        for k in range(n_iterations):
            sentence_samples = resample(current_corpus)
            model = Word2Vec.load(previous_model)
            run = k+1
            model.save(output_model)

I hope this is a bit clearer... I'm still new to this, so there's a lot of trial and error.
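For readers unfamiliar with the resampling step in the snippet above, here is a minimal sketch of what it might look like. It assumes that `resample` draws a bootstrap sample (sentences drawn with replacement, same size as the corpus), which the snippet itself does not state. The helper name `bootstrap_sentences` and the toy corpus are made up for illustration, and the continued-training step is deliberately left out.

```python
import random


def bootstrap_sentences(sentences, seed=None):
    """Draw len(sentences) sentences with replacement (one bootstrap sample)."""
    rng = random.Random(seed)
    return rng.choices(sentences, k=len(sentences))


# Toy tokenised corpus for the current time period (invented for illustration).
current_corpus = [["tax", "reform"], ["equal", "rights"], ["free", "trade"]]

# One bootstrap sample per iteration. In a chronological setup, each sample
# would then be used to continue training a model that was initialised from
# the previous period's vectors (not shown here).
samples = [bootstrap_sentences(current_corpus, seed=k) for k in range(3)]
```

Each element of `samples` has as many sentences as the corpus, possibly with repeats, so every iteration sees a slightly different version of the same period.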

jwijffels commented 3 years ago

There is no option in this R package to train on your own corpus starting from an initial set of word vectors. This package allows

  1. training word vectors from scratch on your own corpus
  2. getting the word vectors from a pretrained set for your own corpus

Both of these use cases are shown in the blog post at http://www.bnosac.be/index.php/blog/100-word2vec-in-r

If you want to do transfer learning (continue training starting from an existing set of word vectors), this is implemented in the R package ruimtehol - see section '5. Transfer learning' at https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf
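To make the difference between use case 2 and transfer learning concrete, here is a minimal plain-Python sketch. It is not the R package's API: the toy vectors and the helper names `restrict_to_corpus` and `nearest` are made up for illustration. It only looks up pretrained vectors for words that occur in the local corpus. Nothing is retrained, which is why the nearest neighbours still reflect the pretrained model rather than the local text.

```python
import math

# Toy "pretrained" embedding: word -> vector (values invented for illustration).
pretrained = {
    "king":  [0.9, 0.1, 0.0],
    "queen": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.9, 0.3],
    "pear":  [0.1, 0.8, 0.4],
    "car":   [0.5, 0.5, 0.5],  # never occurs in the local corpus
}

# Local corpus, already tokenised.
corpus = [
    ["the", "king", "ate", "an", "apple"],
    ["the", "queen", "ate", "a", "pear"],
]


def restrict_to_corpus(embedding, sentences):
    """Keep only the pretrained vectors of words that occur in the corpus."""
    vocab = {word for sentence in sentences for word in sentence}
    return {word: vec for word, vec in embedding.items() if word in vocab}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(embedding, word, top_n=1):
    """Rank all other words by cosine similarity to `word`."""
    target = embedding[word]
    scored = [(other, cosine(target, vec))
              for other, vec in embedding.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Pure lookup: the vectors themselves are untouched by the local corpus.
local = restrict_to_corpus(pretrained, corpus)
closest_to_king = nearest(local, "king")
```

Corpus words without a pretrained vector (such as "the") are simply absent from `local`; giving them vectors, or updating the existing ones, is what requires actual training.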

luciebaudoin commented 3 years ago

Thank you very much!

jwijffels commented 3 years ago

BTW, if you plan to do Procrustes matrix alignment, feel free to share your code. I'd be interested in this as well.
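For reference, a minimal sketch of orthogonal Procrustes alignment between two embedding matrices, written with NumPy rather than R. It assumes both matrices have the same shape and that their rows refer to the same words in the same order. The function name `procrustes_align` and the random test data are made up for illustration; this is not code from either package mentioned in this thread.

```python
import numpy as np


def procrustes_align(source, target):
    """Rotate `source` onto `target` with the best orthogonal matrix.

    Solves min_R ||source @ R - target||_F subject to R.T @ R = I,
    using the SVD of source.T @ target.
    Returns the aligned matrix and the orthogonal matrix R.
    """
    u, _, vt = np.linalg.svd(source.T @ target)
    rotation = u @ vt
    return source @ rotation, rotation


# Toy check: build `source` as an orthogonally transformed copy of `target`,
# then recover `target` by alignment.
rng = np.random.default_rng(0)
target = rng.normal(size=(5, 3))             # 5 "words", 3 dimensions
q, _ = np.linalg.qr(rng.normal(size=(3, 3))) # random orthogonal matrix
source = target @ q

aligned, rotation = procrustes_align(source, target)
```

In a diachronic setting, `source` and `target` would be the embeddings of two time periods restricted to their shared vocabulary; after alignment, per-word distances between `aligned` and `target` can be compared directly.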