bnosac / doc2vec

Distributed Representations of Sentences and Documents

Feature request: possibility to use pretrained word vectors as a starting point for doc2vec #13

Closed: dominiqueemmanuel closed this issue 3 years ago

dominiqueemmanuel commented 3 years ago

Hi,

Thank you for bringing doc2vec to R!

This issue is a feature request: do you think it would be possible to allow the doc2vec algorithm to use pretrained word vectors?

This would be useful, for instance, if you have learned word vectors on a large corpus and would like to use them as the starting point for the doc2vec algorithm on a smaller corpus.

I hope this is clear and appropriate.

Kind regards, Dominique

jwijffels commented 3 years ago

You can already do that in the ruimtehol R package; see the transfer learning example at https://cran.r-project.org/web/packages/ruimtehol/vignettes/ground-control-to-ruimtehol.pdf

This package (doc2vec) does not provide that functionality, but if you can make a pull request, I would be glad to incorporate it.

dominiqueemmanuel commented 3 years ago

Thank you for your answer!

In fact, I was just comparing the two methods/packages.

It seems to me that doc2vec is faster than ruimtehol (at least with the settings I chose in my benchmark).

Regarding the pull request, I guess I would have to dig into the C++ code... Unfortunately, I'm only fluent in R.

Best regards, Dominique

jwijffels commented 3 years ago

The constructor where the initialisation happens is at https://github.com/bnosac/doc2vec/blob/master/src/doc2vec/NN.cpp#L4

It is indeed C++. Go for it.
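For anyone who wants to attempt this, here is a minimal sketch of the general idea: seed the word-vector matrix from pretrained vectors where available, and fall back to small random values for the remaining words. This is not the code in NN.cpp. The function name `init_word_vectors`, the `PretrainedMap` type, and the row-major layout are all hypothetical, chosen only for illustration. The random fallback range of ±0.5/dim follows the classic word2vec scheme and is an assumption about what a doc2vec implementation might use.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type: maps a word to its pretrained vector.
using PretrainedMap = std::unordered_map<std::string, std::vector<float>>;

// Hypothetical helper (not part of the doc2vec package).
// Returns a row-major (vocab.size() x dim) matrix of initial word vectors.
std::vector<float> init_word_vectors(const std::vector<std::string>& vocab,
                                     const PretrainedMap& pretrained,
                                     std::size_t dim,
                                     unsigned seed = 42) {
    assert(dim > 0);
    std::mt19937 rng(seed);
    // Assumed fallback: uniform values in [-0.5/dim, 0.5/dim), as in word2vec.
    const float bound = 0.5f / static_cast<float>(dim);
    std::uniform_real_distribution<float> dist(-bound, bound);

    std::vector<float> weights(vocab.size() * dim);
    for (std::size_t i = 0; i < vocab.size(); ++i) {
        float* row = weights.data() + i * dim;
        auto it = pretrained.find(vocab[i]);
        if (it != pretrained.end() && it->second.size() == dim) {
            // Word has a pretrained vector of the right size: copy it in.
            std::copy(it->second.begin(), it->second.end(), row);
        } else {
            // Unknown word or dimension mismatch: random initialisation.
            for (std::size_t j = 0; j < dim; ++j) row[j] = dist(rng);
        }
    }
    return weights;
}
```

Training would then proceed as usual from this matrix. Pretrained vectors whose dimension differs from `dim` are silently ignored here; a real implementation might prefer to raise an error instead.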

jwijffels commented 3 years ago

Implemented this in commit https://github.com/bnosac/doc2vec/commit/2bf5542005998d61d4ac5b9b7476ff49eb458100

Feel free to test it out.