bmschmidt / wordVectors

An R package for creating and exploring word2vec and other word embedding models

n-grams greater than 2 #50

Open lawest59 opened 6 years ago

lawest59 commented 6 years ago

I was looking to use trigrams because there are significant three-word phrases in my corpus (e.g. "economies in transition" to refer to developing countries). I used the following code in R.

statements <- prep_word2vec(basePath, "docs.txt", lowercase=T, bundle_ngrams = 3, threshold = 50)

w2v <- train_word2vec("docs.txt", output="./stat_vecs.bin", threads=detectCores(), vectors=100, window=7, force=TRUE)

It worked as expected, except that I got some four-word phrases (e.g. "so_that_they_can"). I'm curious why this is happening. Thanks!

bmschmidt commented 6 years ago

Interesting problem. This is a documentation issue, I think. I can't test right now, but I suspect the reason is that 'bundle_ngrams' just runs the bigram code multiple times. If 'so_that' and 'they_can' are both identified as common bigrams in the first run, the second run can treat them as two adjacent tokens and join them into a single four-word phrase, even though the argument name implies a maximum of three.
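To make the suspected mechanism concrete, here is a toy sketch in plain R. It does not call wordVectors or the underlying word2phrase code: `merge_pass()` is a hypothetical helper invented for illustration, and the phrase lists passed to it stand in for whatever pairs would clear the threshold in a real run.

```r
# Toy illustration only: merge_pass() is a hypothetical helper, not part of wordVectors.
# It joins two adjacent tokens with "_" whenever the joined pair appears in `phrases`.
merge_pass <- function(tokens, phrases) {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    # Build the candidate pair, or NA when we are on the last token.
    pair <- if (i < length(tokens)) paste(tokens[i], tokens[i + 1], sep = "_") else NA_character_
    if (!is.na(pair) && pair %in% phrases) {
      out <- c(out, pair)   # merge the pair and skip both tokens
      i <- i + 2
    } else {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

tokens <- c("so", "that", "they", "can", "vote")

# Pass 1: assume "so_that" and "they_can" both count as common bigrams.
pass1 <- merge_pass(tokens, c("so_that", "they_can"))

# Pass 2: the merged tokens are now adjacent, so they can be paired again.
pass2 <- merge_pass(pass1, c("so_that_they_can"))

# Count how many original words ended up in each token.
words_per_token <- lengths(strsplit(pass2, "_", fixed = TRUE))
print(pass2)
print(words_per_token)
```

Under these assumptions, two passes of a bigram merger can produce a token of up to four words. The same `lengths(strsplit(...))` count could be applied to a vocabulary to spot phrases that are longer than intended.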
