n-grams greater than 2

Interesting problem. I think this is a documentation issue. I can't test right now, but I suspect it happens because 'bundle_ngrams' just runs the bigram code multiple times. If 'so_that' and 'they_can' are identified as common bigrams in the first run, the second run can join them into a single four-word phrase, even though 'bundle_ngrams = 3' implies at most three words.
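To make that suspicion concrete, here is a toy sketch in R of what repeated bigram passes would do. It is not the wordVectors code: `merge_pass` is a hypothetical helper, and the phrase lists are hand-picked stand-ins for whatever would score above the threshold in a real run.

```r
# Toy illustration only: NOT the wordVectors implementation.
# merge_pass() is a hypothetical helper that joins adjacent tokens
# whenever their underscore-joined pair appears in `phrases`.
merge_pass <- function(tokens, phrases) {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens)) {
      candidate <- paste(tokens[i], tokens[i + 1], sep = "_")
      if (candidate %in% phrases) {
        out <- c(out, candidate)
        i <- i + 2  # skip both tokens that were just merged
        next
      }
    }
    out <- c(out, tokens[i])
    i <- i + 1
  }
  out
}

tokens <- c("so", "that", "they", "can", "vote")

# Pass 1: assume these two bigrams were frequent enough to be bundled.
pass1 <- merge_pass(tokens, c("so_that", "they_can"))

# Pass 2: each token may already be a bigram, so one more pairwise
# merge can span four of the original words.
pass2 <- merge_pass(pass1, c("so_that_they_can"))

print(pass1)
print(pass2)
```

If the real code works this way, each extra pass can double the longest possible phrase, so two passes would allow up to four words rather than three.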

On Fri, Jul 20, 2018, 3:23 PM lawest59 notifications@github.com wrote:

I was looking to use trigrams because there are significant three-word phrases in my corpus (e.g. "economies in transition" to refer to developing countries). I used the following code in R.

library(wordVectors)
library(parallel)  # provides detectCores()

statements <- prep_word2vec(basePath, "docs.txt", lowercase=TRUE, bundle_ngrams = 3, threshold = 50)

w2v <- train_word2vec("docs.txt", output="./stat_vecs.bin", threads=detectCores(), vectors=100, window=7, force=TRUE)

It worked as expected, except that I got some four-word phrases (e.g. "so_that_they_can"). I'm curious why this is happening. Thanks!


bmschmidt / wordVectors

n-grams greater than 2 #50