importing bag of words data into gensim

BIDS-collaborative / destress

Helping @peparedes with text analysis of livejournal data

ISC License


importing bag of words data into gensim #35

Closed xih closed 7 years ago

xih commented 9 years ago

@davclark @geneyoo @lambdaloop

Hey Dav

We're trying to import our masterdict into python so that we can train it with gensim. Pierre is saving the data into a scipy sparse matrix.

Training word2vec in gensim requires sentences (lists of words) as input. Is there any way I can feed it a sparse matrix?

Cheers

davclark commented 9 years ago

word2vec is based on "skip-grams" - i.e., which words occur near each other in linear text. If I understand correctly, you are getting a bag-of-words representation from @lambdaloop, which discards word order, so it will not be usable as a training sample for word2vec.
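
To make the distinction concrete, here is a minimal sketch (not gensim's actual implementation) of why word order matters. The function name `skipgram_pairs`, the window size, and the example sentences are illustrative assumptions. Two sentences with the same words in a different order have identical bag-of-words counts but produce different (center, context) pairs.

```python
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for words within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Same words, different order (hypothetical example sentences).
sentence_a = ["i", "felt", "really", "tired", "after", "work", "today"]
sentence_b = ["tired", "i", "today", "felt", "work", "really", "after"]

# The bag-of-words counts cannot tell the two sentences apart.
same_bag = Counter(sentence_a) == Counter(sentence_b)

# The skip-gram pairs can, because they depend on which words are neighbours.
same_pairs = set(skipgram_pairs(sentence_a)) == set(skipgram_pairs(sentence_b))
```

Here `same_bag` is true while `same_pairs` is false, which is the information a sparse count matrix has already thrown away.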

However, it is not clear that you need to train a new model for a first pass. You should be able to use a pre-existing model, e.g. from Google. Note that Google hosts it on Google Drive, so the download requires browser interaction (i.e., don't try to use wget or curl). If you do that, you should probably use the Google News model. Word2vec can read that in with Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True) - read more in the docs.
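
A hedged sketch of loading those vectors with a newer gensim release, where (as far as I know) the loader lives on `KeyedVectors` rather than on `Word2Vec`. The function name `load_google_news` is hypothetical, and the default path assumes the file sits in the working directory. The Google News file is in the binary word2vec format, so the loader needs `binary=True`.

```python
def load_google_news(path="GoogleNews-vectors-negative300.bin.gz"):
    """Load pre-trained vectors stored in the binary word2vec format."""
    # Imported inside the function so this file can be read without gensim installed.
    from gensim.models import KeyedVectors

    # binary=True because the Google News download is the binary format, not text.
    return KeyedVectors.load_word2vec_format(path, binary=True)
```

Calling `load_google_news()` should return an object that supports lookups such as `vectors["work"]` and `vectors.most_similar("work")`. Loading takes a while and needs several gigabytes of memory.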

At this point, you can either map the sparse matrix indices back to words, or modify the model by replacing its words with your numeric indices. The first is probably fine (and more straightforward), but the second is almost certainly going to be faster.
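
As a rough illustration of the first option, the sketch below maps sparse entries back to words. It assumes rows are documents and columns are vocabulary indices. The thread does not state the actual layout, and `vocab`, the triplets, and `bags_by_document` are all hypothetical. The triplets mirror what a `scipy.sparse.coo_matrix` exposes as `.row`, `.col`, and `.data`, but plain lists are used so the example has no dependencies.

```python
from collections import defaultdict

# Hypothetical vocabulary: position in the list is the column index.
vocab = ["happy", "sad", "tired", "work"]

# Hypothetical COO-style triplets: (document row, word column, count).
rows = [0, 0, 1, 1, 1]
cols = [0, 3, 1, 2, 3]
counts = [2, 1, 1, 3, 1]


def bags_by_document(rows, cols, counts, vocab):
    """Group sparse entries into one {word: count} dict per document."""
    bags = defaultdict(dict)
    for r, c, n in zip(rows, cols, counts):
        bags[r][vocab[c]] = n
    return dict(bags)


bags = bags_by_document(rows, cols, counts, vocab)
```

Each word recovered this way can then be looked up in the pre-trained model. Words missing from the model's vocabulary would need to be skipped or handled separately.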

davclark commented 9 years ago

Also - please do not commit that file to the git repo! You should add it to .gitignore.

lambdaloop commented 9 years ago

Hi Dav! Ah I see, I thought that it was only based on a bag-of-words model.

We actually did use the Google News model as a first pass! You can see some initial results here: https://github.com/berkeley-dsc/destress/issues/33

But we thought we may get some benefit from training on the more informal livejournal data instead. I'll provide actual sentences with proper word order then.

davclark commented 9 years ago

Right - now that you mention it, I remember that :\

Note that while gensim supports continued training of its own models, the models from Google don't (I think) include all the data needed to continue training. But you should check around if you decide to take a hybrid approach.

geneyoo commented 9 years ago

I think we decided to just try to make another copy of our data in a .txt file with one sentence per line. That seems to be the input that gensim wants.
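
A minimal sketch of reading that one-sentence-per-line format, assuming whitespace-separated tokens. gensim ships `LineSentence` in `gensim.models.word2vec` for this purpose; the class below (hypothetical name `SentencesFromFile`) shows the same idea using only the standard library. The iterable has to be restartable because training passes over the corpus more than once.

```python
import os
import tempfile


class SentencesFromFile:
    """Yield one list of tokens per line; can be iterated more than once."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Reopening the file on every call is what makes repeated passes possible.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                tokens = line.split()
                if tokens:  # skip blank lines
                    yield tokens


# Tiny demonstration file standing in for the real corpus.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("i had a long day\n\nwork was stressful today\n")
    demo_path = tmp.name

sentences = SentencesFromFile(demo_path)
first_pass = list(sentences)
second_pass = list(sentences)  # works because __iter__ reopens the file
os.remove(demo_path)
```

An object like `sentences` (pointed at the real corpus file) can be passed to gensim as `Word2Vec(sentences=...)`. The remaining training parameters differ between gensim versions, so check the docs for the installed release.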

geneyoo commented 9 years ago

It looks like my gensim implementation with Pierre's generated text files is working. It's currently building and training the model on Mercury.

Oh, and @davclark, I was prompted to "update" on Mercury, so I typed the command it gave me... and then it said that because I don't have sudo access, it would report me. Just thought I should let you know.

lambdaloop commented 9 years ago

Tell us when it's done, so that @xih can recreate a bidmach word2vec matrix, and I can plug it into my query program.

davclark commented 9 years ago

@geneyoo - don't worry about system updates. I'll take care of that. Avoiding it right now, as we don't want the system to change!