TromboneDavies / PolarOps


Incorporate pre-trained word embeddings into classifier #22

Open divilian opened 3 years ago

divilian commented 3 years ago
  1. First, do a little reading on what the technique of word embedding is. Google and Wikipedia are as good a place as any to start.
  2. Then, figure out how to make this work with the nltk package. It may provide a way to do this out of the box, or it may integrate with word2vec, GloVe, or some other framework. I'm sure people have combined embeddings with nltk plenty of times, so it's just a matter of digging around and figuring out how to make it work for us.
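To make step 1 concrete: an embedding maps each word to a dense vector so that related words end up with nearby vectors. A toy sketch (the 3-d vectors below are invented for illustration; real sets like word2vec or GloVe use 100-300 dimensions learned from large corpora):

```python
import numpy as np

# Toy 3-d embedding table (made-up numbers, purely illustrative).
emb = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.7, 0.1, 0.2]),
}

def cosine(u, v):
    """Cosine similarity: near 1.0 = same direction, < 0 = opposed."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similar words get nearby vectors, so "good"/"great" score high
# while "good"/"bad" score low.
sim_syn = cosine(emb["good"], emb["great"])
sim_ant = cosine(emb["good"], emb["bad"])
```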
divilian commented 3 years ago

Look specifically for sample code that uses word2vec in a text classification setting.

divilian commented 3 years ago

Is using pre-trained embeddings a good option for us? Pros and cons? Etc.

divilian commented 3 years ago

From what I've been reading, it sounds like computing your own word embeddings (as opposed to using a pre-trained set) is really only viable if you have a great deal of training data. Since we don't (yet), I think we're going to have to use pre-trained. So I want to slightly change what this Issue is (or we can create a new one if you'd rather) to be: "google around for pre-trained word embedding vector data sets that are publicly available, and try to find one that seems appropriate for Reddit comments."
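For reference, the pre-trained GloVe downloads are plain text: one word per line followed by its vector components, with no header line. A minimal loader, shown here on an inline two-line sample rather than the real multi-hundred-megabyte file (the sample words and numbers are invented):

```python
import io
import numpy as np

def load_glove(fh):
    """Parse GloVe text format: `word v1 v2 ... vN` per line."""
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Tiny stand-in for a real file such as a GloVe Twitter download.
sample = io.StringIO("the 0.1 0.2 0.3\nreddit 0.4 0.5 0.6\n")
vecs = load_glove(sample)
```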

divilian commented 3 years ago
  1. Figure out whether the "10 data set" word embeddings are word2vec or GloVe.
  2. Actually get word embeddings downloaded and installed (?) and integrated into the classifier.
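On item 1, the two text formats are easy to tell apart: word2vec's text export opens with a header line of exactly two integers (vocab size and dimensionality), while a GloVe file starts directly with the first word's vector. A sniffing helper (my own assumption about well-formed files, not a library API):

```python
def sniff_format(first_line):
    """Guess 'word2vec' vs 'glove' from a vector file's first line.

    word2vec text format begins with a `vocab_size dim` header of
    two integers; GloVe lines are a word followed by many floats.
    """
    parts = first_line.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "word2vec"
    return "glove"

fmt_w2v = sniff_format("400000 100")
fmt_glove = sniff_format("the 0.418 0.24968 -0.41242")
```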
vgcagle commented 3 years ago

Link to where I found "10 data set" word embeddings https://datasetsearch.research.google.com/search?query=Word%20Embeddings&docid=L2cvMTFqOWMzeDFsMA%3D%3D

Link to GloVe Twitter embeddings https://www.kaggle.com/jdpaletto/glove-global-vectors-for-word-representation
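Once a set like the GloVe Twitter vectors above is loaded, "integrated into the classifier" usually means turning each comment into the average of its words' vectors and feeding that feature vector to any standard learner. A sketch with invented 2-d vectors and a nearest-centroid rule standing in for our real classifier:

```python
import numpy as np

# Invented 2-d vectors; in practice these come from the GloVe file.
emb = {"love": np.array([1.0, 0.0]), "great": np.array([0.9, 0.1]),
       "hate": np.array([-1.0, 0.0]), "awful": np.array([-0.9, 0.1])}

def featurize(comment):
    """Average the embeddings of the known words in the comment."""
    vecs = [emb[w] for w in comment.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Nearest-centroid stand-in for the real classifier: label a comment
# by whichever class centroid its feature vector is closest to.
train = {"pos": ["love great", "great"], "neg": ["hate awful", "awful"]}
centroids = {lab: np.mean([featurize(c) for c in docs], axis=0)
             for lab, docs in train.items()}

def classify(comment):
    f = featurize(comment)
    return min(centroids, key=lambda lab: np.linalg.norm(f - centroids[lab]))

pred = classify("i love this")
```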

divilian commented 3 years ago

@vgcagle: add Li paper to Zotero
@rockladyeagles: help @vgcagle get gensim installed