TromboneDavies / PolarOps


Incorporate pre-trained word embeddings into classifier #22

Open divilian opened 3 years ago

divilian commented 3 years ago
  1. First, do a little reading on what the technique of word embedding is. Google and Wikipedia are as good a place as any to start.
  2. Then, figure out how to make this work with the nltk package. It may provide a way to do this out of the box, or it may integrate with word2vec, GloVe, or some other framework. I'm sure people have combined embeddings with nltk plenty of times, so it's just a matter of digging around and figuring out how to make it work for us.
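To make step 1 concrete: an embedding maps each word to a dense vector so that related words end up with nearby vectors. A toy sketch (the 3-d vectors below are invented for illustration; real sets like word2vec or GloVe use 100-300 dimensions learned from large corpora):

```python
import numpy as np

# Toy 3-d embedding table (made-up numbers, purely illustrative).
emb = {
    "good":  np.array([0.9, 0.1, 0.0]),
    "great": np.array([0.8, 0.2, 0.1]),
    "bad":   np.array([-0.7, 0.1, 0.2]),
}

def cosine(u, v):
    """Cosine similarity: near 1.0 = same direction, < 0 = opposed."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similar words get nearby vectors, so "good"/"great" score high
# while "good"/"bad" score low.
sim_syn = cosine(emb["good"], emb["great"])
sim_ant = cosine(emb["good"], emb["bad"])
```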
divilian commented 3 years ago

Look specifically for sample code that uses word2vec in a text classification setting.

divilian commented 3 years ago

Is using pre-trained embeddings a good option for us? Pros and cons? Etc.

divilian commented 3 years ago

From what I've been reading, it sounds like computing your own word embeddings (as opposed to using a pre-trained set) is really only viable if you have a great deal of training data. Since we don't (yet), I think we're going to have to use pre-trained. So I want to slightly change what this Issue is (or we can create a new one if you'd rather) to be: "google around for pre-trained word embedding vector data sets that are publicly available, and try to find one that seems appropriate for Reddit comments."
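For reference, the pre-trained GloVe downloads are plain text: one word per line followed by its vector components, with no header line. A minimal loader, shown here on an inline two-line sample rather than the real multi-hundred-megabyte file (the sample words and numbers are invented):

```python
import io
import numpy as np

def load_glove(fh):
    """Parse GloVe text format: `word v1 v2 ... vN` per line."""
    vectors = {}
    for line in fh:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array([float(x) for x in parts[1:]])
    return vectors

# Tiny stand-in for a real file such as a GloVe Twitter download.
sample = io.StringIO("the 0.1 0.2 0.3\nreddit 0.4 0.5 0.6\n")
vecs = load_glove(sample)
```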

divilian commented 3 years ago
  1. Figure out whether the "10 data set" word embeddings are word2vec or GloVe.
  2. Actually get word embeddings downloaded and installed (?) and integrated into the classifier.
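On item 1, the two text formats are easy to tell apart: word2vec's text export opens with a header line of exactly two integers (vocab size and dimensionality), while a GloVe file starts directly with the first word's vector. A sniffing helper (my own assumption about well-formed files, not a library API):

```python
def sniff_format(first_line):
    """Guess 'word2vec' vs 'glove' from a vector file's first line.

    word2vec text format begins with a `vocab_size dim` header of
    two integers; GloVe lines are a word followed by many floats.
    """
    parts = first_line.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "word2vec"
    return "glove"

fmt_w2v = sniff_format("400000 100")
fmt_glove = sniff_format("the 0.418 0.24968 -0.41242")
```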
vgcagle commented 3 years ago

Link to where I found "10 data set" word embeddings https://datasetsearch.research.google.com/search?query=Word%20Embeddings&docid=L2cvMTFqOWMzeDFsMA%3D%3D

Link to GloVe Twitter embeddings https://www.kaggle.com/jdpaletto/glove-global-vectors-for-word-representation
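Once a set like the GloVe Twitter vectors above is loaded, "integrated into the classifier" usually means turning each comment into the average of its words' vectors and feeding that feature vector to any standard learner. A sketch with invented 2-d vectors and a nearest-centroid rule standing in for our real classifier:

```python
import numpy as np

# Invented 2-d vectors; in practice these come from the GloVe file.
emb = {"love": np.array([1.0, 0.0]), "great": np.array([0.9, 0.1]),
       "hate": np.array([-1.0, 0.0]), "awful": np.array([-0.9, 0.1])}

def featurize(comment):
    """Average the embeddings of the known words in the comment."""
    vecs = [emb[w] for w in comment.split() if w in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

# Nearest-centroid stand-in for the real classifier: label a comment
# by whichever class centroid its feature vector is closest to.
train = {"pos": ["love great", "great"], "neg": ["hate awful", "awful"]}
centroids = {lab: np.mean([featurize(c) for c in docs], axis=0)
             for lab, docs in train.items()}

def classify(comment):
    f = featurize(comment)
    return min(centroids, key=lambda lab: np.linalg.norm(f - centroids[lab]))

pred = classify("i love this")
```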

divilian commented 3 years ago

@vgcagle: add Li paper to Zotero
@rockladyeagles: help @vgcagle get gensim installed