AndreiBarsan opened 7 years ago
For the CIL project I used word2vec embeddings trained on the text itself (in this case, the questions themselves), and it worked quite well.
Awesome! I think we could try both. Gensim FTW! I'm only a bit worried that training locally may not be too helpful, because many questions don't really offer that much co-occurrence information between words (as opposed to training on a Wikipedia dump, for instance).
You could also make the embeddings trainable instead of fixed by adding a lookup layer that fine-tunes the word embeddings. This blows up the parameter count and training time, but I guess it is worth a shot. See also: https://github.com/fchollet/keras/issues/853
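A minimal numpy sketch of the lookup-layer idea (in Keras this is an `Embedding` layer with `trainable=True`); all sizes and names here are illustrative. The point is that the matrix starts from pretrained vectors but its rows get updated during training:

```python
# Sketch of a trainable lookup layer: the embedding matrix is initialized
# from (pretend) pretrained vectors and fine-tuned by gradient updates.
import numpy as np

vocab_size, dim = 5, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, dim))   # stand-in for pretrained vectors
E_before = E.copy()

def forward(token_ids):
    # The lookup layer is just row selection from the embedding matrix.
    return E[token_ids]

def backward(token_ids, grad_out, lr=0.1):
    # Gradients flow only into the rows that were actually looked up,
    # so only words seen in training get their vectors adjusted.
    np.add.at(E, token_ids, -lr * grad_out)

ids = np.array([1, 3])
out = forward(ids)
backward(ids, grad_out=np.ones_like(out))
```

This also shows why the parameter count blows up: the whole `vocab_size * dim` matrix becomes trainable.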
@bernhard2202 It may still help if we also train the embeddings. But perhaps we could leave it as low-priority.
Oh, and regarding GloVe, their changelog mentions that "To reduce memory usage and loading time, we've trimmed the vocabulary down to 1m entries.", which means we may be able to squeeze out a little more accuracy by using an untrimmed vocabulary (the 1.9M-entry or even the 2.2M-entry one) from http://nlp.stanford.edu/projects/glove/. This should probably stay low-priority anyway.
And yes, it seems that when the blogpost was originally written, spacy was not using GloVe, but they switched to using GloVe vectors in the meantime.
We're currently using the 300-d Goldberg and Levy 2014 embeddings, the default from the spacy library the original blogpost was using (https://avisingh599.github.io/deeplearning/visual-qa/). The author of that post mentioned that using GloVe embeddings should significantly improve accuracy.