Did you find a fix? Maybe model.build_vocab(sentences, tokenize=True) helps?
EDIT: I have the same error. I tried to build the vocab from the sentences and then model.update_vocab('</s>'), but it didn't work, so I'm not sure what to do.
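For reference, the vocab calls being discussed look like this in the encoder API; sentences and new_sentences here are just placeholder lists of strings:

    # build the vocabulary from your own sentences (tokenize=True runs the word tokenizer)
    model.build_vocab(sentences, tokenize=True)

    # later, add words from additional sentences without rebuilding everything
    model.update_vocab(new_sentences, tokenize=True)

Note that neither call gives </s> a vector unless it actually appears in the embedding file, which is the problem discussed further down.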
I had this error using FastText vectors too.
I did this, but it isn't a great fix:
for i in range(len(batch)):
    for j in range(len(batch[i])):
        # this next line here is my change
        if batch[i][j] != self.eos:
            embed[j, i, :] = self.word_vec[batch[i][j]]
The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.
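If you do want to try a larger vocabulary, a minimal sketch using the build_vocab_k_words call from the demo notebook (the K value is just an example):

    # load the K most frequent words from the embedding file into the vocab
    model.build_vocab_k_words(K=100000)

That still won't add </s> if the embedding file doesn't contain it, though.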
It happens when a (real) word isn't found in your embedding. Then you end up with </s>, which should never be in any embedding dictionary (it's supposed to be a non-word token).
What should probably happen is that it returns an average or random vector instead. In my "fix" it returns a zero vector (which is kind of OK too, since embed is zero-initialized in get_batch, so skipping the assignment simply leaves zeros).
@nlothian Ah I see. Thanks!
Hi,
In demo.ipynb, if you use infersent1.pkl, please use the standard GloVe vectors (because the LSTM has been trained with those), and use the fastText common-crawl vectors for infersent2.pkl. It is also important to specify the version in "params_model": the version is "1" for infersent1.pkl and "2" for infersent2.pkl.
Thanks,
Alexis
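Concretely, the setup from the README looks roughly like this for version 2; the file paths below are placeholders for wherever you stored the model and the fastText vectors:

    import torch
    from models import InferSent

    V = 2  # 1 for infersent1.pkl (GloVe), 2 for infersent2.pkl (fastText)
    MODEL_PATH = 'encoder/infersent%s.pkl' % V    # placeholder path
    W2V_PATH = 'fastText/crawl-300d-2M.vec'       # placeholder path

    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
    model = InferSent(params_model)
    model.load_state_dict(torch.load(MODEL_PATH))
    model.set_w2v_path(W2V_PATH)

For version 1, set V = 1 and point W2V_PATH at glove.840B.300d.txt instead.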
Hi all,
This issue is due to the prepare_samples method (lines 193-201 of models.py), as follows:
# filters words without w2v vectors
for i in range(len(sentences)):
    s_f = [word for word in sentences[i] if word in self.word_vec]
    if not s_f:
        import warnings
        warnings.warn('No words in "%s" (idx=%s) have w2v vectors. \
                       Replacing by "</s>"..' % (sentences[i], i))
        s_f = [self.eos]
    sentences[i] = s_f
That is, the end-of-sentence token </s> is used to represent the whole sentence if none of the tokens in the sentence have corresponding vectors.
As @nlothian suggests, the simplest solution might be setting the vector for the EOS token (</s>) to a mean or zero vector. Here, I add the following very short snippet to the get_w2v method, before returning word_vec:
if self.eos not in word_vec:
    word_vec[self.eos] = np.mean(np.stack(word_vec.values(), axis=0), axis=0)
It should also work for fastText, i.e. version 2.
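If you'd rather not edit models.py, a sketch of the same idea applied from the calling code, assuming the model exposes word_vec after build_vocab (as it does in the current models.py):

    import numpy as np

    model.build_vocab(sentences, tokenize=True)

    # give '</s>' a fallback vector (the mean of all loaded vectors) if the
    # embedding file doesn't contain it
    if '</s>' not in model.word_vec:
        vecs = np.stack(list(model.word_vec.values()), axis=0)
        model.word_vec['</s>'] = vecs.mean(axis=0)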
I'm trying to run the demo.ipynb notebook in the encoder module, with 300-dimensional GloVe vectors. I've run all the commands as detailed in the Readme and the notebook, but at the model.encode command I get an error. Should I explicitly add the </s> symbol to the word vector file? Thanks!
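For reference, the notebook steps in question look roughly like the following (the GloVe path is a placeholder, and the exact loader call may differ between repo versions):

    model.set_w2v_path('GloVe/glove.840B.300d.txt')   # wherever the vectors live
    model.build_vocab(sentences, tokenize=True)
    embeddings = model.encode(sentences, tokenize=True)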