facebookresearch / InferSent

InferSent sentence embeddings

InferSent encoder demo with GloVe - Key Error #81

Closed: Priya22 closed this issue 6 years ago

Priya22 commented 6 years ago

I'm trying to run the demo.ipynb notebook in the encoder module, with 300-dimensional GloVe vectors. I've run all the commands as detailed in the README and the notebook, but the model.encode call fails with the following error:

KeyError                                  Traceback (most recent call last)
<ipython-input-36-3fb4b1a1a3f7> in <module>()
----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
      2 print('nb sentences encoded : {0}'.format(len(embeddings)))

/ais/hal9000/vkpriya/InferSent-master/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
    220         for stidx in range(0, len(sentences), bsize):
    221             batch = Variable(self.get_batch(
--> 222                         sentences[stidx:stidx + bsize]), volatile=True)
    223             if self.is_cuda():
    224                 batch = batch.cuda()

/ais/hal9000/vkpriya/InferSent-master/encoder/models.py in get_batch(self, batch)
    172         for i in range(len(batch)):
    173             for j in range(len(batch[i])):
--> 174                 embed[j, i, :] = self.word_vec[batch[i][j]]
    175 
    176         return torch.FloatTensor(embed)

KeyError: </s>

Should I explicitly add the </s> symbol to the word vector file? Thanks!
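
(For reference, explicitly adding the token would mean appending one line to the embedding text file. A minimal sketch, assuming 300-dimensional GloVe-format vectors; the path is a placeholder:)

    # hypothetical: append a zero vector for the '</s>' token to a GloVe-format text file
    dim = 300  # must match the dimensionality of the existing vectors
    with open('path/to/vectors.txt', 'a', encoding='utf-8') as f:
        f.write('</s> ' + ' '.join(['0.0'] * dim) + '\n')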

set92 commented 6 years ago

Have you found a fix? Maybe model.build_vocab(sentences, tokenize=True) helps?

EDIT: I have the same error. I tried building the vocab from the sentences and then calling model.update_vocab('</s>'), but it didn't work, so I'm not sure what to do.

nlothian commented 6 years ago

I had this error using FastText vectors too.

I did this, but it isn't a great fix:

        for i in range(len(batch)):
            for j in range(len(batch[i])):
                # this next line here is my change
                if batch[i][j] != self.eos:
                    embed[j, i, :] = self.word_vec[batch[i][j]]

Priya22 commented 6 years ago

The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.
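
(A minimal sketch of the larger-vocabulary route, assuming the README's build_vocab_k_words API; whether it actually pulls in </s> depends on the embedding file:)

    # load the K most frequent words from the embedding file instead of only the demo vocabulary
    model.build_vocab_k_words(K=100000)
    embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)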

nlothian commented 6 years ago

> The error doesn't appear if I use the exact GloVe vectors as specified in the Readme - glove.840B.300d.txt. Probably </s> is missing in the others; maybe increasing the vocab size would help.

It happens when a (real) word isn't found in your embedding. Then you end up with </s>, which should never be in any embedding dictionary (it's supposed to be a non-word token).

What should probably happen is that a missing word gets an average or random vector. In my "fix" it is left as a zero vector (which is kind of OK too).
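
A minimal sketch of that idea, as an alternative change to the same loop in get_batch (assuming numpy is available as np in models.py and that self.word_emb_dim holds the embedding dimensionality):

        # fall back to a random vector for words missing from self.word_vec
        for i in range(len(batch)):
            for j in range(len(batch[i])):
                vec = self.word_vec.get(batch[i][j])
                if vec is None:
                    # unseen word (or '</s>'): use a small random vector instead of raising a KeyError
                    vec = np.random.normal(scale=0.1, size=self.word_emb_dim)
                embed[j, i, :] = vec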

Priya22 commented 6 years ago

@nlothian Ah I see. Thanks!

aconneau commented 6 years ago

Hi,

In demo.ipynb, if you use infersent1.pkl, please use the standard GloVe vectors (the LSTM was trained with those), and use the fastText common-crawl vectors for infersent2.pkl. It is also important to specify the version in "params_model": "1" for infersent1.pkl and "2" for infersent2.pkl.

Thanks,
Alexis
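
For reference, the corresponding loading code would look roughly like this (a sketch following the README; exact method names and paths may differ between versions):

    import torch
    from models import InferSent

    V = 2  # 1 for infersent1.pkl (GloVe), 2 for infersent2.pkl (fastText)
    MODEL_PATH = 'encoder/infersent%s.pkl' % V
    W2V_PATH = 'GloVe/glove.840B.300d.txt' if V == 1 else 'fastText/crawl-300d-2M.vec'

    params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                    'pool_type': 'max', 'dpout_model': 0.0, 'version': V}
    model = InferSent(params_model)
    model.load_state_dict(torch.load(MODEL_PATH))
    model.set_w2v_path(W2V_PATH)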

iamkissg commented 5 years ago

Hi all,

This issue is due to the prepare_samples method (lines 193-201 of models.py), as follows:

        # filters words without w2v vectors
        for i in range(len(sentences)):
            s_f = [word for word in sentences[i] if word in self.word_vec]
            if not s_f:
                import warnings
                warnings.warn('No words in "%s" (idx=%s) have w2v vectors. \
                               Replacing by "</s>"..' % (sentences[i], i))
                s_f = [self.eos]
            sentences[i] = s_f

The end-of-sentence token </s> is used to represent the whole sentence when none of the tokens in the sentence has a corresponding vector.

As @nlothian suggests, the simplest solution might be to set the vector for the EOS token to the mean or a zero vector. Here, I add the following short snippet to the get_w2v method, just before it returns word_vec:

        # if the EOS token did not get a vector from the embedding file,
        # map it to the mean of all loaded vectors
        if self.eos not in word_vec:
            word_vec[self.eos] = np.mean(np.stack(list(word_vec.values()), axis=0), axis=0)

It should also work for fastText, i.e. version 2.
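
With that patch applied, a quick sanity check might look like this (a sketch; sentences is whatever list you are encoding):

    model.build_vocab(sentences, tokenize=True)
    # '</s>' should now map to the mean of all loaded vectors instead of raising a KeyError
    assert model.eos in model.word_vec
    embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)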