facebookresearch / InferSent

InferSent sentence embeddings

ValueError when encoding #40

Closed briandw closed 6 years ago

briandw commented 6 years ago

I'm running the encoder/demo.ipynb notebook with Python 2.7 and PyTorch 0.1.12_1. When running the line

embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)

I get the following error:

Nb words kept : 129333/130068 (99.43 %)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-c3081b78b915> in <module>()
----> 1 embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
      2 print('nb sentences encoded : {0}'.format(len(embeddings)))

/home/brian/InferSent/encoder/models.pyc in encode(self, sentences, bsize, tokenize, verbose)
    207                 (batch, lengths[stidx:stidx + bsize])).data.cpu().numpy()
    208             embeddings.append(batch)
--> 209         embeddings = np.vstack(embeddings)
    210 
    211         # unsort

/home/brian/anaconda3/envs/py2/lib/python2.7/site-packages/numpy/core/shape_base.pyc in vstack(tup)
    235 
    236     """
--> 237     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    238 
    239 def hstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly

Not sure if this is related, but loading the model produces a warning:

# make sure models.py is in the working directory
model = torch.load('infersent.allnli.pickle')
/home/brian/anaconda3/envs/py2/lib/python2.7/site-packages/torch/serialization.py:284: SourceChangeWarning: source code of class 'torch.nn.modules.rnn.LSTM' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
aconneau commented 6 years ago

The error message you're seeing comes from this:

import numpy as np
embeddings = [np.zeros((64, 4096)), np.zeros((64, 4096))]
embeddings = np.vstack(embeddings) # no error
embeddings = [np.zeros((64, 4096)), np.zeros((64, 4096)), np.zeros((64, 3412))]
embeddings = np.vstack(embeddings) # error
# -> ValueError: all the input array dimensions except for the concatenation axis must match exactly

For some reason, one of the elements in "embeddings" is not of size (batch_size=128, emb_dim=4096). So there must be one or more elements with a size different from (128, 4096).

1) Just before the error in line 209, could you print the shape of each element in embeddings?

for batch in embeddings:
    print(batch.shape)

to see if we can spot the element with the wrong size.

2) What is in "sentences"? Can you check that you don't have an empty sentence? (A quick check is sketched below, after these questions.)

3) What is the length of "sentences"?

4) Could you update PyTorch to a more recent version and see if you still have the issue?
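
A quick way to check 2) and 3) (a sketch; "sentences" here is whatever list you pass to encode):

# count the sentences and flag any that are empty after stripping whitespace
print('nb sentences:', len(sentences))
empty = [i for i, s in enumerate(sentences) if not s.strip()]
print('empty sentences at indices:', empty)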

briandw commented 6 years ago

Thanks for the quick response.

I believe I'm on the latest PyTorch version, 0.1.12_1. Is there a later version?

The length of sentences is 9815, and there are no zero-length sentences in the array.

This is the output from just before line 209:

Nb words kept : 128201/130068 (98.56 %)
(1, 64, 4096)
(1, 64, 4096)
(1, 64, 4096)
[... (1, 64, 4096) repeated, 153 such batches in total ...]
(1, 23, 4096)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-3d88dd6254e6> in <module>()
      1 tmp = sentences[:128]
----> 2 model.encode(sentences, tokenize=False, verbose=True)

/home/brian/InferSent/encoder/models.py in encode(self, sentences, bsize, tokenize, verbose)
    210         for batch in embeddings:
    211             print(batch.shape)
--> 212         embeddings = np.vstack(embeddings)
    213 
    214         # unsort

/home/brian/anaconda3/envs/py2/lib/python2.7/site-packages/numpy/core/shape_base.pyc in vstack(tup)
    235 
    236     """
--> 237     return _nx.concatenate([atleast_2d(_m) for _m in tup], 0)
    238 
    239 def hstack(tup):

ValueError: all the input array dimensions except for the concatenation axis must match exactly
aconneau commented 6 years ago

Oh OK, I get it. Can you try changing this line in models.py (https://github.com/facebookresearch/InferSent/blob/master/encoder/models.py#L67):

emb = torch.max(sent_output, 0)[0]

into:

emb = torch.max(sent_output, 0)[0].squeeze(0)

and see if this works then?
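
For reference, .squeeze(0) only removes the leading dimension when it has size 1, so the change should also be harmless on newer PyTorch where torch.max already drops that dimension. A quick illustration (not code from the repo):

import torch

old_style = torch.zeros(1, 64, 4096)  # shape old torch.max produced
new_style = torch.zeros(64, 4096)     # shape recent torch.max produces

print(old_style.squeeze(0).shape)  # torch.Size([64, 4096]) - dim removed
print(new_style.squeeze(0).shape)  # torch.Size([64, 4096]) - unchanged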

briandw commented 6 years ago

That's working now. Thanks! I wonder why this didn't show up before?

aconneau commented 6 years ago

@briandw So this is an issue linked to a change of behavior in PyTorch reduction functions such as max, mean, sum, etc.

Say you have a tensor of size (23, 128, 4096). If you take torch.max (or torch.mean, ...) over the first dimension, then you get a tensor of size:

(128, 4096) for recent versions of PyTorch
(1, 128, 4096) for old versions of PyTorch

So it means your version of PyTorch is too old. I will update the requirements section in the README and add an exception in models.py to handle this case.
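
A small illustration of the difference (a sketch, not code from the repo; keepdim=True on recent PyTorch reproduces the old default):

import torch

sent_output = torch.zeros(23, 128, 4096)

# recent PyTorch drops the reduced dimension by default:
print(torch.max(sent_output, 0)[0].shape)                # torch.Size([128, 4096])

# keepdim=True keeps it, matching what old versions returned by default:
print(torch.max(sent_output, 0, keepdim=True)[0].shape)  # torch.Size([1, 128, 4096])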

Thanks

aconneau commented 6 years ago

https://github.com/facebookresearch/InferSent/commit/4b7f9ec7192fc0eed02bc890a56612efc1fb1147

Pragtisood commented 1 year ago

Getting this error on

embeddings = infersent.encode(sentences, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (9815,) + inhomogeneous part.
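
For what it's worth, that message is how recent NumPy (>= 1.24) reports ragged inputs, so this looks like the same underlying shape mismatch as the original issue, surfacing through a newer NumPy. A minimal reproduction sketch (illustrative sizes):

import numpy as np

# NumPy >= 1.24 refuses to build an array from mismatched-shape pieces
# and raises the "inhomogeneous shape" ValueError:
np.array([np.zeros(4096), np.zeros(3412)])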