facebookresearch / InferSent

InferSent sentence embeddings

inconsistent sentences length causes encoding failure #155

Open yilil opened 11 months ago

yilil commented 11 months ago

Problem Description: I tried to run the demo but encountered the following error:

sentences = np.array(sentences)[idx_sort]
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (240,) + inhomogeneous part.

This error is thrown at the second-to-last line of the prepare_sample() function, a helper invoked when encoding sentences. The `(240,)` indicates that I have 240 sentences to encode.

The problem lies in that second-to-last line, sentences = np.array(sentences)[idx_sort]

The input sentences can have varying lengths. After the preceding operations (tokenisation, filtering, etc.), sentences may look like:

[
  ['<s>',  'token1', 'token2', '</s>'],
  ['<s>',  'token1', 'token2', 'token3', 'token4',  '</s>'],
  ['<s>',  'token1', 'token2', 'token3', '</s>']
]

Converting this list of lists into a NumPy array can fail because the inner lists representing the sentences have different lengths, as shown above.

I'm using numpy 1.25.2. My hypothesis is that the code was developed against an older version of numpy, which handled this case implicitly (building an object array and emitting only a deprecation warning) instead of raising an error.
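The failure is easy to reproduce outside InferSent; a minimal sketch (the token values are placeholders, not the repository's actual tokens):

```python
import numpy as np

# Ragged token lists, analogous to tokenised sentences of different lengths
ragged = [
    ['<s>', 'a', '</s>'],
    ['<s>', 'a', 'b', '</s>'],
]

try:
    np.array(ragged)  # NumPy >= 1.24: ValueError (inhomogeneous shape)
    raised = False
except ValueError:
    raised = True

print(raised)
```

On NumPy versions before 1.24 the same call only emits a `VisibleDeprecationWarning` and silently builds an object array, which would explain why the demo worked with older versions.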

Solutions: There are two possible fixes:

  1. Pad the tokenised sentence lists (after sorting, but before the numpy array conversion) to make them have equal length
  2. Change the second last line to sentences = np.array(sentences, dtype=object)[idx_sort]

Though the latter approach is a much simpler fix (it forces numpy to treat each inner list as an object, so variable lengths are allowed), it can cause computational inefficiency, especially if we plan to do mathematical operations on the array.
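Both fixes can be sketched as follows (the `'<p>'` pad token and the descending-length sort are illustrative assumptions, not taken from the repository):

```python
import numpy as np

# Tokenised sentences of varying length (placeholder tokens)
sentences = [
    ['<s>', 'a', '</s>'],
    ['<s>', 'a', 'b', '</s>'],
    ['<s>', 'a', 'b', 'c', '</s>'],
]
lengths = np.array([len(s) for s in sentences])
idx_sort = np.argsort(-lengths)  # longest sentence first

# Fix 2: object dtype lets NumPy store the ragged rows as Python lists
sorted_obj = np.array(sentences, dtype=object)[idx_sort]

# Fix 1: pad to a common length first, so a regular 2-D array is possible
# ('<p>' is a hypothetical pad token, not InferSent's actual one)
max_len = lengths.max()
padded = [s + ['<p>'] * (max_len - len(s)) for s in sentences]
sorted_padded = np.array(padded)[idx_sort]

print(sorted_padded.shape)  # (3, 5)
```

The padded version keeps a regular 2-D array, which is friendlier to vectorised operations downstream, at the cost of extra memory for the padding.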