Large number of zero vectors

shirish93 commented 8 years ago

Hello,

This could be a problem with my processing, but it appears that 617,129 of the 665,494 English vectors are zero vectors: each has a label, but all of its components are zero (i.e., there are only 48,365 non-zero vectors for English). I discovered this with the 300-dimensional dataset. Might this be an issue with the uploaded dataset, or should I recheck my methodology? If you can confirm, using the dataset available for download, that the issue is not on your side, I can work on fixing it on mine.

For reference, this is the code I used to count empty vectors:

import numpy as np

# englishVectors: the 300-dimensional vectors for the English terms
empty = np.zeros(300)
count = 0
for each in englishVectors:
    if np.array_equal(each, empty):
        count += 1
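The same count can be done without a Python loop, which also makes it easy to see which rows are zero. This is only a minimal sketch: the helper name `count_zero_rows` and the small `demo` matrix are made up for illustration, and it assumes the vectors are stacked into a 2-D NumPy array with one row per term.

```python
import numpy as np

def count_zero_rows(matrix):
    """Return the number of all-zero rows and their row indices."""
    matrix = np.asarray(matrix)
    # A row is a zero vector when none of its components is non-zero.
    zero_mask = ~matrix.any(axis=1)
    return int(zero_mask.sum()), np.flatnonzero(zero_mask)

# Hypothetical toy data standing in for the real English vectors.
demo = np.array([[0.0, 0.0, 0.0],
                 [0.1, 0.0, -0.2],
                 [0.0, 0.0, 0.0]])
n_zero, zero_idx = count_zero_rows(demo)
print(n_zero, zero_idx.tolist())
```

The returned indices can be looked up in the label list to check whether the zero rows correspond to the terms expected to be empty.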

I discovered this while trying to figure out the words closest to semi-common words.

For reference, using your code for 'most similar', the words that seem to be representative of the 'zero vectors' are the following:

['adddresse', 'rudat', 'barhydt', 'weeked', 'inovonics', 'alleppey', 'katten', 'georgievski', 'kopinski', 'waxwing', 'irin_plusnews']
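A zero vector has no direction, so its cosine similarity to anything is undefined, and it can turn up in nearest-neighbour results in misleading ways. The sketch below shows one way to leave zero-norm rows out of a similarity ranking. It is not the repository's 'most similar' code: the function `most_similar`, its parameters, and the toy data are assumptions made for illustration, and the query is assumed to be non-zero.

```python
import numpy as np

def most_similar(query, matrix, labels, k=3):
    """Rank labels by cosine similarity to query, skipping zero-norm rows."""
    matrix = np.asarray(matrix, dtype=float)
    query = np.asarray(query, dtype=float)
    norms = np.linalg.norm(matrix, axis=1)
    valid = norms > 0  # rows that are not zero vectors
    # Zero rows get -inf so they sort last and are never divided by zero.
    sims = np.full(len(matrix), -np.inf)
    sims[valid] = matrix[valid] @ query / (norms[valid] * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]
    return [labels[i] for i in order if valid[i]]

# Hypothetical toy vocabulary; "empty" plays the role of a zero vector.
labels = ["cat", "dog", "empty", "car"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 0.0], [0.0, 1.0]])
result = most_similar(np.array([1.0, 0.0]), vectors, labels, k=3)
print(result)
```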

rspeer commented 8 years ago

It might be an issue with the version of the dataset I uploaded. I'll check.

rspeer commented 8 years ago

I just re-downloaded the 600d dataset and, while there are zero-vectors, there are only 5882 of them, which is identical to the number of zero vectors in the version I evaluated for the paper.

This narrows it down: either I made a mistake truncating the 600d vectors to 300d, and you downloaded the 300d version; or you made a mistake in post-processing the data. Can you tell me more specifically what you did?

rspeer commented 8 years ago

Confirmed that the 300d version, as uploaded, has the same 5882 zero-vectors. The error is in something you did with the data, I'd say.

shirish93 commented 8 years ago

Thanks, I'll work it out!

Thanks for the dataset also! It's extremely interesting to play around with it!

commonsense / conceptnet-numberbatch

Large number of zero vectors #32