shirish93 closed this issue 8 years ago
It might be an issue with the version of the dataset I uploaded. I'll check.
I just re-downloaded the 600d dataset and, while there are zero-vectors, there are only 5882 of them, which is identical to the number of zero vectors in the version I evaluated for the paper.
This narrows it down: either I made a mistake truncating the 600d vectors to 300d, and you downloaded the 300d version; or you made a mistake in post-processing the data. Can you tell me more specifically what you did?
Confirmed that the 300d version, as uploaded, has the same 5882 zero-vectors. The error is in something you did with the data, I'd say.
Thanks, I'll work it out!
Thanks for the dataset also! It's extremely interesting to play around with it!
Hello,
This could be a problem with my processing, but it appears that 617,129 of the 665,494 English vectors are zero vectors: they have a label but all of their components are zero (i.e., there are only 48,365 non-zero vectors for English). I discovered this with the 300d dataset. Might this be an issue with the uploaded dataset, or should I recheck my methodology? If you could confirm, using the dataset available for download, that the issue is not on your side, I can work on fixing it on mine.
For reference, this is the code I used to count empty vectors:
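The snippet below is not the code originally posted; it is a minimal sketch of one way to do such a count, assuming the vectors have already been loaded into a NumPy array with one row per word. The function name `count_zero_vectors` and the toy array are hypothetical, for illustration only.

```python
import numpy as np


def count_zero_vectors(vectors: np.ndarray) -> int:
    """Return the number of rows whose components are all exactly zero."""
    # vectors.any(axis=1) is True for rows with at least one non-zero
    # component, so its negation marks the all-zero rows.
    return int((~vectors.any(axis=1)).sum())


# Toy data standing in for the real embeddings (not taken from the dataset).
demo = np.array([
    [0.0, 0.0, 0.0],   # zero vector
    [0.1, 0.0, -0.2],  # non-zero
    [0.0, 0.0, 0.0],   # zero vector
    [0.0, 1e-9, 0.0],  # tiny but non-zero, so not counted
])

zero_count = count_zero_vectors(demo)
print(zero_count, "of", len(demo), "vectors are all-zero")
```

The comparison is exact, so the count depends on how the file was parsed. One thing worth checking when the count looks too high is the dtype: casting to an integer type truncates every component in (-1, 1) to 0.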
I discovered this while trying to figure out the words closest to semi-common words.
For reference, using your code for 'most similar', the words that seem to be representative of the 'zero vectors' are the following: