Closed by bernatfp 8 years ago
Ah, this is true. GloVe's and word2vec's text formats are not exactly the same.
I haven't used GenSim much, but I'm interested in looking into making it convenient to load ConceptNet Numberbatch in GenSim, especially as a drop-in replacement for word2vec. So I'll upload files with the header line.
Hm. One thing that would be inconvenient about adding a header line is that then you couldn't concatenate en_main and en_extra together if you want to. And getting the number of lines is going to require an extra scan through the data, either when writing or when reading, and it seems easier when reading.
I now think the way to avoid confusion and off-by-one errors will be to leave the files as they are, document them appropriately as being in GloVe format and not quite word2vec format, and look into a mechanism for loading them in GenSim.
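As a rough illustration of why the headerless layout is easy to handle at read time, here is a minimal sketch of a reader that accepts several GloVe-format files in sequence and infers the dimensionality from the first row it sees. The function name `load_headerless_vectors` and the file paths are hypothetical, not part of this repository, and the sketch assumes space-separated UTF-8 text with one term per line.

```python
from typing import Dict, Iterable, List, Tuple


def load_headerless_vectors(paths: Iterable[str]) -> Tuple[Dict[str, List[float]], int]:
    """Read one or more headerless (GloVe-style) text files into a dict.

    Hypothetical helper: each line is assumed to be
    "<term> <v1> <v2> ... <vN>", separated by single spaces.
    """
    vectors: Dict[str, List[float]] = {}
    dim = None  # inferred from the first vector line encountered
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line_no, line in enumerate(f, start=1):
                parts = line.rstrip().split(" ")
                if len(parts) < 2:
                    continue  # skip blank or malformed lines
                term, values = parts[0], parts[1:]
                if dim is None:
                    dim = len(values)
                elif len(values) != dim:
                    raise ValueError(
                        f"{path}:{line_no}: expected {dim} values, got {len(values)}"
                    )
                vectors[term] = [float(v) for v in values]
    return vectors, (dim or 0)
```

Because no file carries a count on its first line, passing two paths here behaves the same as reading their concatenation, which is the property the header line would break.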
Okay, the documentation is fixed, and I'll look into better GenSim integration in the future.
It looks like GloVe and Word2Vec have slightly different formats for their files, so I think it's a bit confusing to say that the models here are in the same format?
I noticed this when trying to load these embeddings into Gensim. Apparently the same problem exists with GloVe, and this repository offers a solution that also works for the ConceptNet embeddings: https://github.com/manasRK/glove-gensim
Basically, the first line needs to indicate the number of word embeddings in the file and the number of dimensions of the vectors. I think it'd be a good idea to at least mention this in the README.
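For anyone who hits the same problem, the conversion described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the function name `add_word2vec_header` and the paths are made up, and the input is assumed to be space-separated UTF-8 text with one term per line. It scans the file once to count the rows and read the dimensionality, then writes a copy with the `<count> <dimensions>` line first.

```python
def _is_vector_line(line: str) -> bool:
    """A usable row has a term followed by at least one value."""
    return len(line.split()) >= 2


def add_word2vec_header(src_path: str, dst_path: str) -> tuple:
    """Copy a headerless (GloVe-style) text file, prepending "<count> <dim>".

    Hypothetical helper, not part of this repository or of Gensim.
    Returns the (count, dim) pair that was written as the header.
    """
    count = 0
    dim = 0
    # First pass: count the rows and take the dimension from the first row.
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            if not _is_vector_line(line):
                continue
            if count == 0:
                dim = len(line.rstrip().split(" ")) - 1
            count += 1
    # Second pass: write the header, then the same rows that were counted.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            if _is_vector_line(line):
                dst.write(line if line.endswith("\n") else line + "\n")
    return count, dim
```

The resulting file should then be loadable with Gensim's `KeyedVectors.load_word2vec_format(dst_path, binary=False)`, though I have only reasoned about that, not tested it against these particular files.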