commonsense / conceptnet-numberbatch


Format different from Word2Vec's format? #36

Closed bernatfp closed 8 years ago

bernatfp commented 8 years ago

It looks like GloVe and Word2Vec use slightly different formats for their text files, so I think it's a bit confusing to say that the models here are in the same format as Word2Vec's.

I noticed this when trying to load these embeddings into Gensim. Apparently the same problem exists with GloVe, and this repository offers a solution that also works for the ConceptNet embeddings: https://github.com/manasRK/glove-gensim

Basically, the first line needs to indicate the number of word embeddings in the file and the number of dimensions of the vectors. I think it'd be a good idea to at least mention this in the README.
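To make the difference concrete, here is a minimal sketch of converting a header-less (GloVe-style) text file into the word2vec text layout by prepending a `count dims` line. The function name `add_word2vec_header` and the file paths are made up for illustration, and the sketch assumes one vector per line with space-separated fields and no spaces inside the word itself.

```python
def add_word2vec_header(src_path, dst_path):
    """Copy a header-less vector file, prepending a 'count dims' line.

    Hypothetical helper for illustration; not part of this repository.
    """
    count = 0
    dims = None
    # First pass: count the vectors and infer the dimensionality
    # from the first non-blank line (fields minus the word itself).
    with open(src_path, encoding="utf-8") as src:
        for line in src:
            parts = line.split()
            if not parts:
                continue  # skip blank lines
            if dims is None:
                dims = len(parts) - 1
            count += 1
    if dims is None:
        raise ValueError("no vectors found in " + src_path)

    # Second pass: write the header, then copy the rows unchanged.
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dims}\n")
        for line in src:
            if line.strip():
                dst.write(line if line.endswith("\n") else line + "\n")
    return count, dims
```

The file written to `dst_path` should then be readable by loaders that expect the word2vec text format, such as Gensim's `KeyedVectors.load_word2vec_format`.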

rspeer commented 8 years ago

Ah, this is true. GloVe's and word2vec's text formats are not exactly the same.

I haven't used Gensim much, but I'm interested in making it convenient to load ConceptNet Numberbatch in Gensim, especially as a drop-in replacement for word2vec. So I'll upload files with the header line.

rspeer commented 8 years ago

Hm. One inconvenience of adding a header line is that you could no longer concatenate en_main and en_extra if you wanted to. Also, getting the number of lines requires an extra scan through the data, either when writing or when reading, and it seems easier to do when reading.
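As a rough sketch of the reading-side approach: when the files have no header, several of them can be read back to back (equivalent to concatenating them), and the vector count simply falls out of the scan. The function name `load_headerless_vectors` is hypothetical, and the sketch assumes space-separated fields with one vector per line.

```python
def load_headerless_vectors(paths):
    """Read one or more header-less vector files into a single dict.

    Hypothetical helper for illustration; returns (vectors, dims),
    where vectors maps each word to a list of floats.
    """
    vectors = {}
    dims = None
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                if not parts:
                    continue  # skip blank lines
                word = parts[0]
                values = [float(x) for x in parts[1:]]
                if dims is None:
                    dims = len(values)  # inferred from the first row
                elif len(values) != dims:
                    raise ValueError(
                        f"{path}: expected {dims} values for {word!r}, "
                        f"got {len(values)}"
                    )
                vectors[word] = values
    return vectors, dims
```

Calling it with the paths of two files, for example local copies of en_main and en_extra, gives the combined vocabulary, and `len(vectors)` is the count that a header line would otherwise have to state up front.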

I now think the way to avoid confusion and off-by-one errors is to leave the files as they are, document them appropriately as being in GloVe format and not quite word2vec format, and look into a mechanism for loading them in Gensim.

rspeer commented 8 years ago

Okay, the documentation is fixed, and I'll look into better Gensim integration in the future.