commonsense / conceptnet-numberbatch


Using the pretrained term vectors #46

Closed dineshbvadhia closed 7 years ago

dineshbvadhia commented 7 years ago

First time using the pretrained term vectors, and I noticed they are distributed as a text file. The word2vec and Google News pretrained vectors can be loaded as a numpy array, which in turn can optionally be read from disk with `mmap_mode`. Given a term, you look up a dictionary or hash table to get the term's index, then extract the term vector from the numpy array using that index. I've used this approach successfully.

Can numberbatch be used in a similar way and if so how?
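The scheme described in the question (a term-to-index dict plus a memory-mapped numpy matrix) can be applied to any word2vec-format text file. A minimal sketch, in which the `convert` helper and all file names are hypothetical, not part of any release:

```python
import numpy as np

def convert(text_path, npy_path):
    """Parse a word2vec-format text file into a {term: row index} dict
    and a dense float32 matrix saved to disk for later memory-mapping.
    Hypothetical helper; both paths are illustrative."""
    index = {}
    rows = []
    with open(text_path, encoding="utf-8") as f:
        f.readline()  # skip the "<num_rows> <num_dims>" header line
        for i, line in enumerate(f):
            parts = line.rstrip().split(" ")
            index[parts[0]] = i
            rows.append(np.asarray(parts[1:], dtype=np.float32))
    np.save(npy_path, np.vstack(rows))
    return index

# On later runs, mmap_mode="r" reads rows from disk on demand
# instead of loading the whole matrix into RAM:
# vectors = np.load("numberbatch.npy", mmap_mode="r")
# vec = vectors[index["cat"]]
```

The one-time conversion cost is the text parse; after that, lookups touch only the rows you index.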

rspeer commented 7 years ago

Certainly, you can reformat the data however you want.

One thing I've found is that it's impractical to maintain downloadable releases of every format someone might need. It's also expensive: each separate download has to remain stored on a server for a long time so that links don't break. So when people want just the vectors, I provide them as the lowest common denominator, the word2vec/fastText format.

If you use the vectors via the conceptnet5 repository, you'll be working with them in the efficient HDF5 format. (You'll also get the benefit of using the ConceptNet graph to extend the vocabulary, which you can't get from the vectors alone.) But there isn't yet a good tutorial on how to work with the data in this form.
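For the HDF5 route, a minimal sketch of reading such a file with pandas, assuming the build produced a single DataFrame whose index holds ConceptNet URIs like `/c/en/cat`; the file name and this exact layout are assumptions, not a documented interface:

```python
import pandas as pd

def load_vectors(h5_path):
    """Load an HDF5 term-vector matrix as a pandas DataFrame.
    Assumed layout: the DataFrame index holds ConceptNet URIs and
    each row is that term's vector. Path is illustrative."""
    return pd.read_hdf(h5_path)

# Usage sketch:
# frame = load_vectors("numberbatch.h5")
# vec = frame.loc["/c/en/cat"]  # one term's vector as a Series
```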

The best place for questions that are not bug reports, by the way, is the Gitter chat: https://gitter.im/commonsense/conceptnet5