explosion / sense2vec

🦆 Contextually-keyed word vectors
https://explosion.ai/blog/sense2vec-reloaded
MIT License

Question about reddit_vectors.bin #76

Closed: dr-slurp closed this issue 5 years ago

dr-slurp commented 5 years ago

Hey!

I'm wondering how reddit_vectors.bin is formatted. I want to build a tool that reads the sense2vec Reddit vectors in Java (the rest of my pipeline is in Java). I'm having trouble decoding the binary, so I'd appreciate any hints on how the vectors are stored in it. Is there a plain-text version of the vectors available?

Thanks in advance

dr-slurp commented 5 years ago

Or if you could point me to the code that actually parses the vectors, I could probably figure it out myself.

Thanks

ines commented 5 years ago

Here's the relevant part that saves out the binary vectors:

https://github.com/explosion/sense2vec/blob/222551d95b4a1fa212fa134665131d46d72b355d/sense2vec/vectors.pyx#L290-L298

However, I've been refactoring the library (see #77) and the vectors are now stored using spaCy's Vectors, which are serialized as numpy arrays. Relevant part of the code is here:

https://github.com/explosion/spaCy/blob/c2f5f9f5727fd854a6147e71ed6625b7c57c4150/spacy/vectors.pyx#L397-L414

This will probably also make it easier to write your loader in Java. All other data (frequency counts, strings, config) will be stored as JSON btw.
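For anyone attempting the Java side, here is a minimal sketch of reading a numpy-serialized array. It assumes the vectors file is in the standard NPY format and holds a 2-D, C-ordered, little-endian float32 array (`<f4`); none of that is confirmed in this thread, so check the header of your own file. The class name `NpyFloatReader` is made up for illustration, and mapping rows back to keys is not covered.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: reads a 2-D little-endian float32 array from an NPY stream. */
public class NpyFloatReader {
    // Every NPY file starts with these six bytes: 0x93 followed by "NUMPY".
    private static final byte[] MAGIC = {(byte) 0x93, 'N', 'U', 'M', 'P', 'Y'};
    private static final Pattern DESCR = Pattern.compile("'descr':\\s*'<f4'");
    private static final Pattern C_ORDER = Pattern.compile("'fortran_order':\\s*False");
    private static final Pattern SHAPE = Pattern.compile("'shape':\\s*\\((\\d+),\\s*(\\d+)\\)");

    public static float[][] read(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);

        byte[] magic = new byte[MAGIC.length];
        data.readFully(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IOException("Not an NPY file (bad magic bytes)");
        }

        int major = data.readUnsignedByte();
        data.readUnsignedByte(); // minor version, not needed here

        // Header length is little-endian: 2 bytes in format 1.x, 4 bytes in 2.x and 3.x.
        byte[] lenBytes = new byte[major == 1 ? 2 : 4];
        data.readFully(lenBytes);
        ByteBuffer lenBuf = ByteBuffer.wrap(lenBytes).order(ByteOrder.LITTLE_ENDIAN);
        int headerLen = (major == 1) ? (lenBuf.getShort() & 0xFFFF) : lenBuf.getInt();

        // The header is a Python dict literal, e.g.
        // {'descr': '<f4', 'fortran_order': False, 'shape': (3, 2), }
        byte[] headerBytes = new byte[headerLen];
        data.readFully(headerBytes);
        String header = new String(headerBytes, StandardCharsets.UTF_8);

        // This sketch only handles the layout it assumes; reject anything else.
        if (!DESCR.matcher(header).find()) {
            throw new IOException("Expected little-endian float32 ('<f4'): " + header);
        }
        if (!C_ORDER.matcher(header).find()) {
            throw new IOException("Expected C-ordered data: " + header);
        }
        Matcher shape = SHAPE.matcher(header);
        if (!shape.find()) {
            throw new IOException("Expected a 2-D shape: " + header);
        }
        int rows = Integer.parseInt(shape.group(1));
        int cols = Integer.parseInt(shape.group(2));

        // Read one row at a time so no single buffer has to hold the whole matrix.
        float[][] vectors = new float[rows][cols];
        byte[] rowBytes = new byte[Math.multiplyExact(cols, 4)];
        for (int r = 0; r < rows; r++) {
            data.readFully(rowBytes);
            ByteBuffer.wrap(rowBytes)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .asFloatBuffer()
                    .get(vectors[r]);
        }
        return vectors;
    }
}
```

You would call it with something like `NpyFloatReader.read(new BufferedInputStream(new FileInputStream(path)))`, where `path` points at the serialized array. It throws rather than guessing if the header doesn't match the assumed dtype or layout.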

githubuser100007 commented 3 years ago

I would appreciate the Java solution if you have it.