Open arjun-mani opened 3 years ago
Great catch. And yes, this repo is very poorly benchmarked, so this type of work is much appreciated. If you have a repo demonstrating this difference, I'd love to take a look!
I should have some free time in the coming weeks to make some improvements.
Absolutely, and I really appreciate your responsiveness. I'm a bit busy this week with a deadline (related to this work) but will try to share a repo soon after. A couple of suggestions: add subsampling of frequent words, and use two weight matrices (a separate one each for center and context lookup).
No problem, happy to work towards making this repo a good resource for people.
As for subsampling, it's done in `sgns_loss.py` with respect to a multinomial distribution defined in `utils.py`. I believe I found this method in another source, so it may be worth revisiting the actual implementation.
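(The repo's `utils.py` isn't shown here, but the multinomial distribution being referenced is presumably the standard word2vec negative-sampling distribution: the unigram distribution raised to the 3/4 power. A minimal sketch of that construction, with all names being my own:)

```python
import numpy as np

def negative_sampling_dist(word_counts, power=0.75):
    """Smoothed unigram distribution used for negative sampling in word2vec:
    raise raw counts to the 3/4 power, then renormalize. This boosts the
    probability of rare words relative to the raw unigram distribution."""
    counts = np.asarray(word_counts, dtype=np.float64)
    probs = counts ** power
    return probs / probs.sum()

def sample_negatives(probs, k, rng=None):
    """Draw k negative word indices from the smoothed distribution."""
    rng = rng or np.random.default_rng(0)
    return rng.choice(len(probs), size=k, p=probs)

# Example: counts [100, 10, 1]; smoothing lifts the rarest word's probability
# above its raw unigram share (1/111).
probs = negative_sampling_dist([100, 10, 1])
negs = sample_negatives(probs, 5)
```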
I've seen the two-weight-matrix approach done before and I'm happy to give it a try. Looking forward to making some improvements.
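(For reference, here is a minimal sketch of what the two-matrix setup looks like for skip-gram with negative sampling: one matrix `W_in` for center-word lookup and a separate `W_out` for context lookup, updated with plain SGD on the logistic loss. All names and the initialization scheme are my own assumptions, not the repo's code:)

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 1000, 50

# Separate matrices: W_in for the center word, W_out for context/negative words.
W_in = rng.normal(scale=0.1, size=(vocab, dim))
W_out = np.zeros((vocab, dim))  # a common choice: zero-init the output matrix

def sgns_step(center, context, negatives, lr=0.025):
    """One SGD step of skip-gram with negative sampling using two matrices.
    The positive (context) pair gets label 1, negatives get label 0."""
    v = W_in[center]
    grad_v = np.zeros(dim)
    for idx, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = W_out[idx]
        score = 1.0 / (1.0 + np.exp(-(v @ u)))  # sigmoid of the dot product
        g = (score - label) * lr
        grad_v += g * u          # accumulate gradient for the center vector
        W_out[idx] -= g * v      # update the context/negative vector
    W_in[center] -= grad_v       # apply the center-vector update once at the end

# One step on a toy example: center word 3, context word 7, negatives 5 and 9.
sgns_step(3, 7, [5, 9])
```

At inference time, `W_in` is typically what you keep as the word embeddings.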
I may be mistaken, but I believe the code in `sgns_loss.py` is for negative sampling? What I meant by subsampling is discarding training examples based on the frequency of the center word in the dataset (Sec. 2.3 here: https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf)
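(To make the distinction concrete, a minimal sketch of the subsampling rule from Sec. 2.3 of the linked paper: each occurrence of word w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's relative corpus frequency and t is a threshold around 1e-5. Function names are my own:)

```python
import numpy as np

def keep_prob(freq, t=1e-5):
    """Probability of KEEPING one occurrence of a word with relative
    frequency `freq`, per Mikolov et al. Sec. 2.3: discard with probability
    1 - sqrt(t / f(w)). Words rarer than t are always kept."""
    return np.minimum(1.0, np.sqrt(t / freq))

def subsample(tokens, counts, total, t=1e-5, rng=None):
    """Drop frequent-word occurrences from a token stream before training."""
    rng = rng or np.random.default_rng(0)
    out = []
    for w in tokens:
        f = counts[w] / total
        if rng.random() < keep_prob(f, t):
            out.append(w)
    return out
```

This happens at data-loading time, before (center, context) pairs are formed, which is why it's separate from the negative-sampling loss.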
I've been doing some more work with this repo, and I think it'd be productive to benchmark beyond the given example. For instance, I've started running tests on the wiki8 text corpus (the first 10^8 bytes of Wikipedia): Gensim's implementation reaches ~24% accuracy on word analogies, while I'm only seeing ~5% with this model.
Ideally we wouldn't see a gap this large, so it may be worth doing some testing on larger datasets. I can also share some code to this end.