ddehueck / jax-skip-gram-negative-sampling

A Jax implementation of word2vec's skip-gram model with negative sampling as described in Mikolov et al., 2013
MIT License

Benchmarking on larger datasets (low accuracy) #2

Open arjun-mani opened 3 years ago

arjun-mani commented 3 years ago

I've been doing some more work with this repo, and I think it'd be productive to do some benchmarking beyond the given example. For example, I've started working with the wiki8 text corpus (the first 10^8 bytes of Wikipedia) and running some tests: Gensim's implementation reaches ~24% accuracy on the word-analogy task, while I'm only seeing ~5% with this model.

Ideally we wouldn't see this kind of gap, so maybe it'd be a good idea to do some testing on larger datasets? I can also share some code to this end.
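
(For reference, a minimal sketch of the kind of Gensim baseline being compared against. The corpus path, hyperparameters, and the `questions-words.txt` analogy file are assumptions, and the calls shown assume Gensim 4.x; this is an illustration, not the exact setup used above.)

```python
# Hedged sketch: train Gensim's skip-gram-with-negative-sampling baseline
# and score it on the standard Google analogy set.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")  # placeholder: one tokenized sentence per line

model = Word2Vec(
    sentences,
    vector_size=100,  # embedding dimension
    window=5,         # context window size
    sg=1,             # skip-gram (not CBOW)
    negative=5,       # negative samples per positive pair
    epochs=5,
)

# evaluate_word_analogies returns (overall_accuracy, per_section_results) in Gensim 4.x.
score, sections = model.wv.evaluate_word_analogies("questions-words.txt")
print(f"Analogy accuracy: {score:.3f}")
```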

ddehueck commented 3 years ago

Great catch. And yes, this repo is very poorly benchmarked, so this type of work is very much appreciated. If you have a repo demonstrating the difference I'd love to take a look!

I should have some free time in the coming weeks to make some improvements.

arjun-mani commented 3 years ago

Absolutely, and I really appreciate your responsiveness. I'm a bit busy this week with a deadline (related to this work) but will try to share a repo soon after. A couple of suggestions in the meantime: adding subsampling of frequent words, and using two weight matrices (a separate one each for the context and center lookups).
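
(A minimal JAX sketch of the two-weight-matrix idea, i.e. separate center/input and context/output embedding tables. The function names and shapes are illustrative only, not the repo's actual API.)

```python
import jax
import jax.numpy as jnp

def init_params(key, vocab_size, dim=100):
    """Two separate embedding tables: one for center (input) words,
    one for context/negative (output) words."""
    k1, k2 = jax.random.split(key)
    scale = 0.01
    return {
        "center": scale * jax.random.normal(k1, (vocab_size, dim)),
        "context": scale * jax.random.normal(k2, (vocab_size, dim)),
    }

def sgns_loss(params, center_id, context_id, negative_ids):
    """Skip-gram negative-sampling loss for one (center, context) pair
    with k negative samples."""
    v = params["center"][center_id]          # (dim,) center embedding
    u_pos = params["context"][context_id]    # (dim,) true context embedding
    u_neg = params["context"][negative_ids]  # (k, dim) negative embeddings

    pos = jax.nn.log_sigmoid(jnp.dot(u_pos, v))
    neg = jnp.sum(jax.nn.log_sigmoid(-(u_neg @ v)))
    return -(pos + neg)
```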

ddehueck commented 3 years ago

No problem, happy to work towards making this repo a good resource for people.

As for subsampling, it is done in sgns_loss.py with respect to a multinomial distribution defined in utils.py. I believe I found this method in another source, so it may be worth revisiting the actual implementation.

I've seen the two weight matrices done before and I'm happy to give it a try. Looking forward to making some improvements.
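
(For clarity on the multinomial distribution mentioned above: the standard negative-sampling noise distribution is the unigram distribution raised to the 3/4 power. A minimal sketch, which may or may not match what utils.py currently does:)

```python
import numpy as np

def noise_distribution(word_counts, power=0.75):
    """Unigram counts raised to the 3/4 power and normalized,
    as in Mikolov et al. (2013)."""
    probs = np.asarray(word_counts, dtype=np.float64) ** power
    return probs / probs.sum()

def sample_negatives(rng, probs, k=5):
    """Draw k negative word ids from the multinomial noise distribution."""
    return rng.choice(len(probs), size=k, p=probs)

# Usage with toy counts:
# rng = np.random.default_rng(0)
# negatives = sample_negatives(rng, noise_distribution([10, 3, 1, 7]), k=5)
```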

arjun-mani commented 3 years ago

I may be mistaken, but I believe the code in sgns_loss.py is for negative sampling? What I meant by subsampling is discarding training examples based on the frequency of the center word in the dataset (Sec. 2.3 here: https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf).
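
(For reference, a minimal sketch of that subsampling step: each occurrence of a word w is discarded with probability 1 - sqrt(t / f(w)), where f(w) is the word's fraction of the corpus and t is a threshold around 1e-5, per Sec. 2.3 of the paper linked above. Function names here are illustrative.)

```python
import numpy as np

def keep_probability(word_freq, t=1e-5):
    """Probability of keeping an occurrence of a word whose corpus
    frequency (fraction of tokens) is word_freq (Mikolov et al. 2013, Sec. 2.3)."""
    return np.minimum(1.0, np.sqrt(t / word_freq))

def subsample(rng, word_ids, word_freqs, t=1e-5):
    """Drop frequent-word occurrences from a token-id stream before
    generating (center, context) training pairs."""
    word_ids = np.asarray(word_ids)
    keep = keep_probability(word_freqs[word_ids], t)
    return word_ids[rng.random(word_ids.shape[0]) < keep]
```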