VHRanger / nodevectors

Fastest network node embeddings in the west
MIT License

node2vec uses CBOW instead of skip-gram #40

Open ubalklen opened 3 years ago

ubalklen commented 3 years ago

The original node2vec and DeepWalk proposals are built on the skip-gram model. By default, nodevectors does not set the parameter w2vparams["sg"] to 1, so the underlying Word2Vec model uses its default value of 0, which selects CBOW instead of skip-gram. This has major consequences for the quality of the embeddings.
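To illustrate the pitfall, here is a minimal sketch of how the training mode follows from the keyword arguments. The helper `effective_mode` and the sample dictionary are hypothetical and not part of nodevectors or gensim; the sketch only assumes the documented gensim convention that `sg=1` selects skip-gram and any other value selects CBOW.

```python
def effective_mode(w2vparams):
    """Return the training mode Word2Vec would pick for these keyword arguments.

    Hypothetical helper for illustration only.
    """
    # gensim treats sg=1 as skip-gram and anything else as CBOW;
    # a missing "sg" key falls back to gensim's default of 0.
    return "skip-gram" if w2vparams.get("sg", 0) == 1 else "CBOW"


# Hypothetical custom settings where "sg" was never set.
custom = {"window": 10, "negative": 5}

print(effective_mode(custom))               # CBOW
print(effective_mode({**custom, "sg": 1}))  # skip-gram
```

Adding `"sg": 1` to the dictionary passed as w2vparams should be enough to get skip-gram until the default changes.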

VHRanger commented 3 years ago

Thanks, can you confirm empirically that 1 is better than 0?

If so, I'll change the default along with other updates this week.

ubalklen commented 3 years ago

It is for my graphs, but I'm not sure this is always the case. There is some discussion about which one is better, but only in the context of NLP. I couldn't find anyone discussing it in the context of graph embeddings.

Anyway, I suggest that you not only use 1 as the default, but also force w2vparams["sg"] to 1 instead of leaving this decision to the programmer, or make it a separate parameter in the same way you did with w2vparams["workers"]. The reason is that it is very easy for programmers to forget to set this parameter when they want to customize other w2v parameters (this is exactly how I stumbled upon this). And node2vec was built explicitly with skip-gram in mind.
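A rough sketch of what that suggestion could look like on the library side is shown below. The function name `resolve_w2v_params`, its arguments, and the `force_skipgram` flag are all hypothetical; this is not the actual nodevectors code, only one way to keep "sg" from being dropped when other parameters are customized.

```python
def resolve_w2v_params(user_params=None, workers=1, force_skipgram=True):
    """Merge user-supplied Word2Vec parameters with library-controlled ones.

    Hypothetical helper: names and defaults are assumptions for illustration.
    """
    # Start from skip-gram so an omitted "sg" never silently means CBOW.
    params = {"sg": 1}
    params.update(user_params or {})

    # Optionally override whatever the caller passed for "sg".
    if force_skipgram:
        params["sg"] = 1

    # "workers" is controlled separately, as the thread describes.
    params["workers"] = workers
    return params


print(resolve_w2v_params({"window": 5}, workers=4))
# {'sg': 1, 'window': 5, 'workers': 4}
```

With `force_skipgram=False`, the helper still defaults to skip-gram but lets a caller opt into CBOW explicitly by passing `"sg": 0`.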

rn123 commented 3 years ago

Check the reference below for skip-gram vs CBOW. The quality difference between the two seems to be the fault of a longstanding implementation error in the original word2vec and Gensim implementations.

İrsoy, Ozan, Adrian Benton, and Karl Stratos. “Koan: A Corrected CBOW Implementation.” ArXiv:2012.15332 [Cs, Stat], December 30, 2020. http://arxiv.org/abs/2012.15332.

gojomo commented 2 years ago

Just passing through, noticed this issue in the course of answering someone's Node2Vec/Word2Vec interaction question, thought I'd mention: