Open amoussawi opened 5 years ago
Hi! I noticed the same issue as well. Have you tried normalising the gradient of the input layer? Do the results differ much?
We need @Celebio's input here, because that part of the code was intentionally written that way and doesn't look like a bug.
Not normalising the gradient on the input layer acts as a multiplier on the learning rate for the input layer; not sure if this is the reason.
Hmm, my concern is that because the hidden layer is the average of the input ngram vectors, when the error backpropagates to the input ngrams, the gradient should also be averaged over the ngrams. This is slightly different from word2vec.
Correct regarding weighting the gradient, but the word2vec implementation is the same if you check it, which is odd. It's likely a bug in the original word2vec implementation, but not here, since here it was intentionally left out.
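To make the learning-rate-multiplier point concrete, here is a minimal NumPy sketch of a single CBOW-style update (a simplified stand-in for the actual fastText C++ code; negative sampling and the output-layer update are omitted). All variable names here are illustrative, not taken from the codebase. It shows that skipping the division by the number of input tokens makes the input-layer update exactly `n` times larger, i.e. an effective learning-rate multiplier of `n`:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 4
emb_in = rng.normal(size=(vocab, dim))   # input (word/ngram) vectors
emb_out = rng.normal(size=(vocab, dim))  # output vectors

context = [1, 3, 5]   # indices of the context words/ngrams
target = 7
n = len(context)

# Forward pass: the hidden layer is the AVERAGE of the input vectors.
hidden = emb_in[context].mean(axis=0)

# Binary logistic loss against the target (positive example only).
score = 1.0 / (1.0 + np.exp(-hidden @ emb_out[target]))
lr = 0.05
g = lr * (1.0 - score)           # scalar gradient factor
grad_hidden = g * emb_out[target]

# What fastText/word2vec do: every input vector receives the full gradient.
update_unnormalised = grad_hidden

# What this issue proposes: divide by n, since each input vector
# contributed only 1/n to the hidden layer in the forward pass.
update_normalised = grad_hidden / n

# Skipping the normalisation scales the input-layer update by n,
# i.e. it multiplies the effective learning rate on that layer by n.
assert np.allclose(update_unnormalised, n * update_normalised)
```

So for a fixed learning rate the two variants differ only by a context-size-dependent scale on the input layer, which is consistent with the observation above that the unnormalised version can still train, just less stably.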
Maybe this is relevant to the discussion - https://arxiv.org/abs/2012.15332
Hey! I see that the gradient of the input layer is not normalised by the number of input tokens when training CBOW (and likewise for skipgram when ngrams are used). Is there a reason behind this? (I noticed the same in the original word2vec implementation of CBOW.)
I observed more stable training when normalising the gradient than when not.