Open amoussawi opened 5 years ago
Hi! I noticed the same issue as well. Have you tried normalising the gradient of the input layer? Do the results differ much?
We need @Celebio's input here, because that part of the code was intentionally written that way and doesn't look like a bug.
Not normalising the gradient on the input layer acts as a multiplier on the learning rate for the input layer; not sure if this is the reason.
Hmm, my concern is that because the hidden layer is the average of the input ngram vectors, when the error backpropagates to the input ngrams, the gradient should also be averaged over the ngrams. This is slightly different from word2vec.
Correct regarding weighting the gradient, but the word2vec implementation is the same if you check it, which is odd. It's likely a bug in the original word2vec implementation, but not here, since here it was intentionally left out.
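To make the learning-rate-multiplier point concrete, here is a minimal NumPy sketch of a single CBOW-style update (a simplified stand-in for the actual fastText C++ code; negative sampling and the output-layer update are omitted). All variable names here are illustrative, not taken from the codebase. It shows that skipping the division by the number of input tokens makes the input-layer update exactly `n` times larger, i.e. an effective learning-rate multiplier of `n`:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 4
emb_in = rng.normal(size=(vocab, dim))   # input (word/ngram) vectors
emb_out = rng.normal(size=(vocab, dim))  # output vectors

context = [1, 3, 5]   # indices of the context words/ngrams
target = 7
n = len(context)

# Forward pass: the hidden layer is the AVERAGE of the input vectors.
hidden = emb_in[context].mean(axis=0)

# Binary logistic loss against the target (positive example only).
score = 1.0 / (1.0 + np.exp(-hidden @ emb_out[target]))
lr = 0.05
g = lr * (1.0 - score)           # scalar gradient factor
grad_hidden = g * emb_out[target]

# What fastText/word2vec do: every input vector receives the full gradient.
update_unnormalised = grad_hidden

# What this issue proposes: divide by n, since each input vector
# contributed only 1/n to the hidden layer in the forward pass.
update_normalised = grad_hidden / n

# Skipping the normalisation scales the input-layer update by n,
# i.e. it multiplies the effective learning rate on that layer by n.
assert np.allclose(update_unnormalised, n * update_normalised)
```

So for a fixed learning rate the two variants differ only by a context-size-dependent scale on the input layer, which is consistent with the observation above that the unnormalised version can still train, just less stably.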
Maybe this is relevant to the discussion - https://arxiv.org/abs/2012.15332
Hey! I see that the gradient of the input layer is not normalised by the number of input tokens when training CBOW (and likewise for skipgram when ngrams are used). Is there a reason behind this? (I noticed the same in the original word2vec implementation of CBOW.)
I observed more stable training when normalising the gradient than when not.