brotherofken closed this issue 4 years ago
Hi, @brotherofken ,
Thanks for your interest. For Eq. 19 in the paper, h will work automatically if you set N and M to the number of negatives paired with each positive and the dataset size, respectively. h is close to 1 at the beginning, but it is adjusted very quickly as training proceeds. This is how NCE works.
Angular similarity might also work, but it loses the spirit of the posterior probability in NCE.
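To make the role of h in the loss concrete, here is a minimal NumPy sketch. It assumes h has the form exp(s/τ) / (exp(s/τ) + N/M), with s the cosine similarity between L2-normalized embeddings; that form is inferred from this thread and not copied from Eq. 19, so check it against the paper. The function names and all numeric values are hypothetical.

```python
import numpy as np


def nce_posterior(similarity, temperature, num_negatives, dataset_size):
    """Assumed form of h: exp(s / tau) / (exp(s / tau) + N / M)."""
    score = np.exp(np.asarray(similarity, dtype=float) / temperature)
    return score / (score + num_negatives / dataset_size)


def nce_loss(pos_similarity, neg_similarities, temperature, num_negatives, dataset_size):
    """Binary NCE loss: -log h(positive) - sum over negatives of log(1 - h(negative))."""
    ratio = num_negatives / dataset_size
    pos_score = np.exp(pos_similarity / temperature)
    neg_scores = np.exp(np.asarray(neg_similarities, dtype=float) / temperature)
    # log h = log(score) - log(score + ratio)
    log_h_pos = np.log(pos_score) - np.log(pos_score + ratio)
    # log(1 - h) = log(ratio) - log(score + ratio); avoids computing 1 - h when h is near 1
    log_one_minus_h_neg = np.log(ratio) - np.log(neg_scores + ratio)
    return -(log_h_pos + np.sum(log_one_minus_h_neg))


# Hypothetical setup: N = 4096 negatives per positive, M = 1,280,000 samples.
N, M, tau = 4096, 1_280_000, 0.07
loss_far = nce_loss(0.9, [-0.5, -0.4, -0.6], tau, N, M)   # negatives already pushed away
loss_near = nce_loss(0.9, [0.5, 0.4, 0.6], tau, N, M)     # negatives still close
print(f"loss with distant negatives: {loss_far:.3f}")
print(f"loss with close negatives:   {loss_near:.3f}")
```

Under these assumptions, negatives that are still similar to the anchor have h close to 1 and dominate the loss, which is what drives h down as training proceeds.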
Thanks for the quick response!
I read the paper carefully and found that I had missed that the temperature in (19) has to be quite low (0.02-0.3 in your experiments) to compensate for the small value of the N/M ratio. That is clear now. Thanks!
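A quick numeric check of the temperature point, as a sketch only. It assumes h = exp(s/τ) / (exp(s/τ) + N/M) with cosine similarity s in [-1, 1]; the form is inferred from the discussion and not taken verbatim from the paper, and N, M, and both temperatures are hypothetical values chosen for illustration.

```python
import math


def h(similarity, temperature, ratio):
    """Assumed form of h: exp(s / tau) / (exp(s / tau) + N / M)."""
    score = math.exp(similarity / temperature)
    return score / (score + ratio)


def crossover_similarity(temperature, ratio):
    """Similarity at which h = 0.5, i.e. exp(s / tau) = N / M."""
    return temperature * math.log(ratio)


ratio = 4096 / 1_280_000  # hypothetical N / M = 0.0032

for tau in (1.0, 0.07):
    values = [h(s, tau, ratio) for s in (-1.0, 0.0, 1.0)]
    print(f"tau={tau}: h(-1)={values[0]:.4f}, h(0)={values[1]:.4f}, h(1)={values[2]:.4f}, "
          f"h=0.5 at s={crossover_similarity(tau, ratio):.3f}")
```

With τ = 1 the point where h = 0.5 lies far outside [-1, 1], so h stays near 1 for every pair. With a low τ the crossover moves inside the valid similarity range, so h can separate positives from negatives even when N/M is small.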
Thanks for the great code and paper!
I have a question regarding the form of the h function. I have a huge dataset, so it's impossible to store all embeddings in memory, and I decided to increase the batch size and mine negatives from the batch instead. So far so good, but as I understand it, because the dataset is so large the numerator almost equals the denominator and h approaches 1. Do you think it's a good idea to replace h with an angular similarity between embeddings instead of the ratio proposed in the paper? Or could you kindly propose some other appropriate choice for h?