HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
BSD 2-Clause "Simplified" License

Form of the h function for infinite dataset #13

Closed brotherofken closed 4 years ago

brotherofken commented 4 years ago

Thanks for the great code and paper!

I have a question regarding the form of the h function. My dataset is huge, so it is impossible to store all embeddings in memory; instead, I decided to increase the batch size and mine negatives from within the batch. So far so good, but as I understand it, because the dataset is so large, the numerator almost equals the denominator and h approaches 1.

Do you think that it's a good idea to replace h with an angular similarity between embeddings instead of the ratio proposed in the paper? Or maybe you could kindly propose some other appropriate choice for h?

HobbitLong commented 4 years ago

Hi @brotherofken,

Thanks for your interest. For Eq. 19 in the paper, h will work automatically if you set N to the number of negatives paired with each positive and M to the dataset size. h is close to 1 at the beginning, but it is adjusted very quickly as training proceeds. This is how NCE works.

Angular similarity might also work, but it loses the spirit of the posterior probability in NCE.
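
For readers following along, the critic under discussion can be written out as below. This is a sketch that assumes the form given as Eq. 19 in the CRD paper, where g^T and g^S are the teacher and student embeddings, τ is the temperature, N is the number of negatives per positive, and M is the dataset size; check the symbols against the paper before relying on them.

$$
h(T, S) = \frac{e^{\,g^T(T)^\top g^S(S)/\tau}}{e^{\,g^T(T)^\top g^S(S)/\tau} + \frac{N}{M}}
$$

Under this form, a very large M makes N/M small, so the exponential term dominates the denominator and h sits near 1 until training changes the scores, which matches the behaviour described in both comments above.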

brotherofken commented 4 years ago

Thanks for the quick response!

I reread the paper carefully and found what I had missed: the temperature in (19) has to be quite low (0.02-0.3 in your experiments) to compensate for the small value of the N/M ratio. It is clear now. Thanks!
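
The temperature point can be checked numerically. The following is a minimal sketch, not the repository's implementation: it assumes h has the form e^(s/τ) / (e^(s/τ) + N/M) with the score s being a cosine similarity in [-1, 1] (that is, normalized embeddings), and the values of N and M are hypothetical, chosen only for illustration. The helper names `h` and `h_range` are made up for this example.

```python
import math


def h(score, tau, n_neg, n_data):
    """Assumed critic: e^(score/tau) / (e^(score/tau) + N/M)."""
    e = math.exp(score / tau)
    return e / (e + n_neg / n_data)


def h_range(tau, n_neg, n_data):
    """Smallest and largest h reachable when score lies in [-1, 1]."""
    return h(-1.0, tau, n_neg, n_data), h(1.0, tau, n_neg, n_data)


# Hypothetical sizes: 16384 negatives per positive, one million samples.
N, M = 16384, 1_000_000

for tau in (1.0, 0.1):
    lo, hi = h_range(tau, N, M)
    print(f"tau={tau}: h can range over [{lo:.4f}, {hi:.4f}]")

# Under the same assumptions, the lowest reachable h drops to 1/2 or below
# only when e^(-1/tau) <= N/M, i.e. tau <= 1 / ln(M/N).
print(f"tau threshold: {1.0 / math.log(M / N):.3f}")
```

With τ = 1 the reachable range of h is squeezed close to 1 whatever the embeddings do, while a low τ lets h span almost all of (0, 1). For these hypothetical sizes the threshold falls inside the 0.02-0.3 range mentioned above.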