HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods
BSD 2-Clause "Simplified" License

The setting of Z_v1 and Z_v2 in class ContrastMemory? #5

Closed HaoKun-Li closed 4 years ago

HaoKun-Li commented 4 years ago

Thanks for your great work and great code!

While reading the class "ContrastMemory" in "memory.py", I could not find any explanation of "Z_v1" and "Z_v2" in your arXiv preprint. Why is "out_v1" divided by "Z_v1"? If "outputSize" is large, "out_v1" may become very small. Since "outputSize" differs a lot between datasets, will it change the value of "out_v1" too much, and even affect the performance of the student network?

Looking forward to your reply. @HobbitLong

Here is the related code:

        # set Z if haven't been set yet
        if Z_v1 < 0:
            self.params[2] = out_v1.mean() * outputSize
            Z_v1 = self.params[2].clone().detach().item()
            print("normalization constant Z_v1 is set to {:.1f}".format(Z_v1))
        if Z_v2 < 0:
            self.params[3] = out_v2.mean() * outputSize
            Z_v2 = self.params[3].clone().detach().item()
            print("normalization constant Z_v2 is set to {:.1f}".format(Z_v2))

        # compute out_v1, out_v2
        out_v1 = torch.div(out_v1, Z_v1).contiguous()
        out_v2 = torch.div(out_v2, Z_v2).contiguous()

HobbitLong commented 4 years ago

Hi, @HaoKun-Li ,

The short answer is that these two can be viewed as constants that scale up the dynamic range of the score function.

Typically, NCE can deal with unnormalized distributions and will automatically adjust the score range. In this specific case, the score is produced by the inner product of two l2-normalized vectors, which means its range is [-1, 1]. This range might not be enough for NCE's score adjustment. So here Z_v1 and Z_v2 are very simple Monte Carlo estimates of the partition function of the full softmax (see sec 2.4 in this paper and sec 3.4 in this paper), which help adjust the score range a bit.
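
To make the scaling concrete, here is a minimal pure-Python sketch of the same estimate as in the quoted snippet (the mean of the sampled scores times outputSize). It is an illustration, not the code in memory.py: the helper `estimate_partition`, the temperature of 0.07, the memory size of 50000, and the assumption that each raw score is exp(cosine / T) are all chosen for this example only.

```python
import math
import random


def estimate_partition(scores, output_size):
    """Monte Carlo estimate of Z: mean of the sampled scores times the number of entries.

    Hypothetical helper that mirrors `out_v1.mean() * outputSize` from the snippet.
    """
    return sum(scores) / len(scores) * output_size


random.seed(0)
temperature = 0.07   # assumed value, for illustration only
output_size = 50000  # assumed number of entries in the memory bank

# Inner products of l2-normalized vectors lie in [-1, 1].
cosines = [random.uniform(-1.0, 1.0) for _ in range(4096)]

# Assumed unnormalized score: exp(cosine / T).
scores = [math.exp(c / temperature) for c in cosines]

z = estimate_partition(scores, output_size)
normalized = [s / z for s in scores]

# On the batch used to set Z, the normalized scores average to 1 / output_size,
# so this product is approximately 1.0 whatever output_size is.
mean_normalized = sum(normalized) / len(normalized)
print(mean_normalized * output_size)
```

On the batch that sets Z, the divided scores average exactly 1 / outputSize by construction. A larger outputSize therefore shrinks each value in proportion to the dataset size, not arbitrarily.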

HaoKun-Li commented 4 years ago

Thanks for your reply!

HobbitLong commented 4 years ago

@HaoKun-Li, you're welcome. I just closed the issue, but feel free to reopen it if you would like to discuss this further.