HobbitLong / RepDistiller

[ICLR 2020] Contrastive Representation Distillation (CRD), and benchmark of recent knowledge distillation methods

questions about ContrastMemory #24

Open jianxiangm opened 4 years ago

jianxiangm commented 4 years ago

Hi, according to Eq.19 in the paper, the linear transforms gT and gS are applied to the teacher and student features, respectively, i.e., gT(t) and gS(s).

But in your code, the teacher transform gT is applied to the student feature, gT(s), and the student transform gS is applied to the teacher feature, gS(t):

```python
out_v2 = torch.bmm(weight_v1, v2.view(batchSize, inputSize, 1))
out_v2 = torch.exp(torch.div(out_v2, T))
out_v1 = torch.bmm(weight_v2, v1.view(batchSize, inputSize, 1))
out_v1 = torch.exp(torch.div(out_v1, T))
```

and thus your contrastive loss becomes the sum ContrastLoss(out_v1) + ContrastLoss(out_v2).

I wonder why you did this, instead of computing the output as in Eq.19, i.e., gT(t)·gS(s)/τ, and then a single ContrastLoss(out).
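To be concrete, what I have in mind from Eq.19 is something like the sketch below (just an illustration, not the repo's module: I'm assuming plain linear heads for gT/gS, ignoring L2 normalization and the memory-bank negatives, and the sizes are made up):

```python
import torch
import torch.nn as nn

# Illustrative sizes only.
batch_size, dim_t, dim_s, embed_dim, tau = 8, 256, 128, 128, 0.07

g_t = nn.Linear(dim_t, embed_dim)   # g^T, applied to the teacher feature
g_s = nn.Linear(dim_s, embed_dim)   # g^S, applied to the student feature

t = torch.randn(batch_size, dim_t)  # teacher feature
s = torch.randn(batch_size, dim_s)  # student feature

# Single-sided positive score exp(g^T(t) . g^S(s) / tau),
# which would then be fed to one contrast loss.
score = torch.exp((g_t(t) * g_s(s)).sum(dim=1) / tau)
```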

Thanks.

yassouali commented 4 years ago

I had the same question. As I understand it, ContrastLoss(out_v2) will not have any gradients, given that the teacher is not being trained.

jianxiangm commented 4 years ago

> I had the same question. As I understand it, ContrastLoss(out_v2) will not have any gradients, given that the teacher is not being trained.

The last fc layer (the linear embedding) on the teacher side is being trained, so ContrastLoss(out_v2) does produce gradients for it.
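For example, a minimal sketch (assuming the teacher backbone is frozen while its linear embedding head is in the optimizer, as in the training script; the names and sizes here are illustrative only):

```python
import torch
import torch.nn as nn

embed_dim, dim_t = 128, 256
embed_t = nn.Linear(dim_t, embed_dim)     # teacher-side embedding g^T: trainable
with torch.no_grad():
    t = torch.randn(4, dim_t)             # frozen teacher feature, no grad

out = embed_t(t)                          # still depends on embed_t's weights
loss = out.pow(2).mean()                  # stand-in for ContrastLoss(out_v2)
loss.backward()
print(embed_t.weight.grad is not None)    # True: gradients flow into g^T
```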

yassouali commented 4 years ago

Yes, you are right, thanks.