LTH14 / targeted-supcon

A PyTorch implementation of the paper Targeted Supervised Contrastive Learning for Long-tailed Recognition
MIT License

Question about replacing RIDE (stage 1) #7

Closed WeichengDai1 closed 1 year ago

WeichengDai1 commented 1 year ago

Hello author, thank you for your awesome work! I have found it very interesting. However, I have some questions about combining TSC with RIDE.

  1. In your paper, you mention replacing RIDE's stage 1 with TSC while keeping the second stage of RIDE, but the details are not clear to me. Based on my understanding, RIDE (stage 1) encourages the experts to be diversified, while TSC encourages the classes to be well separated. These ideas seem different, so I wonder how RIDE (stage 1) can be replaced with TSC.

  2. Related to the above, I do have a guess about how to replace RIDE (stage 1) with TSC. My guess is that if k experts (MoE) are used and we have N classes, then we should maintain and update k*N targets. This would encourage the outputs of the experts to be well separated. Please correct me if I am wrong.

Thank you!

LTH14 commented 1 year ago

Hi! Thanks for your interest. RIDE has several components (ablated in Table 4 of the paper) that contribute to its performance improvement. When combined with TSC, we found that the improvement from distribution-aware diversity loss is marginal, while the other three components bring more significant improvement. Your suggestion in 2 would be quite an interesting direction to explore, but we did not conduct such an experiment in the paper.

WeichengDai1 commented 1 year ago

Hi! Thank you for your answer. So I would assume that you incorporated MoE, trained the experts using the same N targets for the N classes, and used Loss_TSC (Eq. 3), L_u (Eq. 1), and Loss_distill (RIDE) for stage 1; then used Loss_routing (RIDE) for stage 2? Is that correct?

LTH14 commented 1 year ago

Yes -- except that the class assignment of the N targets for each expert is not the same.
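To make this concrete, here is a hedged sketch (not the authors' code) of what "same N targets, different class assignment per expert" could look like. The helper names `make_targets` and `per_expert_assignments` are hypothetical, and `make_targets` just samples random unit vectors; in TSC the targets would instead be a pre-computed, maximally separated configuration on the hypersphere.

```python
import torch

def make_targets(num_classes: int, dim: int) -> torch.Tensor:
    """Placeholder targets: random unit vectors on the sphere.
    TSC would use pre-computed, uniformly spread target features instead."""
    t = torch.randn(num_classes, dim)
    return t / t.norm(dim=1, keepdim=True)

def per_expert_assignments(num_experts: int, num_classes: int) -> torch.Tensor:
    """One class->target permutation per expert, so each expert maps
    classes to the shared target set differently."""
    return torch.stack([torch.randperm(num_classes) for _ in range(num_experts)])

targets = make_targets(num_classes=10, dim=128)        # shared target set
assign = per_expert_assignments(num_experts=3, num_classes=10)

# Target used by expert e for a sample of class y:
e, y = 1, 4
target_for_sample = targets[assign[e, y]]
```

The key point from the reply above is only that `assign` differs across experts while `targets` is shared; how the assignment itself is chosen (random, or via the matching procedure in the paper) is a separate design decision.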

WeichengDai1 commented 1 year ago

Hi, thank you for your clarification. I do have another question. In your main experiments, I think you adopted the same framework as MoCo, which consists of a teacher and a student. However, in your ablation using RIDE, did you also keep the teacher-student framework, or did you just adopt the same framework as RIDE? Thank you!

LTH14 commented 1 year ago

Yes -- the contrastive loss is computed between encoder q and encoder k, where k is a momentum encoder of q.
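For readers unfamiliar with the MoCo-style setup referenced here, a minimal sketch follows: encoder k is an exponential-moving-average (momentum) copy of encoder q, updated without gradients. The toy `nn.Sequential` encoders and the momentum value are illustrative assumptions, not the repo's actual architecture.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-ins for the real encoders (assumption for illustration).
encoder_q = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 8))
encoder_k = copy.deepcopy(encoder_q)
for p in encoder_k.parameters():
    p.requires_grad = False  # k is updated by EMA, never by backprop

@torch.no_grad()
def momentum_update(m: float = 0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q (MoCo-style EMA)."""
    for pq, pk in zip(encoder_q.parameters(), encoder_k.parameters()):
        pk.mul_(m).add_(pq, alpha=1 - m)

x = torch.randn(4, 32)
q = nn.functional.normalize(encoder_q(x), dim=1)  # query features
k = nn.functional.normalize(encoder_k(x), dim=1)  # key features (no grad path used)
momentum_update()  # call once per training step, after the optimizer update
```

The contrastive loss is then computed between the normalized `q` and `k` features; only `encoder_q`'s parameters receive gradients.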

WeichengDai1 commented 1 year ago

Thank you for your thorough explanation, I have learned a lot!