frank-xwang / RIDE-LongTailRecognition

[ICLR 2021 Spotlight] Code release for "Long-tailed Recognition by Routing Diverse Distribution-Aware Experts."
MIT License

[Conceptual question] Question about self-distillation #3

Closed · seekingup closed this issue 3 years ago

seekingup commented 3 years ago

Thanks for your excellent code! It's very easy to get started!

What is the meaning of --distill_checkpoint path_to_checkpoint? Do I need to pre-train another model and use it for distillation?

I have simply trained a ResNet50 (2 experts) on ImageNet-LT without self-distillation, and the top-1 accuracy is 53.264%, which is about 1% lower than that reported in the paper. Would it help if I used self-distillation?

frank-xwang commented 3 years ago

Thank you for your interest in our work.

  1. If you point the distill_checkpoint argument to a checkpoint (a RIDE model with 6 experts), RIDE will be optimized end-to-end with a distillation loss, which usually brings about a 0.4%-1% improvement (the exact gain varies across experiments); see the sketch after this list for the general form of such a loss. For now, we have not provided any pre-trained checkpoints, so you may have to pre-train a RIDE model with 6 experts yourself to serve as the teacher. We will update the model zoo soon; you can download the teacher model there once it is available.

  2. Yes, you may need to enable distillation to reproduce the results reported in the paper. We also provide results for RIDE without distillation, using ResNeXt50 as the backbone, in the model zoo. We tried this locally before releasing the codebase and were able to reproduce the model-zoo results with this reorganized code.
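
For reference, the distillation term mentioned in point 1 generally takes the form below. This is a minimal sketch of a standard knowledge-distillation loss (temperature-softened KL divergence between teacher and student logits), not necessarily the exact loss implemented in this codebase; the temperature, the alpha weighting, and the training_step helper are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic KD term: KL divergence between softened teacher and student outputs.

    The temperature value here is an illustrative choice, not taken from the paper.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # Scale by T^2 so the gradient magnitude stays comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def training_step(student, teacher, images, labels, alpha=0.5):
    """Hypothetical end-to-end step: classification loss plus a weighted KD term.

    `teacher` stands in for the frozen 6-expert model loaded from --distill_checkpoint;
    `alpha` is an assumed weighting factor.
    """
    student_logits = student(images)
    with torch.no_grad():          # the teacher is kept frozen during distillation
        teacher_logits = teacher(images)
    ce = F.cross_entropy(student_logits, labels)
    kd = distillation_loss(student_logits, teacher_logits)
    return ce + alpha * kd
```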

Please let us know if you have any further questions about reproducing the results.