Hi, thanks for looking into the details. The scripts provided serve as toy examples, and we wanted to show them working, so we used a different model as the student, one that had already been trained so that it produces a meaningful result. In the original paper, ADI is always used with a student that is undergoing the training procedure and is not pretrained.
Hi, thanks a lot for the update! What about the complete data-free KD code? Some people have questions about the details. When will you release it?
"ADI is always used with student that is undergoing the training procedure and is not pertained." This approach leads to a huge computational cost. Because every time a batch (256 in total) of data is generated, it goes through 2000 iterations of training. And I now have doubts about whether the paper can be reproduced.
Hi, thanks for your great work! I noticed that the student network is pretrained for the ADI experiment on ImageNet. This is quite strange, since in data-free knowledge distillation the goal is to train a student with the synthetic samples; if you already have a pretrained student, the problem does not exist in the first place.
Meanwhile, for the CIFAR-10 experiment the student is not pretrained, which I think is the normal setting. But this is inconsistent with the ImageNet setup. Could you explain briefly why you chose different schemes for CIFAR-10 and ImageNet? Thanks!
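For reference, the setting I would expect for data-free KD looks roughly like the sketch below (hypothetical names, not the repo's API): the student starts from random initialization and is trained only on teacher-generated synthetic batches.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, resnet50

teacher = resnet50(pretrained=True).eval()  # trained teacher: the only source of supervision
student = resnet18(pretrained=False)        # student starts from scratch, NOT pretrained
opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

num_steps = 1000  # illustrative
for step in range(num_steps):
    # placeholder for a DeepInversion-style synthetic batch (see the sketch earlier in this thread)
    images = torch.randn(256, 3, 224, 224)
    with torch.no_grad():
        t_logits = teacher(images)
    s_logits = student(images)
    # standard KD loss: KL divergence between softened teacher and student predictions
    T = 3.0
    loss = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                    F.softmax(t_logits / T, dim=1),
                    reduction="batchmean") * T * T
    opt.zero_grad()
    loss.backward()
    opt.step()
```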