NVlabs / DeepInversion

Official PyTorch implementation of Dreaming to Distill: Data-free Knowledge Transfer via DeepInversion (CVPR 2020)

Questions on the KD process on CIFAR10 dataset #6

Open zkf85 opened 3 years ago

zkf85 commented 3 years ago

Hi there,

Your work is great! I have some questions about the knowledge distillation (KD) process on the CIFAR10 dataset in the experiments part of your paper.

  1. How many CIFAR10-like images did you generate in order to reach the accuracies in Table 1 of your paper? We have tried with 3000 or 10000 generated images (with DI, ResNet34, alpha_f = 10) using vanilla KD (the standard soft-target loss, sketched below) to distill from ResNet34 to ResNet18, and only reached 25% or 55% validation accuracy, respectively.

  2. We encountered problems when trying ADI. The description of Table 1 says "for ADI, we generate one new batch of images every 50 KD iterations and merge the newly generated images into the existing set of generated images". Could you please explain this in more detail? Do "50 KD iterations" mean 50 KD epochs? Does "one new batch of images" mean a batch of, say, 256 images that is merged into the existing generated dataset? Does the KD process have to pause and wait for the batch-generation process every 50 KD iterations (epochs, if I understand correctly)?
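For reference, by "vanilla KD" we mean the standard soft-target distillation objective, roughly as sketched below (the temperature `T` here is just a placeholder value, not taken from the paper):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    # Standard soft-target distillation loss (Hinton et al.): KL divergence
    # between temperature-softened teacher and student distributions.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
```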

Thanks

pamolchanov commented 3 years ago

Answering your questions:

  1. For these experiments we generated 1000 batches of batch size 256 in total. DI/ADI generate an entire batch of data at once. Most likely your hyperparameters are not correct, which is why the results differ.

  2. ADI is implemented in the same manner as DI, with the difference that the student model is also considered. "50 KD iterations" means 50 update steps of the optimizer. Each epoch has roughly 195 updates; we generate batches with DI or ADI at update numbers 0, 49, 99, and 149, so 4 batches are generated during one epoch. All newly generated batches are added to the pool. On the remaining update steps (the other 191 per epoch, where no batch is generated) we randomly select a batch from those in the pool. Total KD training runs for 250 epochs, which leads to 1000 batches in the end. The longer we train, the more data there is in the pool. Initially we start with 50 batches pre-generated with DI and, when ADI is used, replace them with every newly generated batch. We apply a random translation of +-2 px to every image when we load a batch from the pool. A rough sketch of this schedule is given below.
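A minimal sketch of this schedule, assuming a plain soft-target KD loss (the models, optimizer, and the `generate_batch` stand-in below are placeholders, not our actual code):

```python
import random
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

# Placeholder teacher/student/optimizer for CIFAR-10 sized inputs (not the exact recipe).
teacher = torchvision.models.resnet34(num_classes=10).eval()
student = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)

def generate_batch(use_adi: bool) -> torch.Tensor:
    # Stand-in for DI/ADI synthesis. In the real pipeline this optimizes a fresh
    # batch of 256 images against the teacher (and, for ADI, also the current
    # student); random noise is used here only so the sketch runs.
    return torch.randn(256, 3, 32, 32)

pool = [generate_batch(use_adi=False) for _ in range(50)]  # 50 pre-generated DI batches
shift = T.RandomAffine(degrees=0, translate=(2 / 32, 2 / 32))  # random +-2 px translation

updates_per_epoch = 195          # ~50k CIFAR-10 images / batch size 256
generate_at = {0, 49, 99, 149}   # 4 new batches per epoch -> 1000 generated over 250 epochs

for epoch in range(250):
    for step in range(updates_per_epoch):
        if step in generate_at:
            images = generate_batch(use_adi=True)   # synthesize a new batch
            pool.append(images)                     # merge it into the existing pool
        else:
            images = random.choice(pool)            # the other 191 steps reuse stored batches
        images = shift(images)  # augmentation when loading from the pool (one shift per batch here)
        with torch.no_grad():
            t_logits = teacher(images)
        s_logits = student(images)
        # Plain KD objective: match the teacher's soft predictions.
        loss = F.kl_div(F.log_softmax(s_logits, dim=1),
                        F.softmax(t_logits, dim=1), reduction="batchmean")
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

In this reading, KD only pauses at the four generation steps per epoch; every other step reuses a batch that is already in the pool.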

We agree that sharing code for CIFAR KD would be helpful and will try to do so ASAP.

shannonjryan commented 3 years ago

Hi @pamolchanov, just wanted to check in and see if you still plan on releasing additional code for CIFAR KD (or ImageNet, for that matter)?

Thanks for sharing your research!