changlin31 / DNA

(CVPR 2020) Block-wisely Supervised Neural Architecture Search with Knowledge Distillation

The model after searching for the best architecture under constraint #31

Closed KelvinYang0320 closed 2 years ago

KelvinYang0320 commented 2 years ago

Hi @changlin31 , thank you for your great work! I really enjoyed your paper. I want to ask you a question regarding the checkpoint of models under constraint.

Thank you in advance!

changlin31 commented 2 years ago

Hi @KelvinYang0320 ,

Thanks for your interest!

If I want to test this best model architecture, should I load the student supernet from the checkpoint in step (i) and use the encoding to run this sub-network within the supernet?

Since we have already trained these blocks with KD in the previous step (i), I think I can get the trained model directly in step (ii), i.e. something like a single model checkpoint for DNA_a, DNA_b, etc.?

  • By default, we retrain the searched architectures from scratch in DNA. But it is still worth trying to directly load the sub-network weights from the supernet. As the supernet is trained only with feature distillation, it does not have a classifier, so finetuning is required if you want a complete model with a classifier (a rough sketch is shown below). Please refer to this great work, DONNA (ICCV 2021), which directly finetunes sub-networks of the DNA supernet.
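A minimal sketch of that finetuning step, assuming a generic PyTorch setup; the backbone here is a stand-in for a sub-network initialized from the DNA supernet weights, not the repo's actual API:

import torch
import torch.nn as nn

# Hypothetical stand-in for the searched sub-network; in practice it would be
# built from the architecture encoding and initialized with supernet weights.
backbone = nn.Sequential(
    nn.Conv2d(3, 32, 3, stride=2, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(inplace=True),
)

# The supernet has no classifier, so attach a fresh head and finetune.
model = nn.Sequential(
    backbone,
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 1000),  # e.g. 1000 ImageNet classes
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
logits = model(torch.randn(2, 3, 224, 224))  # (2, 1000); train with cross-entropy as usual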

Can you point out how to set up the constraints to get DNA_a ~ DNA_d?

  • DNA_a flops: 350M;
  • DNA_b flops: 410M;
  • DNA_c params: 5.3M;
  • DNA_d no constraint.
KelvinYang0320 commented 2 years ago

@changlin31 Thank you for the suggestion. I will read through DONNA! 😄

DNA_a flops: 350M; DNA_b flops: 410M; DNA_c params: 5.3M; DNA_d no constraint.

As for the constraints for DNA_a~DNA_d, were these searched in the same supernet? Three different student supernets are shown in Table 1.

Also, is DNA_b's FLOPs 399M as written in the paper, or is that a typo? (EfficientNet-B0's FLOPs is also 399M, and DNA_b's FLOPs in Table 2 is 406M, which exceeds 399M.)

changlin31 commented 2 years ago

These architectures were searched by combining all three supernets. However, we later found that DNA works better (has a higher architecture ranking correlation) on a single supernet. So searching in a single supernet may be a better choice.

Yes, we actually used slightly more FLOPs than 399M, as long as it is still comparable.

KelvinYang0320 commented 2 years ago

@changlin31 Thanks for the information!

These architectures were searched by combining all the three supernets.

Since these architectures were searched across all three supernets, in order to reproduce your DNA_a~DNA_d search results, should I do step (i) by modifying self.block_cfgs in student_supernet.py according to the three supernets in Table 1?

To apply these constraints, do I just need to change the target_constrain to one of these constraints and change the target to 'params' or 'flops'? Should I also change the calculation in stage_max_param or other code?

I don't fully understand the calculation in these lines about the other_params and stage_max_params. Could you explain this part, please? Thanks a lot!

changlin31 commented 2 years ago

Hi, @KelvinYang0320

1.

Since these architectures were searched across all three supernets, in order to reproduce your DNA_a~DNA_d search results, should I do step (i) by modifying self.block_cfgs in student_supernet.py according to the three supernets in Table 1?

Yes, you could do that.

2.

To apply these constraints, do I just need to change the target_constrain to one of these constraints and change the target to 'params' or 'flops'? Should I also change the calculation in stage_max_param or other code?

Yes, the constraints need to be changed. The params of a different supernet can be calculated automatically. However, if the search target is FLOPs, you should calculate the FLOPs table yourself, as we did not implement automatic FLOPs calculation. https://github.com/changlin31/DNA/blob/156c16ce2701b345286b0c946d63e92ea37a67ca/searching/process_potential.py#L115 You can use this tool, fvcore, to calculate the FLOPs, for example:

import torch
from fvcore.nn import FlopCountAnalysis, flop_count_table

# `model` is the network (or sub-network) you want to profile.
input = torch.randn(1, 3, 224, 224)
flops = FlopCountAnalysis(model, input)
print(flop_count_table(flops))  # per-module FLOPs breakdown
print(flops.total())            # total FLOPs
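For a per-operation FLOPs table (what the search in process_potential.py consumes), one rough approach is to profile each candidate operation at its stage's input resolution. The operation names, shapes, and table format below are hypothetical, not the repo's:

import torch
import torch.nn as nn
from fvcore.nn import FlopCountAnalysis

# Placeholder candidate operations; replace with the supernet's actual blocks.
candidate_ops = {
    'op_k3': nn.Conv2d(32, 32, kernel_size=3, padding=1),
    'op_k5': nn.Conv2d(32, 32, kernel_size=5, padding=2),
}
dummy_input = torch.randn(1, 32, 56, 56)  # input shape of the stage being measured

flops_table = {name: FlopCountAnalysis(op, dummy_input).total()
               for name, op in candidate_ops.items()}
print(flops_table)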

3.

I don't fully understand the calculation in these lines about the other_params and stage_max_params. Could you explain this part, please? Thanks a lot!

These lines set the maximum FLOPs allowed for each stage to accelerate the search. For example, if the FLOPs target is 400M and the smallest operation in the last stage costs 100M FLOPs, then it is not necessary to keep searching the second-to-last stage once 300M has been reached.
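A toy illustration of that pruning rule (made-up numbers, not the repo's code): each stage's cap is the total target minus the cheapest possible cost of the remaining stages.

# Smallest per-stage FLOPs of any candidate operation (illustrative values).
target_flops = 400e6
min_flops_per_stage = [30e6, 40e6, 60e6, 100e6]

stage_max_flops = []
for i in range(len(min_flops_per_stage)):
    remaining_min = sum(min_flops_per_stage[i + 1:])  # cheapest completion after stage i
    stage_max_flops.append(target_flops - remaining_min)

print([f / 1e6 for f in stage_max_flops])  # [200.0, 240.0, 300.0, 400.0]
# Any partial architecture exceeding stage_max_flops[i] by stage i cannot
# meet the target, so the search stops extending it.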

KelvinYang0320 commented 2 years ago

Thank you so much for your detailed explanation. :smile: