Eric-mingjie / rethinking-network-pruning

Rethinking the Value of Network Pruning (Pytorch) (ICLR 2019)
MIT License

A question about the training epochs #48

Closed zxGuo closed 3 years ago

zxGuo commented 3 years ago

Hi, thanks for the great work! I think it is a really valuable observation that searching for the optimal structure is the real value of channel pruning. But I have a question: when you compare the performance of Fine-tuned, Scratch-E, and Scratch-B, fine-tuning seems to take only a small number of epochs. For example, on CIFAR-10, fine-tuning takes only 40 epochs while Scratch-E takes 160 epochs. Could this be the reason why the scratch-trained models outperform the fine-tuned ones? In my experiments, I find that fine-tuning for 40 epochs is really not enough, especially at higher pruning ratios. Comparing two models trained with such different numbers of epochs may not be a fair comparison.

Eric-mingjie commented 3 years ago

When fine-tuning, we initialize the network with part of the pretrained network; when doing Scratch-E and Scratch-B, we initialize the network with random weights. So you shouldn't count the fine-tuning budget as 40 epochs; it should be 160 + 40 epochs, since pretraining + fine-tuning together is the overall procedure for obtaining a final network from random initialization. In the paper, for simplicity, we count the total compute budget of fine-tuning as 160 epochs (not 200).
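
A minimal sketch of this epoch accounting, using the CIFAR-10 numbers from this thread (160 pretraining epochs, 40 fine-tuning epochs). Here Scratch-B is assumed to scale epochs by the pruned model's FLOPs saving, and `flops_ratio` is a made-up example value rather than anything from the repo:

```python
# Sketch of the epoch/compute accounting discussed above (illustrative only).

PRETRAIN_EPOCHS = 160   # training the unpruned network from random init
FINETUNE_EPOCHS = 40    # fine-tuning the pruned network

# Fine-tuned model: pretraining must be counted, since fine-tuning
# starts from (part of) the pretrained weights, not from random init.
finetune_total_epochs = PRETRAIN_EPOCHS + FINETUNE_EPOCHS  # 200 (counted as ~160 in the paper)

# Scratch-E: same number of epochs as pretraining, starting from random weights.
scratch_e_epochs = PRETRAIN_EPOCHS  # 160

# Scratch-B (assumption): same overall compute budget as training the large model,
# so epochs scale inversely with the pruned model's per-epoch FLOPs.
flops_ratio = 0.5  # hypothetical: pruned model costs 50% of the original per epoch
scratch_b_epochs = round(PRETRAIN_EPOCHS / flops_ratio)  # 320 under this assumption

print(finetune_total_epochs, scratch_e_epochs, scratch_b_epochs)
```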

zxGuo commented 3 years ago

It makes sense. Thanks for the fast reply!