Tencent / PocketFlow

An Automatic Model Compression (AutoMC) framework for developing smaller and faster AI applications.
https://pocketflow.github.io

Parameters for RL-based WeightSparseLearner on MobileNets #150

Closed · d-jax closed this issue 5 years ago

d-jax commented 5 years ago

Hello PocketFlow team,

What are the parameters for reproducing the compressed MobileNets with weight sparsification, as reported in the PocketFlow paper? Thank you.

jiaxiang-wu commented 5 years ago

To reproduce results on MobileNet, you can use the following hyper-parameters:

$ ./scripts/run_docker.sh nets/mobilenet_at_ilsvrc12_run.py -n=8 \
    --learner weight-sparse \
    --enbl_dst \
    --spl_lrn_rate_rg 3e-3 \
    --spl_nb_iters_rg 100 \
    --spl_lrn_rate_ft 3e-4 \
    --spl_nb_iters_ft 1600

which specifies the WeightSparseLearner and enables the distillation loss. The last four flags adjust the learning rates and numbers of iterations in each roll-out's fine-tuning process. The default values are tuned to be optimal on CIFAR-10, but the ILSVRC-12 data set requires smaller learning rates and more iterations to recover the accuracy after compression in each roll-out.
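
For intuition, here is a minimal, self-contained sketch of the loop those four flags control: prune weights by magnitude, re-train briefly at the larger rate, then fine-tune longer at the smaller rate. Everything in it (the NumPy stand-in for the model, the placeholder gradients, the function names) is illustrative rather than PocketFlow's actual implementation:

    import numpy as np

    def magnitude_prune(weights, ratio):
        """Zero out the fraction `ratio` of weights with the smallest magnitudes."""
        flat = np.sort(np.abs(weights).ravel())
        threshold = flat[int(ratio * (flat.size - 1))]
        return np.where(np.abs(weights) < threshold, 0.0, weights)

    def retrain(weights, mask, lrn_rate, nb_iters):
        """Placeholder training loop: gradient steps restricted to unpruned weights."""
        for _ in range(nb_iters):
            grad = np.random.randn(*weights.shape)      # stand-in for a real gradient
            weights = weights - lrn_rate * grad * mask  # pruned entries stay zero
        return weights

    # One roll-out with the ILSVRC-12 settings from the command above.
    weights = np.random.randn(64, 64)
    weights = magnitude_prune(weights, ratio=0.5)  # candidate pruning ratio
    mask = (weights != 0.0).astype(weights.dtype)
    weights = retrain(weights, mask, 3e-3, 100)    # spl_lrn_rate_rg / spl_nb_iters_rg
    weights = retrain(weights, mask, 3e-4, 1600)   # spl_lrn_rate_ft / spl_nb_iters_ft

The two retrain calls simply mirror the rg and ft flag pairs from the command above.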

d-jax commented 5 years ago

@jiaxiang-wu This is super helpful. Thanks for sharing. How transferable do you think these parameters are from MobileNet V1 to, say, MobileNet V2, Inception, etc.? It'd be great to learn the optimal training parameters for more popular models.

jiaxiang-wu commented 5 years ago

These hyper-parameters work well on both ResNet and MobileNet-v1, so I expect they generalize to MobileNet-v2 and Inception models as well. The change to the last four hyper-parameters is driven mainly by the data set (CIFAR-10 vs. ILSVRC-12) rather than the model.

d-jax commented 5 years ago

Thanks again. One more question: is there a sparsity/performance gap between the 'heurist' and 'optimal' protocols for weight sparsification? I assume the 'heurist' protocol follows the "To prune, or not to prune" paper.

jiaxiang-wu commented 5 years ago

The "heurist" protocol does not correspond to the "To prune, or not to prune" paper. We think the core idea of that paper is to use a gradually increasing pruning ratio to boost the sparsity-constrained training, and this is implemented in WeightSparseLearner, regardless whatever protocol you are using. The "heurist" protocol actually follows the assumption that layers with more parameters should have larger redundancy and can be assigned with a larger pruning ratio.