Tencent / PocketFlow

An Automatic Model Compression (AutoMC) framework for developing smaller and faster AI applications.
https://pocketflow.github.io

Did you have the performance summary of the RL-based quantization performance on MobileNet? #115

Closed brisker closed 5 years ago

brisker commented 5 years ago

Do you have a performance summary of the RL-based quantization on MobileNet? You have described it here: https://pocketflow.github.io/reinforcement_learning/

haolibai commented 5 years ago

Yes. We have some preliminary results for MobileNet on the ImageNet dataset, with uql_equivalent_bits=4 or nuql_equivalent_bits=4. (8-bit quantization does not hurt the performance, while 2-bit degrades it significantly.)

For uql_equivalent_bits=4, the top-1 acc is 47.1%~53.0%. For nuql_equivalent_bits=4, the top-1 acc is 64.5%~65.9%.

The accuracy on the validation set is measured at a fixed interval. Note that the variance of the accuracy can sometimes be large. The above results may improve if the hyper-parameters of the RL agent are tuned more carefully.

brisker commented 5 years ago

@haolibai Why is the non-uniform quantization performance much better than uniform quantization? In standard quantization methods, this does not seem to be the case.

haolibai commented 5 years ago

Generally, non-uniform quantization has larger model capacity and can therefore outperform uniform quantization, especially in low-bit settings.
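
For intuition, here is a minimal NumPy sketch (illustrative only, not PocketFlow code) comparing the two schemes at the same bit-width on bell-shaped weights: non-uniform levels can be placed where the weights are dense, which typically lowers the quantization error.

```python
import numpy as np

def uniform_quantize(w, num_bits):
    # Evenly spaced levels between the min and max of the tensor.
    w_min, w_max = w.min(), w.max()
    step = (w_max - w_min) / (2 ** num_bits - 1)
    return np.round((w - w_min) / step) * step + w_min

def nonuniform_quantize(w, num_bits):
    # Levels placed at quantiles of the weight distribution, so the dense
    # region around zero gets finer resolution than the tails.
    levels = np.quantile(w, np.linspace(0.0, 1.0, 2 ** num_bits))
    idx = np.abs(w[:, None] - levels[None, :]).argmin(axis=1)
    return levels[idx]

w = 0.05 * np.random.randn(4096)  # toy bell-shaped "weights"
for bits in (2, 4):
    mse_u = np.mean((w - uniform_quantize(w, bits)) ** 2)
    mse_n = np.mean((w - nonuniform_quantize(w, bits)) ** 2)
    print(f"{bits}-bit  uniform MSE: {mse_u:.2e}  non-uniform MSE: {mse_n:.2e}")
```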

brisker commented 5 years ago

How can I reproduce the RL-based quantization performance you have claimed? Besides, the claimed performance is more like a range, "47.1%~53.0%". So what is the difference between the experiment settings for 47.1% and 53.0%? @haolibai

haolibai commented 5 years ago

The model is evaluated every 10,000 steps, so there can be some variance in the accuracy. We found that for MobileNet the variance is a bit large (e.g., 47.1%~53.0%). Keep the default parameters for the RL agent, i.e., run: ./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py --learner uniform --uql_enbl_rl_agent

brisker commented 5 years ago

@haolibai

  1. The MobileNet performance you mentioned above: is it MobileNet v1 or v2?
  2. What is the 32-bit (full-precision) performance of the v1 and v2 MobileNet models downloaded from https://api.ai.tencent.com/pocketflow ?

haolibai commented 5 years ago

  1. MobileNet v1.
  2. The full-precision results are listed at https://pocketflow.github.io/pre_trained_models/ and the model can be downloaded from https://api.ai.tencent.com/pocketflow/models_mobilenet_v1_at_ilsvrc_12.tar.gz

brisker commented 5 years ago

@haolibai I have run this command (with default settings): bash ./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=4 --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4, and in the first several iterations the accuracy is very low (see the screenshot below). Is this normal? [screenshot: snipaste_2018-12-03_19-52-50] (The log says "restoring parameters from ./models/model.ckpt-166818", indicating it has already loaded the full-precision parameters.)

brisker commented 5 years ago

@jiaxiang-wu @haolibai It has already run for over 6 roll-outs and the accuracy is still very low. Is this normal? I just followed the default settings.

haolibai commented 5 years ago

This is possible in the initial roll-outs. You may increase the learning rate a little so as to verify the current strategy more effectively. In each roll-out, the model is restored from the full-precision checkpoint and quantized according to the current strategy to see how well it performs. The best performance over the 200 roll-outs is recorded, together with the corresponding strategy.

brisker commented 5 years ago

@haolibai But I just followed the default settings. ... Did you mean the default settings have bugs? ... It has now entered the 10th roll-out, and the accuracy is still nearly 0.

haolibai commented 5 years ago

The first 25% of epochs are used only to collect training data for the DDPG replay buffer; the RL agent has not started training during that phase. The search space is fairly large, and MobileNet-v1 is indeed sensitive to low-bit quantization, so in most cases, if the strategy is not good enough, the result will be quite poor.

If I remember correctly, I generated the above results using the default settings. Nevertheless, the default settings may not be the best settings for training.
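
A self-contained sketch of the search procedure described above, with random stand-ins for the real model and agent (illustrative names and values, not PocketFlow's actual API):

```python
import random

NUM_ROLLOUTS, NUM_LAYERS = 200, 28          # illustrative values
WARMUP = int(0.25 * NUM_ROLLOUTS)           # first 25%: only fill the DDPG buffer

def propose_strategy(rollout):
    # Stand-in for the agent: random exploration during warm-up, DDPG policy
    # afterwards (both are random here, since this is only a sketch).
    return [random.randint(2, 8) for _ in range(NUM_LAYERS)]

def quantize_and_evaluate(strategy):
    # Stand-in for: restore the full-precision checkpoint, quantize each layer
    # with the proposed bits, fast fine-tune, and return validation accuracy.
    return random.random()

buffer, best = [], (-1.0, None)
for rollout in range(NUM_ROLLOUTS):
    strategy = propose_strategy(rollout)
    acc = quantize_and_evaluate(strategy)
    buffer.append((strategy, acc))          # experience for the DDPG agent
    if rollout >= WARMUP:
        pass                                # real learner: update the DDPG agent here
    if acc > best[0]:
        best = (acc, strategy)              # only the best roll-out is reported
```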

brisker commented 5 years ago

@haolibai Besides, I have used 4 GPUs, but when running nvidia-smi, the GPU utilization of the last 3 GPUs is 0%. So does the RL-based quantization actually support multi-GPU training?

haolibai commented 5 years ago

Training the RL agent uses only one GPU, while training the network parameters uses 4 GPUs. Please check whether you can successfully use 4 GPUs with other learners.

brisker commented 5 years ago

@haolibai Currently I am only using the RL-based uniform quantization learner. The MobileNet-v1 RL-based model has run for over 120 roll-outs, and the accuracy is still nearly 0. I just followed the default settings with the following command: bash ./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=4 --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4 I am wondering whether the code has changed, given that you mentioned you got 47.1%~53.0% with MobileNet-v1 under the default settings.

(I did successfully run a ResNet-20 CIFAR model with RL-based 4-bit quantization training, reaching around 74.6% top-1 accuracy, which is still poor but not nearly 0, with the following command: bash ./scripts/run_local.sh nets/resnet_at_cifar10_run.py --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4)

haolibai commented 5 years ago

Did you set enbl_warm_start=True? The quantized network is initialized from a pretrained model. Sorry, this should have been added to the tutorial.

brisker commented 5 years ago

Previously no. But just a moment ago, as you mentioned, I added this flag to the command (and note that this time I set equivalent_bits=8) and ran it, but in the first several iterations the accuracy is still nearly 0. It seems to make no difference. So I assume there must be other mistakes in the default settings. @haolibai

haolibai commented 5 years ago

I tried it a moment ago, using the following configuration, and the accuracies look reasonable.

./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py -n=8 --learner uniform --enbl_warm_start --uql_enbl_rl_agent --uql_equivalent_bits 4 --uql_w_bit_min 2 --uql_w_bit_max 8

[screenshot of the training log showing the reported accuracies]

(seven is just a distributed system; you can replace it with your own GPUs.) Please make sure that you have successfully restored a pretrained MobileNet-v1 and are using correct hyper-parameters.

brisker commented 5 years ago

@haolibai But what I am referring to is MobileNet-v1, and the command I ran is: bash ./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=4 --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4

jiaxiang-wu commented 5 years ago

@brisker The key is that you need to enable warm-start by specifying --enbl_warm_start.

haolibai commented 5 years ago

The following are just the default settings, you can ignore them: --uql_w_bit_min 2 --uql_w_bit_max 8

brisker commented 5 years ago

@haolibai The problem does not seem to be completely solved. I noticed that even when the allocated layer bits are much higher than 2 or 3 bits, the accuracy is still nearly 0. Here is my log; could you please check it? I just cannot figure out what is going wrong, given so many 0 accuracies.

resnet18_imagenet_4bit_log(2).log

jiaxiang-wu commented 5 years ago

@brisker Potential bug. We are looking at this issue.

brisker commented 5 years ago

Besides, if I run the resnet_cifar experiments (still RL-based quantization), acc_top1 is normal but acc_top5 is all 0. I assume this is a minor bug. @haolibai @jiaxiang-wu

jiaxiang-wu commented 5 years ago

@brisker We noticed that the accuracy recovers quickly after the RL-based hyper-parameter optimization stage finishes and the final re-training process starts. Maybe something went wrong in the fast fine-tuning code for RL-based hyper-parameter optimization. Anyway, thanks for your feedback.

brisker commented 5 years ago

@jiaxiang-wu But when I run the resnet_cifar experiment, although the printed acc_top5 is all 0, the acc_top1 is very high, around 85%~100%.

jiaxiang-wu commented 5 years ago

@brisker This is as expected. For ModelHelper classes defined on CIFAR-10, the acc_top5 is not computed and UniformQuantLearner directly sets it to zero in: https://github.com/Tencent/PocketFlow/blob/master/learners/uniform_quantization/learner.py#L297
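
In other words, the reporting behaves roughly like the sketch below (a paraphrase of the described behaviour, not the exact code at that line):

```python
# CIFAR-10 model helpers expose no top-5 metric, so the learner reports 0.0.
def summarize_eval(dataset_name, acc_top1, acc_top5=None):
    if dataset_name == 'cifar_10':
        acc_top5 = 0.0                      # not computed for CIFAR-10
    return {'acc_top1': acc_top1, 'acc_top5': acc_top5}

print(summarize_eval('cifar_10', acc_top1=0.92))  # {'acc_top1': 0.92, 'acc_top5': 0.0}
```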

brisker commented 5 years ago

@jiaxiang-wu I mean that the acc_top1 of the resnet-cifar experiment is very high, while the accuracy of resnet-imagenet has a lot of 0s. Could this mean that it is not an issue with the fast fine-tuning? Or is it because the CIFAR dataset is too easy compared to ImageNet?

jiaxiang-wu commented 5 years ago

@brisker I think it is the latter: CIFAR-10 is too easy, so the bug does not lead to a dramatic reduction in the top-1 accuracy.

brisker commented 5 years ago

@jiaxiang-wu I have not read the code carefully. But if there are too many 0-accuracy roll-outs in the RL process, the RL seems to lose its meaning. Previously @haolibai told me that he could get about 50% accuracy for mobilenet-imagenet; I am curious whether his log also contained many 0-accuracy roll-outs.

jiaxiang-wu commented 5 years ago

@brisker Maybe some recent code modification introduced bugs. If there are too many 0-accuracy roll-outs in the RL process, the RL agent is likely to fail to find the optimal hyper-parameter combination.

brisker commented 5 years ago

@jiaxiang-wu So can you tell me with which version of PocketFlow the RL-based quantization performance that @haolibai claimed can be reproduced?

haolibai commented 5 years ago

It was an earlier version from before the release. Let us reproduce the result and get back to you later.

brisker commented 5 years ago

@haolibai @jiaxiang-wu What is the difference between bucket_type==split and bucket_type==channel? Which type was used in the experiments you reproduced (the MobileNet 47.1%~53.0% performance)?

haolibai commented 5 years ago

Without bucketing.

brisker commented 5 years ago

Given a conv layer with weights of shape [k, k, cin, cout], i.e., N = k*k*cin*cout float values: without bucketing, all N values are quantized using a single alpha and beta; with bucket_type==channel, all N values are quantized using cout alphas and cout betas. Have I understood this correctly? @haolibai

haolibai commented 5 years ago

Yes correct.
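
A minimal NumPy sketch of the two schemes just confirmed (illustrative only, not PocketFlow's implementation):

```python
import numpy as np

def min_max_quantize(x, num_bits):
    # One (alpha, beta) pair, i.e., min-max uniform quantization of a flat array.
    alpha, beta = x.min(), x.max()
    step = (beta - alpha) / (2 ** num_bits - 1)
    return np.round((x - alpha) / step) * step + alpha

w = np.random.randn(3, 3, 64, 128)          # conv weights of shape [k, k, c_in, c_out]

# Without bucketing: a single (alpha, beta) pair for all k*k*c_in*c_out values.
q_no_bucket = min_max_quantize(w.reshape(-1), 4).reshape(w.shape)

# Channel bucketing: one (alpha, beta) pair per output channel, i.e., c_out pairs.
q_channel = np.stack(
    [min_max_quantize(w[..., c].reshape(-1), 4).reshape(w.shape[:-1])
     for c in range(w.shape[-1])],
    axis=-1)
```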

brisker commented 5 years ago

@haolibai If --uql_enbl_rl_agent==True, the weight quantization bits are learned by the RL agent, but what about the activation bits? 32, or equivalent_bits?

haolibai commented 5 years ago

activation_bits remains 32, since quantization with different bit-widths across layers cannot be accelerated, so there is no need to quantize the activations.

brisker commented 5 years ago

I do not quite understand what you mean. Why can different activation bits across layers not be accelerated? Here, https://github.com/haolibai/PocketFlow/blob/master/learners/uniform_quantization/bit_optimizer.py#L52, it says "Currently only weight bits are inferred via RL. Activations later." So what is your future plan regarding RL on activation bits? @haolibai @jiaxiang-wu

haolibai commented 5 years ago

A basic requirement for low-bit inference on most current deep learning frameworks is that all data must be kept at the same precision. If weights are quantized with different bit-widths per layer, that requirement cannot be satisfied, so inference cannot be accelerated; nevertheless, weight quantization still reduces the model size. Activation quantization, however, can neither reduce the model size nor be accelerated in this setting, so it brings no gain. Of course we could do RL-based quantization of activations in theory, but we will implement it only when it brings some practical advantage.
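
A back-of-the-envelope illustration of this trade-off (all numbers are made up): mixed-bit weights still shrink the stored model, while activations kept at 32 bits cost nothing extra in model size.

```python
# Hypothetical per-layer parameter counts and the bits assigned by the RL agent.
layer_params = [4_000_000, 2_000_000, 1_000_000]
layer_bits   = [8, 4, 2]

fp32_mb  = sum(n * 32 for n in layer_params) / 8 / 1e6
mixed_mb = sum(n * b for n, b in zip(layer_params, layer_bits)) / 8 / 1e6
print(f"weights at FP32: {fp32_mb:.1f} MB   mixed-bit weights: {mixed_mb:.1f} MB")
# Activations are intermediate tensors, not stored in the model file, so keeping
# them at 32 bits does not change the numbers above; quantizing them only pays
# off if the framework can actually run low-bit kernels end to end.
```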

brisker commented 5 years ago

@haolibai @jiaxiang-wu Will PocketFlow support RL-agent reward signals taken directly from hardware, such as latency or energy consumption?

haolibai commented 5 years ago

I am not sure.

brisker commented 5 years ago

@haolibai So, up to now, have any bugs been found in the quantization code regarding the RL-based method?

haolibai commented 5 years ago

Not yet; we are still working on it. BTW, you are also welcome to help us debug the RL optimization part.

brisker commented 5 years ago

@haolibai It seems that there is no corresponding paper on the RL-based quantization part. Could you please provide any documentation of the RL-based quantization algorithm, or any related paper?

haolibai commented 5 years ago

We referred to http://openaccess.thecvf.com/content_ECCV_2018/papers/Yihui_He_AMC_Automated_Model_ECCV_2018_paper.pdf when implementing the RL-based quantization. Please check it.

Edit: Although this paper uses reinforcement learning to optimize each layer's pruning ratio for a channel-pruning algorithm, rather than for uniform quantization, its idea of training an RL agent to optimize layer-wise hyper-parameters is adopted in our RL-based quantization algorithm.
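
For reference, a sketch of how a continuous action from an AMC-style DDPG actor could be mapped to a discrete per-layer bit-width (illustrative only; not necessarily PocketFlow's exact mapping):

```python
def action_to_bits(action, bit_min=2, bit_max=8):
    # The actor outputs a continuous value in [0, 1]; it is rescaled and
    # rounded to an integer bit-width within [bit_min, bit_max].
    return int(round(bit_min + action * (bit_max - bit_min)))

print([action_to_bits(a) for a in (0.0, 0.33, 0.9)])  # [2, 4, 7]
```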

brisker commented 5 years ago

@haolibai @jiaxiang-wu

  1. Is this picture (from https://www.jiqizhixin.com/articles/2018-09-17-6) showing the RL-based quantization performance for ImageNet ResNet-18?

  2. Does this picture mean that the w4a32 quantized model without RL only reaches about 63% top-1 accuracy on ImageNet, and that with RL the accuracy can rise to about 68%?

  3. Were these results obtained with channel bucketing?

haolibai commented 5 years ago

  1. Yes.
  2. In later experiments we found that w4a32 without RL can also achieve an accuracy of up to 66~67% by decreasing the learning rate properly. The RL-based results have some variance, but are generally comparable or better.
  3. Without bucketing.

brisker commented 5 years ago

@haolibai Why is the learning rate always very small (usually), no matter whether the method is RL-based or not? In the log you sent me here, the learning rate starts from 2e-6. Is it necessary to set it as small as 2e-6? Or is the printed 2e-6 not the "real" learning rate?