Closed brisker closed 5 years ago
Yes. We have some preliminary results for MobileNet on the ImageNet dataset, with uql_equivalent_bits=4 or nuql_equivalent_bits=4. (8-bit quantization does not hurt the performance, while 2-bit quantization degrades it significantly.)
For uql_equivalent_bits=4, the top-1 acc is 47.1%~53.0%.
For nuql_equivalent_bits=4, the top-1 acc is 64.5%~65.9%.
The accuracy on validation set is measured at a fixed interval. Note that the variance of acc could be large sometimes. The above results may be improved if hyper-parameters of the RL agent are more carefully designed.
@haolibai Why is the non-uniform quantization performance much better than uniform quantization? In standard quantization methods, it seems to be not like this.
Generally, non-uniform quantization can have larger model capacity and can therefore outperform uniform quantization, especially in low-bit settings.
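To make the capacity argument concrete, here is a minimal sketch comparing the reconstruction error of evenly spaced levels against levels that are free to move. It is not PocketFlow code: the Gaussian values are a stand-in for real layer weights, and using 1-D Lloyd (k-means) refinement for the non-uniform levels is an assumption made only for illustration.

```python
import random

def nearest(levels, x):
    """Return the level closest to x."""
    return min(levels, key=lambda c: abs(c - x))

def mse(values, levels):
    """Mean squared error after snapping every value to its nearest level."""
    return sum((v - nearest(levels, v)) ** 2 for v in values) / len(values)

def uniform_levels(values, bits):
    """2**bits evenly spaced levels spanning [min, max]."""
    lo, hi = min(values), max(values)
    n = 2 ** bits
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

def lloyd_levels(values, init_levels, iters=10):
    """Non-uniform levels: 1-D Lloyd (k-means) refinement of init_levels."""
    levels = list(init_levels)
    for _ in range(iters):
        buckets = [[] for _ in levels]
        for v in values:
            idx = min(range(len(levels)), key=lambda i: abs(levels[i] - v))
            buckets[idx].append(v)
        # Move each level to the mean of its assigned values; keep empty ones.
        levels = [sum(b) / len(b) if b else c for b, c in zip(buckets, levels)]
    return levels

random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(2000)]  # stand-in for layer weights
bits = 4
uni = uniform_levels(weights, bits)
non = lloyd_levels(weights, uni)  # start from the uniform grid, then refine
mse_uniform = mse(weights, uni)
mse_nonuniform = mse(weights, non)
print(f"uniform MSE:     {mse_uniform:.5f}")
print(f"non-uniform MSE: {mse_nonuniform:.5f}")
```

Because the refinement starts from the uniform grid and each Lloyd step cannot increase the error, the non-uniform levels are never worse on this metric. How much of that gap carries over to top-1 accuracy depends on the network and the training procedure.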
How can I reproduce the RL-based quantization performance you have claimed? Besides, your claimed performance is more like a range, "47.1%~53.0%". So what is the difference between the experiment settings of 47.1 and 53.0? @haolibai
The model is evaluated every 10000 steps, so there could be some variance in the accuracy. We found that for MobileNet the variance is a little large (e.g., 47.1%~53.0%). Keep the default parameters for the RL agent, i.e., run:
./scripts/run_seven.sh nets/mobilenet_at_ilsvrc12_run.py \
    --learner uniform \
    --uql_enbl_rl_agent
@haolibai
I have run this command (with default settings):
bash ./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=4 --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4
and in the first several iterations, the acc is very low (see the pic below). Is this normal?
(the log says "Restoring parameters from ./models/model.ckpt-166818", indicating it has already loaded the full-precision parameters)
@jiaxiang-wu @haolibai It has already run for over 6 roll-outs, and the acc is still very low. Is this normal? I just followed the default settings.
This is possible in the initial roll-outs. You may increase the learning rate a little so as to verify the current strategy effectively. In each roll-out, the model is restored from the full-precision one and quantized with the current strategy to see how well it performs. The best performance across the 200 roll-outs is recorded, together with the corresponding strategy.
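The roll-out procedure described above can be sketched roughly as follows. This is an illustration only, not the PocketFlow implementation: `evaluate_strategy` is a hypothetical placeholder with a toy score, strategies are sampled at random instead of being proposed by the RL agent, and the equivalent-bits constraint is left out. The 2 to 8 bit range and the 200 roll-outs are taken from the settings mentioned in this thread.

```python
import random

def evaluate_strategy(bits_per_layer):
    """Hypothetical stand-in for: restore full-precision weights, quantize
    each layer with the given bits, fine-tune briefly, return accuracy."""
    # Toy score: more bits -> higher "accuracy"; not a real measurement.
    return sum(bits_per_layer) / (8.0 * len(bits_per_layer))

def search(n_layers, n_rollouts, bit_min=2, bit_max=8, seed=0):
    """Try one strategy per roll-out and keep the best one seen so far."""
    rng = random.Random(seed)
    best_acc, best_strategy = float("-inf"), None
    for _ in range(n_rollouts):
        # A real agent would propose this; here we sample at random.
        strategy = [rng.randint(bit_min, bit_max) for _ in range(n_layers)]
        acc = evaluate_strategy(strategy)  # each roll-out starts from full precision
        if acc > best_acc:
            best_acc, best_strategy = acc, strategy
    return best_acc, best_strategy

best_acc, best_strategy = search(n_layers=5, n_rollouts=200)
print(best_acc, best_strategy)
```

Since only the best roll-out is kept, a run of poor roll-outs does not by itself spoil the final result, but it does give the agent little signal to learn from.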
@haolibai But I just followed the default settings. Did you mean the default settings have bugs? Now it has entered the 10th roll-out, and the acc is still nearly 0.
The first 25% of the epochs are only for collecting training data for the DDPG buffer, during which the RL agent has not started training. The search space is fairly large, and MobileNet-v1 is indeed sensitive to low-bit quantization, so in most cases, if the strategy is not good enough, the result will be pretty poor.
If I remember correctly, I generated the above results using the default settings. Nevertheless, the default settings may not be the best settings for training.
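As a rough picture of the warm-up phase mentioned above, the sketch below fills a replay buffer for the first 25% of the roll-outs and only counts agent updates afterwards. Everything in it is a placeholder assumed for illustration (the scalar action, the toy reward, the buffer size, the batch size, and counting in roll-outs rather than epochs); it does not reproduce PocketFlow's DDPG code.

```python
import random
from collections import deque

def run_rollouts(n_rollouts, warmup_frac=0.25, seed=0):
    """Return (warm-up roll-outs, agent updates, buffer size) for a toy run."""
    rng = random.Random(seed)
    buffer = deque(maxlen=10000)          # replay buffer of (action, reward) pairs
    n_warmup = int(n_rollouts * warmup_frac)
    n_updates = 0
    for i in range(n_rollouts):
        action = rng.uniform(0.0, 1.0)    # placeholder for a per-layer action
        reward = 1.0 - abs(action - 0.7)  # toy reward, not a real accuracy
        buffer.append((action, reward))
        if i >= n_warmup:
            # Only after warm-up would the actor and critic be updated from
            # a mini-batch sampled out of the buffer; here we just count it.
            batch = rng.sample(list(buffer), k=min(8, len(buffer)))
            n_updates += 1
    return n_warmup, n_updates, len(buffer)

print(run_rollouts(200))
```

Under these assumptions, a 200 roll-out run spends its first 50 roll-outs purely on exploration, so poor accuracies in that stretch say nothing yet about what the agent has learned.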
@haolibai
Besides, I have used 4 GPUs, but when running nvidia-smi, the actual GPU utilization of the last 3 GPUs is 0%. So does the RL-based quantization actually support multi-GPU training?
Training the RL agent uses only one GPU, while training the network parameters uses 4 GPUs. Could you check whether all 4 GPUs are used successfully with other learners?
@haolibai
Currently I am only using the RL-based UniformQuantization learner. The mobilenet-v1 RL-based model has run for over 120 roll-outs, and the acc is still nearly 0. I just followed the default settings with the following command:
bash ./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=4 --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4
I am wondering if the version of the code has some changes, given that you mentioned you have got a 47.1%~53.0% performance with mobilenet-v1 default settings.
(I successfully ran a resnet20-cifar model with RL-based 4-bit quantization training, with around 74.6% top-1 acc, which is still poor but not nearly 0, with the following command:
bash ./scripts/run_local.sh nets/resnet_at_cifar10_run.py --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4)
Did you set enbl_warm_start=True? The quantized network is based on a pretrained model.
Sorry, this should be added to the tutorial.
Previously no. But just a moment ago, as you mentioned, I added this flag to the command (note that this time I set equivalent_bits=8) and ran it, but in the first several iterations the acc is still nearly 0. It seems to make no difference. So I assume that there must be other possible mistakes in the default settings. @haolibai
I tried a moment ago with the following configuration, and the accuracies are reasonable.
./scripts/run_seven.sh nets/resnet_at_ilsvrc12_run.py -n=8 \
    --learner uniform \
    --enbl_warm_start \
    --uql_enbl_rl_agent \
    --uql_equivalent_bits 4 \
    --uql_w_bit_min 2 \
    --uql_w_bit_max 8
(seven is just a distributed system; you can replace it with your own GPUs.) Please make sure that you have successfully restored a pretrained MobileNet-v1 and set the correct hyper-parameters.
@haolibai
but what I refer to is mobilenet_v1, and the command I run is
bash ./scripts/run_local.sh nets/mobilenet_at_ilsvrc12_run.py -n=4 --learner=uniform --uql_enbl_rl_agent=True --uql_equivalent_bits=4
@brisker The key is that you need to enable warm-start by specifying --enbl_warm_start.
The following are just the default settings, so you can ignore them: --uql_w_bit_min 2 --uql_w_bit_max 8
@haolibai The problem does not seem to be totally solved. I noticed that even though the allocated layer bits are a lot higher than 2-bit or 3-bit, the acc is still nearly 0. Here is my log. Could you please check it? I just cannot figure out what is going wrong, given so many 0 accs.
@brisker Potential bug. We are looking at this issue.
Besides, if I run the resnet_cifar experiments (still RL-based quantization), the acc_top1 is normal, but the acc_top5 is all 0. I assume this is a minor bug. @haolibai @jiaxiang-wu
@brisker We noticed that the accuracy recovers quickly after the RL-based hyper-parameter optimization stage finishes and the final re-training process starts. Maybe something went wrong in the fast fine-tuning code for RL-based hyper-parameter optimization. Anyway, thanks for your feedback.
@jiaxiang-wu But if I run the resnet_cifar experiment, although the printed acc_top5 is all 0, the acc_top1 is very high, around 85%~100%.
@brisker
This is as expected. For ModelHelper classes defined on CIFAR-10, the acc_top5 is not computed and UniformQuantLearner directly sets it to zero in:
https://github.com/Tencent/PocketFlow/blob/master/learners/uniform_quantization/learner.py#L297
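So a zero acc_top5 on CIFAR-10 is a placeholder rather than a measurement. For readers who want the number anyway, top-k accuracy can be computed from the class scores along the following lines. This is a generic, framework-free sketch with made-up scores and labels; it is not the code at the link above.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

# Made-up scores for 3 samples over 3 classes.
scores = [
    [0.1, 0.6, 0.3],
    [0.5, 0.2, 0.3],
    [0.2, 0.3, 0.5],
]
labels = [1, 2, 0]
top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(top1, top2)
```

With k equal to the number of classes the metric is always 1, which is one reason top-5 is less informative on a 10-class dataset than on ImageNet.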
@jiaxiang-wu I mean that the acc_top1 of the resnet-cifar experiment is very high, but the acc of resnet-imagenet has a lot of 0s. Could this mean that the issue is not in the fast fine-tuning after all? Or is it because the CIFAR dataset is too easy compared to ImageNet?
@brisker I think it should be the latter: CIFAR-10 is so easy that some bug does not lead to a dramatic reduction in the top-1 accuracy.
@jiaxiang-wu I have not read the code carefully. But if there are too many 0-acc results in the RL process, the RL seems to lose its meaning. Previously @haolibai told me that he can get about 50% acc for mobilenet-imagenet, so I am curious whether there are also many 0-acc results in his log.
@brisker Maybe a recent code modification introduced bugs. If there are too many 0-acc results in the RL process, the RL agent is likely to fail to find the optimal hyper-parameter combination.
@jiaxiang-wu So can you tell me with which version of PocketFlow you can reproduce the RL-based quantization performance that @haolibai claimed?
It is an earlier version before the release. Let us reproduce the result and inform you later.
@haolibai
@jiaxiang-wu
What is the difference between bucket_type==split and bucket_type==channel?
Which type was used in the experiments you have reproduced (the mobilenet 47.1%~53.0% performance)?
Without bucketing.
Given a conv layer whose weights have shape [k,k,cin,cout], where N=k*k*cin*cout is the number of float values:
without bucketing, all the N values are quantized using only one alpha and one beta;
with bucketing and bucket_type==channel, all the N values are quantized using cout alphas and cout betas?
Have I understood this right?
@haolibai
Yes correct.
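The difference between the two cases confirmed above can be sketched in a few lines. This is an illustration under stated assumptions, not PocketFlow's implementation: alpha is taken to be the minimum of a bucket and beta its range, each output channel is stored as a flat list of k*k*cin values, and the weights are made up.

```python
def quantize(values, bits):
    """Uniformly quantize one bucket with its own offset (alpha) and range (beta)."""
    alpha = min(values)
    beta = max(values) - alpha
    steps = 2 ** bits - 1
    if beta == 0:
        return list(values)
    return [alpha + round((v - alpha) / beta * steps) / steps * beta for v in values]

def quantize_layer(weights, bits, per_channel):
    """weights: list of cout channels, each a flat list of k*k*cin floats.
    Returns the quantized weights and the number of (alpha, beta) pairs used."""
    if per_channel:
        # bucket_type == channel: one (alpha, beta) pair per output channel
        return [quantize(ch, bits) for ch in weights], len(weights)
    # no bucketing: one (alpha, beta) pair for all N values
    flat = [v for ch in weights for v in ch]
    q = quantize(flat, bits)
    size = len(weights[0])
    return [q[i * size:(i + 1) * size] for i in range(len(weights))], 1

def mse(a, b):
    """Mean squared error between two layers stored as lists of channels."""
    fa = [v for ch in a for v in ch]
    fb = [v for ch in b for v in ch]
    return sum((x - y) ** 2 for x, y in zip(fa, fb)) / len(fa)

# Two output channels with very different value ranges (made-up values).
weights = [[-0.02, -0.01, 0.0, 0.01, 0.02, 0.015],
           [-2.0, -1.0, 0.0, 1.0, 2.0, 1.5]]
q_tensor, n_pairs_tensor = quantize_layer(weights, bits=2, per_channel=False)
q_channel, n_pairs_channel = quantize_layer(weights, bits=2, per_channel=True)
print(n_pairs_tensor, mse(weights, q_tensor))
print(n_pairs_channel, mse(weights, q_channel))
```

In this toy layer the small-range channel is crushed onto the coarse grid of the large-range channel when a single pair is shared, which is the situation per-channel bucketing is meant to avoid, at the cost of storing cout pairs instead of one.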
@haolibai
If --uql_enbl_rl_agent==True, the weight quantization bits are learned by the RL agent, but what about the activation bits? 32? Or equivalent_bits?
activation_bits remains 32, since quantization with different bits across layers cannot be accelerated, so there is no need for activation quantization.
I do not quite understand what you mean. Why can different activation bits across layers not be accelerated? Here https://github.com/haolibai/PocketFlow/blob/master/learners/uniform_quantization/bit_optimizer.py#L52 it says "Currently only weight bits are inferred via RL. Activations later." So what is your plan for activation-RL in the future? @haolibai @jiaxiang-wu
A basic requirement for low-bit inference on most current deep learning frameworks is that the same bit precision is maintained for all the data. If weights are quantized with different bits, this requirement cannot be satisfied, and therefore inference cannot be accelerated. Nevertheless, weight quantization can still reduce the model size. Activation quantization, however, can neither reduce the model size nor be accelerated, so it brings no gain. Of course, we could theoretically do RL-based quantization on activations, but we will only implement it when it brings some practical advantage.
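The model-size argument is simple arithmetic, sketched below with hypothetical layer sizes and bit assignments. Reading "equivalent bits" as the parameter-weighted average bit-width is an assumption made for this example; the thread does not define it.

```python
def model_size_bytes(layer_params, layer_bits):
    """Total weight storage in bytes when layer i uses layer_bits[i] bits per weight."""
    total_bits = sum(n * b for n, b in zip(layer_params, layer_bits))
    return total_bits / 8

def average_bits(layer_params, layer_bits):
    """Parameter-weighted average bit-width across layers."""
    total_bits = sum(n * b for n, b in zip(layer_params, layer_bits))
    return total_bits / sum(layer_params)

layer_params = [1000, 4000, 5000]   # hypothetical parameter counts per layer
mixed_bits = [8, 4, 2]              # hypothetical per-layer weight bits

fp32 = model_size_bytes(layer_params, [32, 32, 32])
mixed = model_size_bytes(layer_params, mixed_bits)
avg = average_bits(layer_params, mixed_bits)
print(fp32, mixed, avg)  # 40000.0 4250.0 3.4
```

Here the mixed assignment stores the weights in 4250 bytes instead of 40000, regardless of whether the runtime can execute the layers any faster, which matches the point that mixed-precision weights reduce size but not necessarily latency.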
@haolibai @jiaxiang-wu Will PocketFlow support RL-agent reward signals taken directly from hardware, like latency or energy consumption?
I am not sure.
@haolibai So, up to now, have any bugs been found in the quantization code regarding the RL-based method?
Not yet, we are still working on it. BTW, you're also welcome to help us debug the RL optimization part.
@haolibai It seems that there is no corresponding paper about the RL-based quantization part. Could you please provide any documentation about the RL-based quantization algorithm? Or any related paper?
We refer to http://openaccess.thecvf.com/content_ECCV_2018/papers/Yihui_He_AMC_Automated_Model_ECCV_2018_paper.pdf to realize the RL based quantization. Please check.
Edit: Although this paper uses reinforcement learning to optimize each layer's pruning ratio for the channel pruning algorithm, rather than for uniform quantization, its idea of training an RL agent to optimize layer-wise hyper-parameters is adopted in our RL-based quantization algorithm.
@haolibai @jiaxiang-wu
Does this pic (from https://www.jiqizhixin.com/articles/2018-09-17-6) correspond to the RL-based quantization performance for imagenet-resnet18?
Does this pic mean that a w4a32 quantized model with no RL only has about 63% top-1 accuracy on ImageNet, and with RL the acc can rise to 68%?
Are these results with channel bucketing?
Do you have a performance summary of the RL-based quantization on MobileNet? I ask because you have described this here: https://pocketflow.github.io/reinforcement_learning/