Open brisker opened 5 years ago
Dear DuWang:
Thanks for your comment and your interest in my work. I am not going to change the readme file because I prefer to keep it simple. However, I will try to answer your questions here:
Which Python files to run? There are three steps, as you are probably aware from reading the paper:

1.1: Run train_baseline/train_probe_save.py to train the baseline and collect all the statistics needed for the precision analysis. Basically, this saves everything needed to compute the equations in Claim 1. Make sure you prepare target folders for the data dump as needed; you can easily figure this out from reading the code, or from the error you get if a folder is missing.

1.2: To determine the precisions, run the other .py files. The remaining files in train_baseline/ compute the precisions (and dynamic ranges) of the gradients; each file name indicates which precision it computes. Note that these files use the data you dumped in step 1.1. Feel free to change the criteria if you find the analysis too conservative. Then move your model files to the FF_analysis/ folder. There, use FFAnalysis to compute the quantization noise gains, then use those to compute the precision offsets via the simple equation in the paper, which is implemented in precision_offset.py. Finally, use quantized_inference.py to determine the minimum precision according to your p_m budget. Now we have all precisions except those of the weight accumulators. Applying Criterion 5 is very easy: use train-internal/determineAccPrecision.py once you have selected your learning rate and have already set the weight and weight gradient precisions.

1.3: Run training in fixed point using train-internal/train_quantized.py. Note that inside this file you will find placeholders for the precisions of all tensors except the activation gradients; because of the implementation, those need to be set inside quantizeGradPredicted.py.
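To make the terminology concrete: what these files ultimately choose is a per-tensor bit width and dynamic range for a uniform fixed-point quantizer. As a rough illustration of the general idea only (the actual quantizers in the repo, e.g. in layers.py and quantizeGradPredicted.py, may differ in detail), such a quantizer can be sketched as:

```python
import numpy as np

def fixed_point_quantize(x, bits, dyn_range):
    """Round x onto a signed fixed-point grid with `bits` total bits
    spanning roughly [-dyn_range, +dyn_range]. Illustrative sketch only,
    not the repo's exact quantizer."""
    step = dyn_range * 2.0 ** (1 - bits)          # quantization step size
    x = np.clip(x, -dyn_range, dyn_range - step)  # saturate to the representable range
    return np.round(x / step) * step              # round to the nearest grid point

x = np.array([0.3, -1.7, 0.05])
xq = fixed_point_quantize(x, bits=8, dyn_range=2.0)
```

For in-range values, the quantization error is bounded by half a step, which is what the noise-gain analysis in the paper builds on.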
Theano version: not sure, but I would guess the latest one. Note that the Theano developers have stopped working on it, so whatever was the latest version at the time I prepared this work is effectively the same as the current one. In other words, I am pretty sure the latest version will work.

Python version: 2.

Where to put the data: as you know, there are several places where data is dumped; you can easily figure this out by finding the statements in the code that save data. From those, it is straightforward to determine which folders and files the dumped data corresponds to.
To generate the results in Table 1, you simply need to modify train_quantized.py. Note that steps 1.1 and 1.2 will probably no longer be needed. For instance, for BN, keep everything in full precision except the weights and activations, which are binary (make sure to use -1 and +1 as the two levels, as this is what the BN guys do), and make sure to use full-precision accumulators the way it is described in their paper. Another example, TG: keep everything in full precision (don't use accumulators) and quantize the gradients to 2 bits; you will have to scale according to 2.5*sigma the way they do in their paper. For SQ, use full-precision accumulators but also use stochastic rounding. This one is pretty trivial.
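For reference, stochastic rounding (as used in the SQ baseline) rounds up or down at random with probability proportional to proximity, so the rounding is unbiased in expectation. A minimal sketch of the idea (my own illustration, not the repo's implementation):

```python
import numpy as np

def stochastic_round(x, rng):
    """Unbiased rounding: round up with probability equal to the
    fractional part, so E[stochastic_round(x)] == x."""
    floor = np.floor(x)
    return floor + (rng.random(x.shape) < (x - floor))

rng = np.random.default_rng(0)
rounded = stochastic_round(np.full(100_000, 2.3), rng)
mean = rounded.mean()  # close to 2.3, since the rounding is unbiased
```

The unbiasedness is exactly why stochastic rounding helps when accumulating many small gradient updates: deterministic round-to-nearest would consistently lose updates smaller than half a step, while stochastic rounding preserves them on average.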
Please let me know if you have further specific questions.
Best, Charbel
@charbel-sakr
Hi:
@charbel-sakr So did you mean that if I do not need to quantize gradients, I can simply turn to this PyTorch project (https://github.com/charbel-sakr/Precision-Analysis-Pytorch)?
If so, could you please tell me that
Hi: Yes, that is what I meant. You can use the PyTorch code, but you will have to add the training code yourself.
In the ICML17 paper I was only considering the same precision across the whole network, not per-layer precision assignment. Per-layer assignment came in the ICASSP18 paper I shared above. As you can see, the methodology is quite similar to that of the ICLR19 paper as far as feedforward precisions are concerned.
You would have to write your own training code with quantization in PyTorch. I have not done that myself, but I believe it should not be very hard.
I don't think there will be much difference, since the two other repos on precision analysis for inference (one using Theano, the other PyTorch) yield very consistent results.
In train_quantized.py:
- Disable the quantizeGrad.quantizeGradLX function call everywhere you see it (here X is the layer index); this takes care of activation gradient quantization.
- Disable g = layers.quantizeNormalizedWeight(g,B[l],scale[l]) on line 221; this takes care of weight gradient quantization.
- Replace layers.quantizeNormalizedWeight(remainder,BAcc[l],DRAcc[l]) by remainder on line 227; this takes care of accumulator quantization.
Dear Charbel, thank you for this very interesting paper. Do you have plans to implement the code in PyTorch and Python 3?
Dear @yuginhc
Thank you for your interest in my paper! Currently it is not in my plans to re-implement it in PyTorch and Python 3, as I am swamped with other projects. However, if I do find the time to do so, I will most likely post the code in a new repo and let you know. Thanks!
Dear Charbel,
Thank you for your amazing idea on per-tensor fixed-point quantization. However, from my perspective, the choice of constant for the activation gradients in Lemma 2 of the proof of Claim 1 is ambiguous. The constant you chose for the weight gradients is strictly proved, but I see no sufficient justification for the constant you chose for the activation gradients.
In your paper, you explained
The fact that the true dynamic range of the activation gradients is larger than the value indicated by the second moment.
But why is the constant 4 large enough? What is the reason for this choice? How do you confirm that 4 is OK?
Looking forward to your reply. Thanks.
Regards, Chao
Hello Chao,
Thanks for your question; it is a good one. The reason the constant needs to be multiplied by a factor of 2 is that rectifying activations (e.g., ReLU) do not affect the dynamic range but cause the variance to be divided by a factor of 2. You can think of the distribution of the activations as some Gaussian (to which the same analysis as for the weight gradients applies) mixed with one cluster (a Dirac delta) at zero. I hope this makes sense. Thank you for your interest!
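This halving of the second moment under rectification is easy to check numerically (a quick sanity check, not code from the repo): for a zero-mean Gaussian input, ReLU zeroes half the probability mass, so E[ReLU(x)^2] = E[x^2]/2, while the maximum positive magnitude, and hence the dynamic range, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
relu_x = np.maximum(x, 0.0)

# ReLU zeroes the negative half, so the second moment drops by a factor of 2...
ratio = np.mean(relu_x**2) / np.mean(x**2)   # approximately 0.5

# ...while the largest positive value (the dynamic range) survives rectification
same_max = np.max(relu_x) == np.max(x)
```

Since the precision criterion in Claim 1 compares dynamic range against the second moment, the unchanged range combined with the halved second moment is what forces the extra factor of 2 in the constant.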
Best Regards, Charbel Sakr.
Dear @charbel-sakr ,
Thank you for your clear and helpful explanation!
Best Regards, Chao
Hello @charbel-sakr
Thanks a lot for sharing your code. I was curious how per-tensor fixed-point quantization would work if we removed batch norm during training. Could you please comment on how that would affect overall training? Would the network converge with fixed-point quantization after removing batch norm?
Thanks and regards, Shreyas
Dear Shreyas,
Thanks for reaching out. Your question is very interesting; however, I am afraid I have not tried it out (training without batchnorm). In general, batchnorm is so beneficial to training that I have not considered doing without it. From a theoretical perspective, note that my method is orthogonal to batchnorm, so I do not think there should be any problem as far as convergence is concerned. However, one first has to find a network that converges without batchnorm. I hope this answers your question.
Best Regards, Charbel Sakr.
Dear Charbel
Thanks a lot for your quick response.
Regards, Shreyas
Could you please provide a more detailed README on how to reproduce the results in your paper, given that there are so many .py files in the project?