charbel-sakr / Fixed-Point-Training

Code needed to reproduce results from my ICLR 2019 paper on fixed-point quantization of the backprop algorithm.

Could you please provide a more detailed README file on how to reproduce your paper results, given so many .py files in your project? #1

Open brisker opened 5 years ago

brisker commented 5 years ago

Could you please provide a more detailed README file on how to reproduce your paper results, given so many .py files in your project?

  1. Which python files to run?
  2. Theano version? Python version? Where to put the data?
  3. How to generate the results in Table 1 of your paper, for example CIFAR-10 ResNet with the FX, BN, SQ, and TG configurations?
charbel-sakr commented 5 years ago

Dear DuWang:

Thanks for your comment and interest in my work. I am not going to change the README file because I like to keep it simple. However, I will try to answer your questions here:

  1. Which Python files to run? There are three steps, as you probably know from reading the paper:
     1.1: Run train_baseline/train_probe_save.py to train the baseline and collect all the statistics needed for the precision analysis. Essentially, this saves every statistic needed to compute the equations in Claim 1. Make sure you prepare target folders for the data dumps as needed - you can easily figure this out from reading the code, or from the error you will get at compilation time.
     1.2: To determine precisions, run the remaining .py files in train_baseline/. They compute the precisions of the gradients (as well as their dynamic range), and each file name indicates which precision it computes. Note that these files use the data dumped in step 1.1; feel free to change the criteria if you find the analysis too conservative. Then move your model files to the FF_analysis/ folder and use FFAnalysis there to compute the quantization noise gains. Use those to compute the precision offsets via the simple equation in the paper, implemented in precision_offset.py (see the sketch after this list), and finally use quantized_inference.py to determine the minimum precision according to your p_m budget. At this point we have all precisions except those of the weight accumulators; applying Criterion 5 is easy with train-internal/determineAccPrecision.py once you have selected your learning rate and set the weight and weight-gradient precisions.
     1.3: Run training in fixed point using train-internal/train_quantized.py. Inside this file you will find placeholders for the precisions of all tensors except the activation gradients; because of the implementation, those need to be set inside quantizeGradPredicted.py.

  2. Theano version: not sure, but I would guess the latest one. Note that the Theano developers have stopped working on it, so whatever was the latest release when I prepared this work is still equivalent; in other words, I am pretty sure the latest version will work. Python version: 2. Where to put the data: there are several places where data is dumped, and it is easy to figure this out by looking for the statements in the code that save data - they tell you which folders and files the dumped data corresponds to.

  3. To generate the results in Table 1, you simply need to modify train_quantized.py; note that steps 1.1 and 1.2 will probably no longer be needed. For instance, for BN everything stays in full precision except the weights and activations, which are binary (make sure to use -1 and +1 as the levels, as the BN authors do), and make sure to use full-precision accumulators the way it is described in their paper. Another example, TG: keep everything in full precision (don't use accumulators) and quantize the gradients to 2 bits; you will have to scale according to 2.5*sigma the way they do in their paper. For SQ, use full-precision accumulators but also use stochastic rounding (a minimal stochastic-rounding helper is sketched after this list). This one is pretty trivial.
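
For concreteness, here is a minimal numerical sketch of what step 1.2 boils down to. This is not the repository's actual precision_offset.py or quantized_inference.py; the function name is made up, and it assumes the offsets follow the paper's reasoning that quantization noise variance scales as 2^(-2B), so a tensor whose noise gain E exceeds the smallest gain E_min gets roughly 0.5*log2(E/E_min) extra bits.

```python
# Hypothetical sketch of step 1.2 -- not the repository's code.
import numpy as np

def precision_offsets(noise_gains):
    """Extra bits per tensor, relative to the least noise-sensitive tensor,
    assuming quantization noise variance scales as 2^(-2B)."""
    gains = np.asarray(noise_gains, dtype=np.float64)
    return np.round(0.5 * np.log2(gains / gains.min())).astype(int)

# Made-up noise gains E for four tensors, as FFAnalysis would dump them.
E = [1.0, 3.7, 16.0, 0.9]
offsets = precision_offsets(E)   # -> array([0, 1, 2, 0])

# Per-tensor precision is then B_min + offset; B_min itself is chosen by
# sweeping candidates and keeping the smallest one whose quantized inference
# run (quantized_inference.py in the repo) meets the p_m budget.
for B_min in range(2, 12):
    print(B_min, B_min + offsets)
```

For the SQ baseline in item 3, the only non-obvious piece is unbiased stochastic rounding; the helper below is my own generic version, not the repository's implementation.

```python
import numpy as np

def stochastic_round(x, step):
    """Round x to a multiple of `step`, rounding up with probability equal to
    the fractional part, so the rounding is unbiased in expectation."""
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    return (floor + (np.random.rand(*scaled.shape) < (scaled - floor))) * step
```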

Please let me know if you have further specific questions.

Best, Charbel

brisker commented 5 years ago

@charbel-sakr

  1. Regarding real-world applications: if I only need to search for the best configuration of weight and activation bit precisions, with all gradients and weight accumulators kept in float32 (i.e., without searching the gradient and weight-accumulator bit widths), do your paper and code support this configuration? If yes, how do I configure this setting?
  2. Why are there no ImageNet experiments in this paper?
charbel-sakr commented 5 years ago

Hi:

  1. In this case, just use the code in FF_Analysis and disable quantization of gradients and accumulators in train_quantized.py. Note that this case is very similar to my other repo on quantized inference. You may want to check my ICML and ICASSP papers.
  2. That is because I did not have very strong computational resources at the time I prepared the paper; I am in academia and work with one P100 GPU. I tried to compensate for the lack of ImageNet results with a very comprehensive analysis across many datasets/networks. This is pretty acceptable for venues like ICML/NIPS/ICLR, and the reviewers did agree that the empirical results were good enough. If you wish to try my work on ImageNet and have the computational resources to do so, I would be extremely happy!
brisker commented 5 years ago

@charbel-sakr So did you mean that if I do not need to quantize gradients, I can simply turn to this PyTorch project (https://github.com/charbel-sakr/Precision-Analysis-Pytorch)?

If so, could you please tell me:

  1. Are you sure your ICLR19 paper with no gradients quantized is very similar to your ICML17 paper? In Table 1 of your ICML paper, why is the precision assignment fixed at (8,8), (6,6), (6,9), (4,7) rather than varied across layers? I cannot see the quantization precision assignment process.
  2. How can I reproduce the results in Table 1 of the ICLR19 paper (or similar results) using that PyTorch repo, since I am more familiar with PyTorch? If I knew how to use the PyTorch code, I have enough computational resources and would be willing to run some ImageNet experiments based on it.
  3. If simply disabling gradient and weight-accumulator quantization in this repo generates results A, and using that PyTorch repo generates results B, will A and B differ a lot in accuracy? What is their difference?
  4. I am still a little confused about how to disable gradient and accumulator quantization in the Theano code. Could you please describe this in more detail? I want to quickly reproduce some meaningful results so as to add ImageNet experiments and better understand the paper.
charbel-sakr commented 5 years ago

Hi: Yes, this is what I meant; and yes, you can use the PyTorch code, but you will have to add the training code yourself.

  1. In the ICML17 paper I was only considering the same precision throughout the network, not per-layer precision assignment. That came in the ICASSP18 paper I shared above. As you can see, the methodology is quite similar to that of ICLR19 as far as the feedforward precisions are concerned.

  2. You would have to write your own training code with quantization in PyTorch. I have not done that yet, but I believe it should not be very hard.

  3. I don't think there will be much difference, since the two other repos on precision analysis for inference - one using Theano, the other PyTorch - yield very consistent results.

  4. In train_quantized.py, disable the following (a schematic sketch follows this list):
     - every quantizeGrad.quantizeGradLX function call (here X is the layer index) - this takes care of activation-gradient quantization;
     - g = layers.quantizeNormalizedWeight(g, B[l], scale[l]) on line 221 - this takes care of weight-gradient quantization;
     - replace layers.quantizeNormalizedWeight(remainder, BAcc[l], DRAcc[l]) by remainder on line 227 - this takes care of accumulator quantization.
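
To make these three edits concrete, here is a schematic, self-contained illustration; it is not the real train_quantized.py (only the identifiers quoted above come from the repo), and fake_quantize below is a generic stand-in for the repository's quantizers.

```python
# Schematic illustration of item 4 above -- not the actual train_quantized.py.
import numpy as np

QUANTIZE = False  # False reproduces the "float32 gradients and accumulators" setting

def fake_quantize(x, bits):
    """Generic stand-in quantizer: uniform rounding to `bits` bits in [-1, 1]."""
    step = 2.0 ** (1 - bits)
    return np.clip(np.round(x / step) * step, -1.0, 1.0)

rng = np.random.default_rng(0)
g = rng.normal(scale=0.01, size=(4, 4))          # a weight gradient
remainder = rng.normal(scale=0.1, size=(4, 4))   # a weight-accumulator remainder

# (a) Activation gradients: simply do not call the quantizeGrad.quantizeGradLX
#     equivalents anywhere in the backward pass.

# (b) Weight gradients (line 221 in the repo): skip the quantization call.
if QUANTIZE:
    g = fake_quantize(g, bits=8)                 # original behaviour
# else: g stays in floating point

# (c) Accumulators (line 227 in the repo): use the raw remainder instead of
#     layers.quantizeNormalizedWeight(remainder, BAcc[l], DRAcc[l]).
acc = fake_quantize(remainder, bits=12) if QUANTIZE else remainder
print(acc[:2, :2])
```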

yuginhc commented 5 years ago

Dear Charbel, thank you for this very interesting paper. Do you have plans to implement the code in PyTorch and Python 3?

charbel-sakr commented 5 years ago

Dear @yuginhc

Thank you for your interest in my paper! Currently it is not in my plans to re-implement it in PyTorch and Python 3, as I am swamped with other projects. However, if I do find the time to do so, I will most likely post the code in a new repo and let you know. Thanks!

fantasysee commented 5 years ago

Dear Charbel,

Thank you for your amazing idea on per-tensor fixed-point quantization. However, from my perspective, the choice of constant for the activation gradients in Lemma 2 of the proof of Claim 1 is ambiguous. The constant you chose for the weight gradients is strictly proved, but I think there is not enough justification for the constant you chose for the activation gradients.

In your paper, you explained

The fact that the true dynamic range of the activation gradients is larger than the value indicated by the second moment.

But why is the constant 4 large enough? What is the reason for this choice? How do you confirm that 4 is OK?

Looking forward to your reply. Thanks.

Regards, Chao

charbel-sakr commented 5 years ago

Hello Chao,

Thanks for your question - it is a good one. The reason the constant needs to be multiplied by a factor of 2 is that rectifying activations (e.g., ReLU) do not affect the dynamic range but cause the variance to be divided by a factor of 2. You can think of the distribution of the activations as some Gaussian (to which the same analysis as for the weight gradients applies) mixed with one cluster (a Dirac delta) at zero. I hope this makes sense. Thank you for your interest!
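
If a quick numerical check helps, the snippet below (my own illustration, not from the paper) zeroes out roughly half of the entries of a Gaussian tensor, the way a ReLU-derived mask would: the variance drops by about a factor of 2 while the maximum magnitude barely changes, which is why a dynamic-range estimate based on the second moment alone has to be widened.

```python
# Numerical check of the "Gaussian mixed with a Dirac delta at zero" picture.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10**6)       # Gaussian tensor
mask = rng.random(10**6) < 0.5       # ReLU-style mask: about half the entries survive
y = np.where(mask, x, 0.0)           # Gaussian mixed with a spike at zero

print(np.var(x), np.var(y))              # ~1.0 vs ~0.5: variance halved
print(np.abs(x).max(), np.abs(y).max())  # dynamic range essentially unchanged
```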

Best Regards, Charbel Sakr.


fantasysee commented 5 years ago

Dear @charbel-sakr ,

Thank you for your clear and helpful explanation!

Best Regards, Chao

skvenka5 commented 5 years ago

Hello @charbel-sakr

Thanks a lot for sharing your code. I was curious to know how the per-tensor fixed-point quantization would work if we removed batch norm during training. Could you please comment on how it would affect the overall training? Would the network converge with fixed-point quantization after removing batch norm?

thanks and regards Shreyas

charbel-sakr commented 5 years ago

Dear Shreyas,

Thanks for reaching out. Your question is very interesting; however, I am afraid I have not tried it (training without batchnorm). In general, batchnorm is so beneficial for training that I have not considered training without it. From a theoretical perspective, note that my method is orthogonal to batchnorm, so I do not think there should be any problem as far as convergence is concerned. However, one would first have to find a network that converges without batchnorm. I hope this answers your question.

Best Regards, Charbel Sakr.


skvenka5 commented 5 years ago

Dear Charbel

Thanks a lot for your quick response.

regards Shreyas