@karanchahal this sounds great. let's add both and we can use the official PyTorch version when it's ready!
The first one as a trainer option:
Trainer(quantize_bits=4)
The second, after training, can be called on the Module:
trainer.fit(model)
model.quantize(bits=8)
@karanchahal submit a PR and we can walk through the implementation!
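For illustration, a minimal sketch of what the post-training model.quantize(bits=8) call might do under the hood (the helper and everything inside it are assumptions, not an existing Lightning or PyTorch API; it simulates quantization by snapping each parameter to its nearest representable low-bit value):

```python
import torch

@torch.no_grad()
def quantize_module_(module, bits=8):
    # Hypothetical helper behind model.quantize(bits=8): replace each parameter
    # with its dequantized low-bit approximation (simulated quantization).
    qmax = 2 ** bits - 1
    for p in module.parameters():
        scale = (p.max() - p.min()).clamp(min=1e-8) / qmax
        zero_point = torch.round(-p.min() / scale)
        q = torch.clamp(torch.round(p / scale) + zero_point, 0, qmax)
        p.copy_(scale * (q - zero_point))  # store the dequantized values
```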
@karanchahal can you please check the link you provided for the pruning notebook? I think it's the same link as the quantization notebook. Also, regarding the implementation of neural network pruning: I found that masking the weights we want to prune is very simple to implement (see the sketch below), but if we keep the weight tensors in the same datatype as before, we still have to do the entire matrix multiplication. While multiplications with zeros take less time, I believe this is really inefficient when you prune 90% of the weights but still have to perform the full matrix multiplication. Are you familiar with a way to handle sparse weights more efficiently in PyTorch, or some other way to re-structure the network based on the pruned weights (assuming unstructured pruning)?
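For reference, the masking approach I mean is roughly this (a minimal magnitude-pruning sketch, not the notebook's exact code):

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    # Zero out the smallest-magnitude weights; keep the boolean mask so the
    # zeros can be re-applied after each optimizer step.
    k = max(1, int(sparsity * weight.numel()))
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask, mask
```

The pruned weights are still dense float tensors, which is exactly why the full multiplication still happens.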
Hello,
This conversation between me and Tim Dettmers might interest you regarding the challenges of attaining real-world speed-ups with sparse weights: https://github.com/TimDettmers/sparse_learning/issues/1
My apologies for the wrong link, I'll update it soon and let you know.
Best, Karan
Thanks for the reply! I, too, was unaware of the many challenges of working with sparse tensors. But I'm really interested in implementing custom layers in PyTorch just for inference (only writing the forward pass, perhaps using the torch.sparse API) once we have all the boolean masks. Would you be interested in collaborating on implementing such layers? Perhaps we can start specifically with linear layers and then extend to other types of layers.
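Something like this minimal sketch is what I have in mind (assuming COO via to_sparse(); the class name is made up):

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Module):
    # Inference-only linear layer storing the pruned weight in sparse COO format.
    def __init__(self, weight, bias, mask):
        super().__init__()
        self.weight = (weight * mask).to_sparse()  # (out_features, in_features)
        self.bias = bias

    def forward(self, x):
        # torch.sparse.mm wants the sparse operand first: (out, in) @ (in, batch)
        return torch.sparse.mm(self.weight, x.t()).t() + self.bias
```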
Hey, sure, I was quite interested in this actually. Some great work has been done on fast sparse kernels (https://openreview.net/forum?id=rJPcZ3txx, https://arxiv.org/abs/1702.08597, https://arxiv.org/abs/1802.10280), but it's certainly an area of active research.
I haven't read these papers, but I've heard they're a good place to start. Let's read them and then report back here with what we've learnt?
Best, Karanbir Chahal
Great! Will start reading these papers.
I read through the ICLR 17 paper (https://openreview.net/forum?id=rJPcZ3txx) and implemented their algorithm in Python (link to colab: https://colab.research.google.com/drive/1MpDzO70S--zGDWjpcwx7uBgSDunKkDhy). It is not the most efficient implementation, as I used Python loops, but the key takeaway is that it gets faster as sparsity in the weights increases, whereas PyTorch's conv2d needs almost the same time at all sparsity levels (even with all-zero weights). I will try to implement the algorithm using PyTorch's C++ extension functionality (I haven't worked with it before), but first I need to figure out how to use CSR sparse matrices in PyTorch (currently I am using scipy). If you have any suggestions please let me know!
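For comparison, a simple baseline (not the paper's direct sparse convolution) is to flatten the pruned filters into one scipy CSR matrix and multiply it with im2col patches; roughly:

```python
import torch
import torch.nn.functional as F
from scipy.sparse import csr_matrix

def sparse_conv2d(x, weight, stride=1, padding=0):
    # x: (N, C, H, W) dense input; weight: (out_c, C, kh, kw) with many zeros.
    out_c, in_c, kh, kw = weight.shape
    w_csr = csr_matrix(weight.reshape(out_c, -1).numpy())          # (out_c, C*kh*kw)
    cols = F.unfold(x, (kh, kw), stride=stride, padding=padding)   # (N, C*kh*kw, L)
    n, _, L = cols.shape
    flat = cols.permute(1, 0, 2).reshape(in_c * kh * kw, -1).numpy()
    out = w_csr @ flat                                             # (out_c, N*L), dense
    h_out = (x.shape[2] + 2 * padding - kh) // stride + 1
    w_out = (x.shape[3] + 2 * padding - kw) // stride + 1
    return torch.from_numpy(out).reshape(out_c, n, h_out, w_out).permute(1, 0, 2, 3)
```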
This is pretty interesting! Great work!
I see you're using numba to run it on the GPU, if I'm not mistaken.
I wonder if numba converts the Python loops into C/C++; if not, using C++ extensions might be a worthwhile exercise.
I was also wondering whether combining Cython with numba would be the easier way to go.
The speed increase is definitely encouraging. I think tuning this implementation could get us below 4 ms. By the way, what do the PyTorch people use for conv2d: plain im2col, or something fancy like a Winograd algorithm? Mostly I feel they must have really optimised the loading and unloading of data to and from the GPU. We'll have a tough time beating cuDNN's super-optimised implementation, but it's definitely worth trying!
I've been traveling a lot this week and have been unable to read the papers or code :/ I'll try to read up soon and study your implementation.
On another note, the good news is that I've almost got quantization-aware training working (inference in 4 bits!).
Apologies again for the late response :)
Best, Karanbir Chahal
I used the numba jit decorator for the sparse convolution function, and it runs on the CPU (implemented using scipy sparse arrays). I expected it to convert the Python loops to compiled code, but when I use nopython=True to compile the entire function I get an error, because numba cannot recognize the scipy sparse matrix format and treats it as a regular Python object.
I too think that I should first try to make the implementation work with Cython and numba before a C++ implementation.
Regarding PyTorch's conv, I think it uses im2col, but I'm not sure. I also think that if we can somehow implement this paper's algorithm using torch's built-in functions and/or optimize the loops, we can get a faster layer.
Will try out a few things this weekend and let you know if I get any improvements.
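One workaround I want to try (an assumption on my part that it fits our case): pass the CSR component arrays into the jitted function instead of the scipy object itself, since nopython mode handles plain NumPy arrays fine:

```python
import numpy as np
from numba import njit

@njit
def csr_row_dot(data, indices, indptr, row, x):
    # Dot product of one CSR row with a dense vector; works in nopython mode
    # because only plain NumPy arrays cross the function boundary.
    acc = 0.0
    for k in range(indptr[row], indptr[row + 1]):
        acc += data[k] * x[indices[k]]
    return acc

# Usage: w = scipy.sparse.csr_matrix(...); y = csr_row_dot(w.data, w.indices, w.indptr, 0, x)
```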
Ahh okay, well PyTorch has TorchScript, which we can try as well.
It uses a JIT too and applies optimizations for PyTorch tensors. I don't know if it's possible to get it working with the scipy sparse format, though.
Can we use PyTorch's sparse tensor format (COO) instead of the one scipy uses?
Thanks again for this great work!
Best, Karan
The paper actually mentions using the CSR format, as row slicing is very fast. Not sure if the COO format would be as efficient, but we can try. Converting from COO to CSR should be possible (though I'm not sure how) with small computational overhead.
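In scipy, at least, the conversion is a one-liner; a rough sketch of round-tripping a pruned PyTorch tensor through COO into CSR:

```python
import torch
from scipy.sparse import coo_matrix

w = torch.randn(256, 27)
w[w.abs() < 1.0] = 0.0              # stand-in for a pruned weight matrix
w_coo = w.to_sparse()               # PyTorch's native sparse format is COO

idx = w_coo.indices().numpy()       # 2 x nnz array of (row, col) coordinates
coo = coo_matrix((w_coo.values().numpy(), (idx[0], idx[1])), shape=w.shape)
csr = coo.tocsr()                   # cheap conversion; CSR gives fast row slicing
```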
@shivamsaboo17 @karanchahal https://gitter.im/PyTorch-Lightning/community?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge
super excited about this feature!
@karanchahal @williamFalcon I ported the pure Python code to Cython and got significant speedups. My experiments are on a 3x64x64 input tensor with filters of size 256x3x3x3.
Pure Python: 50% sparse --> 45 seconds, 90% sparse --> 11 seconds, 100% sparse --> 60 ms
Cython optimized: 50% sparse --> 13 ms, 90% sparse --> 5 ms, 100% sparse --> 661 microseconds
For reference: PyTorch's conv2d took 1.9 ms on my machine (CPU). (Previous results were on Colab (CPU).)
Google Drive links to the .pyx and .ipynb files: https://drive.google.com/open?id=1gnrbFNWJBZbyPH6KKnCLmrPBNqOFtKUD https://drive.google.com/open?id=1--_B89H4iSZuJuj9QKqBRrB5Tlr7DMnH
Link to compiled C file: https://drive.google.com/open?id=1nCGKRmM4AGcmepEJCkWAl_SBZc2l-rrA
I am looking at more ways to optimize cython code now.
@sidhanthholalkere @karanchahal spoke with @soumith about this. I think this is better added to core PyTorch. Check out this issue.
Once it's merged and live there we can do whatever we need to do to support it.
Closing to move this work to the PyTorch issue.
Note that we have a notebook with a preview tutorial on eager-mode post-training quantization in core PyTorch over in https://github.com/pytorch/pytorch/issues/18318 ... please check it out and leave feedback.
Is your feature request related to a problem? Please describe.
Nowadays, there is a need to take the floating-point models that have been trained and deploy them to edge devices. One popular approach is to quantise the weights and activations of a neural network to a lower bit width (e.g. 8 bits or even 4 bits). The benefits of this are twofold: the model gets much smaller (an 8-bit model takes roughly a quarter of the memory of a 32-bit one), and low-bit integer arithmetic is faster and more power-efficient on most edge hardware.
People have tried other means to compress a model; one of them is pruning. Pruning basically means that some of the weights of a neural network are set to zero, hence we seek to introduce sparsity into the network.
The benefit is that you potentially do not have to perform the useless multiplications with zeros, providing a potential computation saving. Research has shown that even after pruning ~80% of weights (this is fine-grained pruning), the network preserves its accuracy, which is a very surprising result. Coarse-grained pruning (setting all weights of a channel to zero) also works to an extent but results in significantly more accuracy loss. This is an active research area.
Describe the solution you'd like
Generally, quantisation works through the use of a scale value and a zero-point value, so each quantised tensor needs to carry the quantised values, its scale, and its zero point. The scale and zero point are needed to convert between quantised and dequantised tensors.
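As a rough sketch of the conversion (asymmetric per-tensor quantisation; the exact rounding and clamping choices vary between papers):

```python
import torch

def quantise(x, bits=8):
    # q = round(x / scale) + zero_point, clamped to the representable range
    qmin, qmax = 0, 2 ** bits - 1
    scale = (x.max() - x.min()).clamp(min=1e-8) / (qmax - qmin)
    zero_point = int(qmin - torch.round(x.min() / scale))
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return q, scale, zero_point

def dequantise(q, scale, zero_point):
    return scale * (q - zero_point)
```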
There are 2 ways to quantize a model: post-training quantisation, and quantisation-aware training.
I have successfully implemented the post-training quantisation algorithms and was able to get a quantised MNIST model down to 8 bits with next to no accuracy loss. Going down to 4 bits resulted in the model diverging. I am currently working on quant-aware training. If you want to see how post-training quantisation works, please check out this Google Colab notebook.
Now, let's come to pruning:
Pruning is a very general thing; there are a lot of ways to perform it. As far as I know, there is generally a "pruning schedule": the researcher decides when to prune and what percentage of weights to remove (i.e. the degree of sparsity of the layer). They could prune some layers and leave others as is, and slowly increase the sparsity degree of the pruned layers over the course of training. There are also different types of pruning: a structured way to prune weights (e.g. taking off full channels of a conv kernel, or reducing a dimension of a fully connected layer by 1) and an unstructured way to prune (zeroing out individual weights). Lightning could potentially offer both a structured and an unstructured way to prune to help out researchers. If you would like to see pruning in action, I have tried pruning on an MNIST model using the algorithm from the Google paper "To prune, or not to prune" (its sparsity schedule is sketched below). It is unstructured pruning at 90% sparsity, and I got roughly the same accuracy as the un-pruned model. This is the Google Colab link for it.
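That paper's schedule is a cubic ramp from an initial sparsity s_i to a final sparsity s_f over n pruning steps; a minimal sketch (parameter names are mine):

```python
def sparsity_at_step(t, s_i=0.0, s_f=0.9, t0=0, n=100, dt=1):
    # s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt)) ** 3, for t0 <= t <= t0 + n*dt
    if t <= t0:
        return s_i
    if t >= t0 + n * dt:
        return s_f
    return s_f + (s_i - s_f) * (1.0 - (t - t0) / (n * dt)) ** 3
```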
Describe alternatives you've considered
Right now PyTorch doesn't have quantization and pruning support; however, that is in the works. We could either wait for them to complete their work, or we could implement a small library ourselves.
The use case I was trying to target is that Lightning could become a playground where researchers can test out quantisation and pruning on their models, and potentially implement novel algorithms through its base support.
Additional context
If any of you want to learn more about quantization, I have listed the resources I learnt from below. They were indeed invaluable.
Benoit Jacob et al.'s quantisation paper (Google)
Raghuraman Krishnamoorthi's paper on quantisation (Google; he's now at Facebook)
Distiller docs on quantisation
gemmlowp's quantisation tutorial