Ahmad-Jarrar opened this issue 1 year ago:

I am trying to run your experiments on CIFAR-10 as described in q_resnet_uint8_train_val.yml. However, I am getting poor performance at lower bit-widths. I have tried several tweaks to the config file. The result of the latest experiment is:

I have used these parameters:

Kindly let me know how I can improve the results and what I am doing wrong.
Thanks for your interest in our work. Could you please try bit-widths [3, 4, 5] to see whether the issue persists? Also, what is the performance at [2] bits?
I have tested again using the config file provided, changing only the bit-widths.
As you can see, with 5, 4, 3 bits the performance is fine, but with 4, 3, 2 bits the performance is much worse at all bit levels.
Hi @Ahmad-Jarrar, sorry for this; the quantization scheme proposed in the paper does not converge for low bit-widths, and a small modification is necessary. I thought I had posted this before... For proper convergence, the weights should have a vanishing mean, in addition to the usual variance requirements. To this end, you should use the following quantization method:
$$ q = 2 \cdot \frac{1}{2^b}\Bigg(\mathrm{clip}\bigg(\Big\lfloor 2^b\cdot\frac{w+1}{2} \Big\rfloor,0,2^b-1\bigg)+\frac{1}{2}\Bigg) - 1 $$
This guarantees a centered distribution of quantization levels for the weights.
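Concretely, for $b$ bits the quantized levels are

$$ q_k = \frac{2k+1}{2^b} - 1, \qquad k = 0, 1, \dots, 2^b - 1, $$

and since $q_{2^b-1-k} = -q_k$, the level grid is symmetric about zero.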
The code is something like this:
a = 1 << bit                      # number of quantization levels, 2^b
res = torch.floor(a * input)      # input is assumed to lie in [0, 1]
res = torch.clamp(res, max=a - 1) # clip to the range 0 .. 2^b - 1
res.add_(0.5)                     # shift to half-integer levels
res.div_(a)                       # rescale back into [0, 1]

inside the q_k function.
You could also try this for activation quantization (without applying the outermost remapping 2x - 1), but I have not tried that before.
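For reference, a rough, untested sketch of how the pieces could fit together (the helper names quantize_weight and quantize_activation are just illustrative here, not the functions in the repo):

import torch

def q_k(input, bit):
    # quantize values in [0, 1] to the centered levels (k + 0.5) / 2^b, k = 0 .. 2^b - 1
    a = 1 << bit
    res = torch.floor(a * input)
    res = torch.clamp(res, max=a - 1)
    res.add_(0.5)
    res.div_(a)
    return res

def quantize_weight(w, bit):
    # weights in [-1, 1]: map to [0, 1], quantize, then apply the outer remapping 2x - 1
    return 2 * q_k((w + 1) / 2, bit) - 1

def quantize_activation(x, bit):
    # activations are clipped to [0, 1]; no 2x - 1 remapping here
    return q_k(torch.clamp(x, 0, 1), bit)

Note this only shows the forward computation; the straight-through estimator used for gradients during training is omitted.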
I will update the code and readme accordingly.
Best.
Hi @Ahmad-Jarrar, I have updated the readme. Hope it is clear. Thanks again for your interest in our work.
If I'm not wrong, the code given does not apply the outermost 2x-1.
https://github.com/deJQK/AdaBits/blob/master/models/quant_ops.py#L142-L143
Yes, I noticed it later. Thank you so much for your help.
Hello @deJQK, I don't understand what is meant by "For proper convergence, the weights should have a vanishing mean, in addition to the usual variance requirements." Why is a vanishing mean for the weights needed for proper convergence? Could you give a specific explanation? Additionally, this formula doesn't seem to match the code here:
Looking forward to your reply, thank you.
Hi @haiduo, you could check these papers: https://arxiv.org/pdf/1502.01852.pdf, https://arxiv.org/pdf/1606.05340.pdf, https://arxiv.org/pdf/1611.01232.pdf, all of which analyze training dynamics for centered weights. I am not sure how to analyze weights with a nonzero mean.
Thank you for your reply, @deJQK! So the "vanishing mean for weights" just means adding 0.5 inside the q_k function, and everything else stays the same, right? For b=4 I interpret this as mapping [-1, 1] to [0, 1], to [0, 15], to [0.5, 15.5], to [-0.9375, 0.9375]; does that correspond to the third picture below, "Centered Symmetric"?
It doesn't seem right to me, and I am confused. If it is convenient, could you please send me the code for the four diagrams above? Maybe then I will understand. Thank you very much! You can email me at huanghd@stu.xjtu.edu.cn; I am very interested in your work.
Hi @haiduo, thanks again for your interest. For b=4, it maps [-1, 1] to [0, 1], to {0, 1, ..., 15}, to {0.5, 1.5, ..., 15.5}, to {1/32, 3/32, ..., 31/32}, to {-15/16, -13/16, ..., 13/16, 15/16}. Code for all four schemes is available in the repo and you could check the related lines.
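As a quick sanity check (just illustrative, plain Python):

bit = 4
a = 1 << bit                                    # 16 levels
levels = [2 * ((k + 0.5) / a) - 1 for k in range(a)]
print(levels[0], levels[-1])                    # -0.9375 (-15/16) and 0.9375 (15/16)
print(sum(levels) / a)                          # 0.0, i.e. the level grid is centered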
OK, thank you!
Hi @deJQK, sorry, one more question: could you answer my two questions above, namely whether the only change is adding 0.5 inside the q_k function, and whether everything else stays the same?

@haiduo, yes for both.