HuangOwen / QAT-ACS

Official PyTorch implementation of paper "Efficient Quantization-aware Training with Adaptive Coreset Selection"
MIT License

Why are there no extended experiments on LLMs or large vision transformers? #1

Open brisker opened 1 year ago

brisker commented 1 year ago

Why are there no extended experiments on LLMs or large vision transformers?

HuangOwen commented 1 year ago

Thanks for the great question and your interest in our work! I have conducted experiments with DeiT-T and Swin-T on ImageNet, and the results are also promising. I will provide the results and code for the ViTs later.

For the LLMs, as you may know, most previous work is PTQ, which uses only a small calibration dataset. For example, in GPTQ, the calibration data consists of 128 random 2048-token segments from the C4 dataset. Since the calibration set is already that small, improving data efficiency is not an urgent problem there.
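
For concreteness, here is a minimal, hypothetical sketch of that style of calibration sampling, i.e. drawing 128 random 2048-token segments from C4. It assumes the HuggingFace `datasets` and `transformers` libraries; the tokenizer name and function name are placeholders, not code from this repo or from GPTQ.

```python
# Hypothetical sketch of GPTQ-style calibration sampling: 128 random
# 2048-token segments from C4. Library usage is standard HuggingFace;
# names are illustrative only.
import random
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def sample_calibration_segments(tokenizer_name="facebook/opt-125m",
                                n_samples=128, seq_len=2048, seed=0):
    random.seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    # Stream C4 so we never download the full corpus.
    data = load_dataset("allenai/c4", "en", split="train", streaming=True)
    segments = []
    for example in data:
        ids = tokenizer(example["text"], return_tensors="pt").input_ids
        if ids.shape[1] < seq_len:
            continue  # skip documents shorter than one segment
        start = random.randint(0, ids.shape[1] - seq_len)
        segments.append(ids[:, start:start + seq_len])
        if len(segments) == n_samples:
            break
    return torch.cat(segments, dim=0)  # shape: (n_samples, seq_len)
```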

Recently, some QAT methods for LLMs have emerged. For example, you can check out LLM-QAT: Data-Free Quantization Aware Training for Large Language Models. We will include coreset selection for LLM-QAT in our future work.

brisker commented 1 year ago

@HuangOwen I have a further question: since the motivation of this work is to accelerate QAT, have you tried any experiments on gradient quantization? The backprop process can be computation-intensive, even more so than forward inference, so gradient quantization seems like a natural choice for accelerating QAT.

HuangOwen commented 1 year ago

Thanks for your question. We currently have no plans for gradient quantization. This work's motivation is to improve the efficiency of QAT, in terms of both data and training time: if we only use 10% of the data, the forward and backward passes on the remaining 90% are skipped entirely, which reduces wall-clock training time.
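
To make the time saving concrete, here is a minimal sketch of training on a 10% subset. A random subset is used here as a stand-in for the paper's adaptive coreset selection, and the dataset and loader settings are placeholders.

```python
# Minimal sketch: train on 10% of the data so that ~90% of forward and
# backward passes are skipped each epoch. Random indices are a placeholder
# for the adaptive coreset selection scores.
import torch
from torch.utils.data import DataLoader, Subset
import torchvision
import torchvision.transforms as T

full_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())

fraction = 0.10
keep_idx = torch.randperm(len(full_set))[:int(fraction * len(full_set))]
subset_loader = DataLoader(Subset(full_set, keep_idx.tolist()),
                           batch_size=128, shuffle=True, num_workers=4)

# The training loop is unchanged; it simply iterates over subset_loader,
# so each epoch costs roughly 10% of the full-data forward/backward time.
```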

I totally agree that quantizing the gradients is a natural choice. In fact, many previous works on fully quantized training have conducted similar experiments; see, for example, DoReFa-Net and follow-up work at NeurIPS 2018, ICLR 2020, and CVPR 2020. The problem is that most of these works are limited to 8-bit gradients, as gradients are much more sensitive to quantization than weights and activations.
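
For readers unfamiliar with the idea, below is a minimal sketch of 8-bit gradient fake-quantization implemented as a custom autograd function. It illustrates the general technique only; it is not the method of DoReFa-Net or code from this repo.

```python
# Sketch of 8-bit gradient quantization: the forward pass is the identity,
# the backward pass fake-quantizes the incoming gradient with a per-tensor
# symmetric scale.
import torch

class GradQuant8bit(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_out):
        # Per-tensor symmetric scale for the signed 8-bit range [-127, 127].
        scale = grad_out.abs().max().clamp(min=1e-12) / 127.0
        q = torch.clamp(torch.round(grad_out / scale), -127, 127)
        return q * scale  # dequantized ("fake-quantized") gradient

def grad_quant(x):
    return GradQuant8bit.apply(x)

# Usage: insert between layers so the gradient flowing upstream is quantized.
x = torch.randn(4, 16, requires_grad=True)
y = grad_quant(x) * torch.randn(4, 16)  # arbitrary downstream computation
y.sum().backward()
# x.grad now takes at most 255 distinct values instead of full fp32 precision.
```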