Closed zshn25 closed 1 year ago
Adding to what Denis said, you may also want to take a look at our OBC work https://github.com/IST-DASLab/OBC, where we study also vision model compression with a more accurate but less scalable predecessor of GPTQ.
There is a quant_cuda_kernel for the quant_linear module, but it cannot be applied to conv layers, right? So conv layers can only be fake-quantized?
You can do it in a similar fashion.
Take the dataset of interest, select some subset of it, say 1024 images, and propagate the activations in the same way as is done here for the OPT, BLOOM, and LLaMA models. For ViTs the procedure is very similar; for ResNets you would need to wrap the conv layers with the GPTQ class and run quantize on them.
However, for vision models 3/4-bit quantization would likely degrade performance significantly, and there is usually not much need for it, since these models are already quite small, unless you are quantizing something like the recent ViT-22B.
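To make the "run quantize on the conv layers" step concrete, here is a minimal numpy sketch of the per-output-channel 4-bit quantization grid that such a step would target. This is not the repo's GPTQ implementation (which additionally uses second-order activation statistics to decide rounding); it is plain round-to-nearest onto the same kind of grid, with all names being illustrative:

```python
import numpy as np

# Hypothetical stand-in for quantizing a conv weight tensor to 4 bits.
# GPTQ would pick the same per-channel grid but round using Hessian info.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 3, 3, 3)).astype(np.float32)  # [out_c, in_c, kh, kw]

bits = 4
maxq = 2 ** bits - 1
Wf = W.reshape(W.shape[0], -1)            # flatten per output channel
wmin = Wf.min(axis=1, keepdims=True)
wmax = Wf.max(axis=1, keepdims=True)
scale = (wmax - wmin) / maxq              # per-channel step size
zero = np.round(-wmin / scale)            # per-channel zero point

Q = np.clip(np.round(Wf / scale) + zero, 0, maxq)  # integer codes in [0, 15]
W_deq = ((Q - zero) * scale).reshape(W.shape)      # dequantized weights

print(np.abs(W - W_deq).max())  # reconstruction error, bounded by ~scale
```

The integer codes `Q` plus `scale` and `zero` are what you would store; the dequantized `W_deq` is what a "fake quantization" forward pass would use.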
Yes, our sample kernels here are designed for the memory-bound single-token linear-layer case that occurs in very large generative NLP models. Probably the simplest option for running quantized conv layers would be to fully decompress them on the fly before executing the corresponding convolution. This would still give you memory savings during inference (you only need to keep a single dequantized layer in memory at a time) but will not give you any speedups (in any case, vision models are usually not memory-bound, so weight-only quantization is unlikely to speed them up).
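The decompress-on-the-fly idea can be sketched in numpy: store the conv weights as 4-bit codes packed two per byte, unpack and dequantize a single layer right before its convolution, then free the float copy. The helper names (`pack4`, `unpack4`) and the scale/zero values are illustrative, not from the repo:

```python
import numpy as np

def pack4(q):
    # q: uint8 codes in [0, 15], even number of elements; two codes per byte
    q = q.reshape(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).astype(np.uint8)

def unpack4(p):
    # inverse of pack4: low nibble first, then high nibble
    return np.stack([p & 0x0F, p >> 4], axis=1).reshape(-1)

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(8, 3, 3, 3), dtype=np.uint8)
scale, zero = np.float32(0.02), 8  # assumed per-layer quantization params

packed = pack4(codes.reshape(-1))  # half the memory of one-byte-per-code
# decompress on the fly, just before running this layer's convolution:
W = (unpack4(packed).astype(np.float32) - zero).reshape(codes.shape) * scale
# ...execute the standard float convolution with W, then discard W

print(W.shape)  # (8, 3, 3, 3)
```

Only one layer's `W` ever exists in float form at a time, which is where the memory saving comes from; the convolution itself still runs at full precision, which is why there is no speedup.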