Vahe1994 / SpQR


Post-quantization for NLLB models #19

Open · Arnab1181412 opened this issue 1 year ago

Arnab1181412 commented 1 year ago

Hi @Vahe1994,

I have fine-tuned Facebook's NLLB model on my custom dataset for language translation. Could you provide guidance on how to perform SpQR quantization of this fine-tuned model? Specifically, I am interested in post-training quantization methodologies.

Thanks in advance, and great work implementing SpQR!

Vahe1994 commented 1 year ago

Hello! Sorry for the late answer. Unfortunately, we have not tried the SpQR technique on encoder-decoder models. While this is speculative on my part, I believe that since SpQR (like GPTQ) performs quantization per layer, the encoder component of the model would require minimal changes to be compatible with SpQR (such as adjusting layer names and potentially the activation caching, as in this code snippet: https://github.com/Vahe1994/SpQR/blob/1c27ed6294d31f8f508ef02f95fb2bac0337d0a6/main.py#L114C46-L114C47). The decoder component, however, would also need the last activations from the encoder (the encoder hidden states consumed by cross-attention) in order to compute the inputs and outputs of the linear layers in the decoder blocks. Once you have the input, output, and weights of a layer, you can run the SpQR engine on it. Therefore, the main part that requires modification is main.py, where you need to retrieve the input and output of each linear layer you want to quantize.
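
A minimal sketch of that data-collection step (not code from this repository): registering forward hooks on the `nn.Linear` modules of one NLLB decoder block so their calibration inputs can later be handed to a SpQR/GPTQ-style engine. The checkpoint name, hook logic, and calibration text are illustrative assumptions on my part.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(
    model_name, src_lang="eng_Latn", tgt_lang="fra_Latn"
)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

# Pick one decoder block; NLLB (M2M100 architecture) exposes them here.
block = model.model.decoder.layers[0]

captured = {}   # sublayer name -> list of input activations
handles = []

def make_hook(name):
    def hook(module, inputs, output):
        # inputs[0] is the activation fed into this nn.Linear; together with
        # the weights (and outputs), this is what the quantizer needs.
        captured.setdefault(name, []).append(inputs[0].detach().cpu())
    return hook

for name, module in block.named_modules():
    if isinstance(module, torch.nn.Linear):
        handles.append(module.register_forward_hook(make_hook(name)))

# One calibration example; in practice, loop over a calibration set.
batch = tokenizer(
    "The quick brown fox jumps over the lazy dog.",
    text_target="Le renard brun saute par-dessus le chien paresseux.",
    return_tensors="pt",
)

with torch.no_grad():
    # A full forward pass runs the encoder and feeds its hidden states into
    # the decoder's cross-attention, so the captured decoder inputs already
    # reflect the encoder activations mentioned above.
    model(**batch)

for h in handles:
    h.remove()

for name, acts in captured.items():
    print(name, acts[0].shape)
```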

You can take a look at the T5 example in GPTQ-for-LLaMa for reference: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/t5/t5.py . If you encounter any problems, please let us know and we will try to help you.
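
For orientation when adapting the layer names, here is a small, hypothetical helper (not from either repository) that lists the linear sublayers of every NLLB encoder and decoder block, i.e. the per-layer inventory such a quantization loop iterates over. The module names follow the M2M100 architecture that NLLB uses; the checkpoint name is again an assumption.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

def find_linear_sublayers(block):
    """Map sublayer name -> nn.Linear for one transformer block."""
    return {
        name: mod
        for name, mod in block.named_modules()
        if isinstance(mod, torch.nn.Linear)
    }

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

for side, blocks in [("encoder", model.model.encoder.layers),
                     ("decoder", model.model.decoder.layers)]:
    for i, block in enumerate(blocks):
        # Encoder blocks expose self_attn.{q,k,v,out}_proj, fc1, fc2;
        # decoder blocks additionally expose encoder_attn.{q,k,v,out}_proj.
        print(side, i, sorted(find_linear_sublayers(block)))
```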