Vahe1994 / AQLM

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852).

Quantization time & VRAM requirements #8

Closed: oobabooga closed this issue 7 months ago

oobabooga commented 7 months ago

Hello,

I have two basic questions:

1. Do you have any data on how long it takes to quantize a 70B model using 24 GB of VRAM (assuming that's possible)?
2. Do you plan to release prequantized models on Hugging Face? Having llama-2-70b for comparison with other methods would be useful.

Vahe1994 commented 7 months ago

Hello!

  1. With the current code (without any major changes), I don't think the Llama-2 70B model will fit into 24 GB of VRAM for the quantization phase; a rough size estimate follows below.
  2. Yes, we are planning to do so.
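
For rough intuition (not stated in the thread itself): merely holding a 70B-parameter model's weights in fp16 already takes on the order of 130 GiB, far beyond 24 GB, before counting activations or any working buffers used during quantization. A back-of-envelope sketch, assuming 2 bytes per parameter:

```python
# Back-of-envelope estimate, assuming weights are held in fp16/bf16 (2 bytes each).
n_params = 70e9          # Llama-2 70B parameter count
bytes_per_param = 2      # fp16
weight_gib = n_params * bytes_per_param / 1024**3
print(f"~{weight_gib:.0f} GiB just for the raw weights")  # ~130 GiB
```
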
Vahe1994 commented 7 months ago

Hi @oobabooga, we have just released quantized models on Hugging Face (including Llama-2 70B). Check the README.md for details. Please refer to the notebooks (for streaming or generation) for examples of how to use them. Hope this helps!
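
For readers who want a quick start before opening the notebooks, a minimal sketch of loading one of the released AQLM models through transformers might look like the following. The Hugging Face repo id, dtype, and device settings here are assumptions, not the authors' exact recipe; check the README for the actual model names and requirements (including installing the aqlm package).

```python
# Minimal sketch: generating text with an AQLM-quantized model via transformers.
# The model id below is an assumption; use the repo ids listed in the README.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-70b-AQLM-2Bit-1x16-hf"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",       # spread layers across available GPUs
    trust_remote_code=True,  # in case the release ships custom modeling code
)

prompt = "Additive quantization compresses weights by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
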

oobabooga commented 7 months ago

Thanks @Vahe1994, that's very helpful. I'll try to test the models later.