OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Quant script for large models like 180B and 70B models? #10

Closed yhyu13 closed 11 months ago

yhyu13 commented 11 months ago

Hi, could you please provide the scripts and hardware specs needed to quantize these super large models?

Does OmniQuant keep the model mostly in RAM and only evaluate certain layers in VRAM, so as to lower the GPU requirement for the average user?

ChenMnZ commented 11 months ago

Scripts can be found at https://github.com/OpenGVLab/OmniQuant/tree/main/scripts.

To train the quantized parameters successfully, llama-2-70b requires an A100-40G and falcon-180b requires an A100-80G.

Yes. At first, the entire model is loaded into RAM, and only the layer currently being trained is transferred into VRAM. This scheme decreases the peak VRAM requirement.
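For intuition, here is a minimal PyTorch sketch of that block-wise offloading pattern. This is not OmniQuant's actual code: the layer-list name and forward signature follow Hugging Face LLaMA conventions, and the quantization step itself is elided.

```python
import torch

def quantize_blockwise(model, hidden, device="cuda"):
    """Sketch: the full model stays in CPU RAM; one transformer block
    at a time is moved into VRAM, processed, then evicted."""
    for block in model.model.layers:   # decoder block list (LLaMA naming)
        block.to(device)               # bring ONE block into VRAM
        hidden = hidden.to(device)

        # OmniQuant would train this block's learnable quantization
        # parameters here; this sketch only propagates the calibration
        # activations to the next block.
        with torch.no_grad():
            hidden = block(hidden)[0]  # decoder layers return a tuple

        block.to("cpu")                # evict the block back to RAM
        torch.cuda.empty_cache()       # release the freed VRAM
    return hidden
```

Because only one block plus its activations ever lives on the GPU, peak VRAM scales with the block size rather than the full model size.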

yhyu13 commented 11 months ago

That's great, I will try this out sometime.

Another question: CUDA_VISIBLE_DEVICES=0 is set in all the scripts. Does that mean model distribution is not supported at the moment?

I am using dual 3090s with 24G VRAM each. Some time ago I was quantizing GPTQ models with the GPTQ-for-LLaMA repo, which supports model distribution as well as RAM offloading. It is always nice to have model distribution.

ChenMnZ commented 11 months ago

In the training process, we load one block onto a single GPU, without model distribution. However, you can use --multigpu to enable model distribution in the evaluation process.
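For example, a sketch of the two runs. The --multigpu flag is the one mentioned above; the remaining flags follow the repo's example scripts, so treat exact flag names and paths as placeholders.

```sh
# Training: one block at a time on a single GPU, no model distribution.
CUDA_VISIBLE_DEVICES=0 python main.py \
  --model /path/to/llama-2-70b --wbits 4 --abits 16 --lwc \
  --output_dir ./log/llama-2-70b-w4a16

# Evaluation: shard the quantized model across both GPUs.
CUDA_VISIBLE_DEVICES=0,1 python main.py \
  --model /path/to/llama-2-70b --wbits 4 --abits 16 --lwc \
  --eval_ppl --multigpu
```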

Thank you for your suggestion; we will consider how to add model distribution and RAM offloading.