oobabooga opened this issue 6 days ago
If you want to quantize Llama 70B on a single device with 72 GB of memory (may I know the exact SKU of this GPU, e.g. A100 or something else?), I suggest adding `--device cpu` to offload the model to CPU memory.
I have two GPUs totalling 72 GB:
About `--device cpu`: this is not a recognized flag for the Python scripts that fail due to CUDA out of memory, which are:

- `examples/quantization/quantize.py` (used for AWQ and FP8)
- `examples/llama/convert_checkpoint.py` (used for Smoothquant)

Where should this flag be used?
On my system, I have enough VRAM (72 GB) to run Llama-3-70B in 4-bit or 8-bit precision. However, I am unable to quantize this model to either 4-bit or 8-bit precision using the scripts in TensorRT-LLM due to "CUDA out of memory" errors, like:
For reference, with llama.cpp, ExLlamaV2, and AutoGPTQ I am able to successfully quantize the 16-bit NousResearch/Meta-Llama-3-70B-Instruct model on my system.
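For illustration, the llama.cpp route is roughly the following (script and binary names vary between llama.cpp versions, so treat this as a sketch rather than the exact commands):

```bash
# Convert the HF checkpoint to GGUF, then quantize it to 4-bit.
python convert-hf-to-gguf.py ./NousResearch/Meta-Llama-3-70B-Instruct --outfile llama-3-70b-f16.gguf
./llama-quantize llama-3-70b-f16.gguf llama-3-70b-Q4_K_M.gguf Q4_K_M
```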
I am not sure if I am doing something wrong, or if this is just a limitation of TensorRT-LLM. I suspect it's the latter, as the documentation says the following for FP8 quantization:
Is this something that can be improved? As it stands, it makes TensorRT-LLM prohibitive to use on consumer hardware.
Reproduction
This is what I have tried. The scripts are based on the examples found in examples/llama/README.md.
In all cases, I get this message saying that xformers is not available (not sure if that makes a difference):
AWQ
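A representative invocation, following the int4_awq recipe from the quantization example (model and output paths below are placeholders, and exact flags may differ between TensorRT-LLM versions):

```bash
# INT4-AWQ quantization of Llama-3-70B; this is the step that fails with CUDA OOM.
python examples/quantization/quantize.py \
    --model_dir ./NousResearch/Meta-Llama-3-70B-Instruct \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir ./tllm_checkpoint_70b_awq \
    --calib_size 32
```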
Smoothquant
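Along the lines of the SmoothQuant recipe in examples/llama/README.md (paths and the alpha value are placeholders; flags may vary by version):

```bash
# INT8 SmoothQuant conversion with per-token/per-channel scaling, as in the README example.
python examples/llama/convert_checkpoint.py \
    --model_dir ./NousResearch/Meta-Llama-3-70B-Instruct \
    --output_dir ./tllm_checkpoint_70b_sq \
    --dtype float16 \
    --smoothquant 0.5 \
    --per_token \
    --per_channel
```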
FP8
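Roughly the FP8 command from the quantization example (paths are placeholders; the calibration size and other flags may differ by version):

```bash
# FP8 post-training quantization with an FP8 KV cache.
python examples/quantization/quantize.py \
    --model_dir ./NousResearch/Meta-Llama-3-70B-Instruct \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir ./tllm_checkpoint_70b_fp8 \
    --calib_size 512
```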