NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Enough VRAM to run a model, but not enough to quantize #1849

Open oobabooga opened 6 days ago

oobabooga commented 6 days ago

On my system, I have enough VRAM (72 GB) to run Llama-3-70B in 4-bit or 8-bit precision. However, I am unable to quantize this model to either 4-bit or 8-bit precision using the scripts in TensorRT-LLM due to "CUDA out of memory" errors, like:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.96 GiB. GPU 0 has a total capacity of 47.51 GiB of which 1.57 GiB is free.

For reference, with llama.cpp, ExLlamaV2, and AutoGPTQ I am able to successfully quantize the 16-bit NousResearch/Meta-Llama-3-70B-Instruct model on my system.

I am not sure if I am doing something wrong, or if this is just a limitation of TensorRT-LLM. I suspect it's the latter, as the documentation says the following for FP8 quantization:

The peak GPU memory consumption when doing FP8 quantization is more than 210GB (there is also some activation memory occupation when doing calibration). So you need a node with at least 4 H100 (A100) GPUs to run the quantization command. After quantization, 2 GPUs are enough for building and running.
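
For a rough sense of scale (my own back-of-the-envelope estimate, not a figure from the docs), the FP16 weights of a 70B-parameter model alone take about 140 GB before any calibration activations are counted:

# Back-of-the-envelope estimate only; assumes 70e9 parameters at 2 bytes each (FP16).
params = 70e9
fp16_weights_gb = params * 2 / 1e9
print(f"FP16 weights alone: ~{fp16_weights_gb:.0f} GB")  # ~140 GB
# Calibration keeps activations (and possibly extra tensor copies) resident on top
# of this, which is presumably how the documented peak exceeds 210 GB.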

Is this something that can be improved? This requirement makes TensorRT-LLM prohibitively hard to use on consumer hardware.

Reproduction

This is what I have tried. The scripts are based on the examples in examples/llama/README.md.

In all cases, I get this message saying that xformers is not available (not sure if that makes a difference):

WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 2.0.1+cu118 with CUDA 1108 (you have 2.2.2+cu121) Python 3.10.11 (you have 3.10.14)

AWQ

#!/bin/bash

CHECKPOINT_DIR=/home/user/text-generation-webui/models/NousResearch_Meta-Llama-3-70B-Instruct

cd /home/user/text-generation-webui/TensorRT-LLM

python examples/quantization/quantize.py \
    --model_dir $CHECKPOINT_DIR \
    --dtype float16 \
    --qformat int4_awq \
    --awq_block_size 128 \
    --output_dir $(basename "$CHECKPOINT_DIR")_awq_checkpoint \
    --calib_size 32

trtllm-build \
    --checkpoint_dir $(basename "$CHECKPOINT_DIR")_awq_checkpoint \
    --output_dir ${CHECKPOINT_DIR}_AWQ_TensorRT \
    --gemm_plugin auto

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${CHECKPOINT_DIR}_AWQ_TensorRT
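
If the quantization step succeeded, I would then load the engine with the example runner, roughly like this (prompt and output length are placeholders; flag names follow the repo's examples/run.py as I understand them):

# Sketch only, assuming the build above succeeds.
python examples/run.py \
    --engine_dir ${CHECKPOINT_DIR}_AWQ_TensorRT \
    --tokenizer_dir $CHECKPOINT_DIR \
    --input_text "Hello, how are you?" \
    --max_output_len 64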

SmoothQuant

#!/bin/bash

CHECKPOINT_DIR=/home/user/text-generation-webui/models/NousResearch_Meta-Llama-3-70B-Instruct

cd /home/user/text-generation-webui/TensorRT-LLM/

python examples/llama/convert_checkpoint.py \
    --model_dir $CHECKPOINT_DIR \
    --output_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --dtype float16 \
    --smoothquant 0.5

trtllm-build --checkpoint_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --output_dir ${CHECKPOINT_DIR}_TensorRT \
    --gemm_plugin auto

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${CHECKPOINT_DIR}_TensorRT
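
For reference, the --smoothquant 0.5 value above is the migration strength α from the SmoothQuant paper. The per-channel smoothing it describes looks roughly like the sketch below, shown only as an illustration of the technique, not as TensorRT-LLM's internal code:

import numpy as np

# Illustration of SmoothQuant's per-channel smoothing with alpha = 0.5 (toy shapes).
alpha = 0.5
X = np.random.randn(16, 4096)    # activations, [tokens, in_channels]
W = np.random.randn(4096, 4096)  # weights, [in_channels, out_channels]

# s_j = max|X_j|^alpha / max|W_j|^(1 - alpha), one scale per input channel
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_smooth = X / s             # activations become easier to quantize
W_smooth = W * s[:, None]    # weights absorb the scaling difficulty
# X_smooth @ W_smooth equals X @ W up to floating point, so the layer output is unchanged.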

FP8

#!/bin/bash

CHECKPOINT_DIR=/home/user/text-generation-webui/models/NousResearch_Meta-Llama-3-70B-Instruct

cd /home/user/text-generation-webui/TensorRT-LLM/

python examples/quantization/quantize.py \
    --model_dir $CHECKPOINT_DIR \
    --dtype float16 \
    --qformat fp8 \
    --kv_cache_dtype fp8 \
    --output_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --calib_size 512 \
    --tp_size 2

trtllm-build \
    --checkpoint_dir $(basename "$CHECKPOINT_DIR")_checkpoint \
    --output_dir ${CHECKPOINT_DIR}_TensorRT \
    --gemm_plugin auto \
    --workers 2

# Copy the tokenizer files
cp ${CHECKPOINT_DIR}/{tokenizer*,special_tokens_map.json} ${CHECKPOINT_DIR}_TensorRT
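
Since the checkpoint is converted with --tp_size 2, running the resulting engine needs two ranks; following the multi-GPU examples in the repo, I would expect the invocation to look roughly like this (prompt and output length are placeholders):

# Sketch only, assuming the build above succeeds; requires both GPUs visible.
mpirun -n 2 --allow-run-as-root \
    python examples/run.py \
        --engine_dir ${CHECKPOINT_DIR}_TensorRT \
        --tokenizer_dir $CHECKPOINT_DIR \
        --input_text "Hello, how are you?" \
        --max_output_len 64
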
nv-guomingz commented 5 days ago

If you want to quantize Llama 70B on a single device with 72GB of RAM (may I know the exact SKU of this GPU, e.g. A100 or something else?), I suggest you add --device cpu to offload the model to CPU memory.

oobabooga commented 5 days ago

I have two GPUs totalling 72 GB:

Regarding --device cpu: that is not a recognized flag for the Python scripts that fail with CUDA out of memory, which are:

Where should this flag be used?