NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[Question] How to solve OOM problem when quantizing model with trt-llm? #1546

Open 1649759610 opened 6 months ago

1649759610 commented 6 months ago

Hi,

@byshiue

I want to quantize a LLaMA model with a long sequence length (120K+), but an OOM error is raised. I am hoping to avoid the OOM problem by using multiple GPUs when quantizing the model in convert_checkpoint.py.

However, I encountered the same OOM problem even with multiple GPUs available, and I am sure that the total memory of these GPUs is enough for quantizing the model.

So I checked the GPU usage and found that only one GPU is active; the other GPUs are idle, with memory usage at roughly 0%.
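
A simple way to watch per-GPU memory while the conversion runs is to poll nvidia-smi (a generic diagnostic, independent of TensorRT-LLM):

```bash
# Print per-GPU memory usage once per second while convert_checkpoint.py runs;
# if only GPU 0 ever shows significant memory.used, the conversion is single-GPU.
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 1
```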

Does TensorRT-LLM currently not support quantizing a model across multiple GPUs? How can I solve this OOM error when quantizing the model?

Any help will be appreciated. The script is as follows:

```bash
python convert_checkpoint.py --model_dir ./weights/8B_128K/ \
    --output_dir ./outputs/checkpoints/8B_128K/ \
    --dtype float16 \
    --int8_kv_cache \
    --rotary_base 50000000
```
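
One possible mitigation, sketched below under the assumption that the LLaMA example's convert_checkpoint.py in your TensorRT-LLM version exposes the --load_model_on_cpu and --tp_size flags (check `python convert_checkpoint.py --help`), is to keep the Hugging Face weights in host RAM and shard the converted checkpoint across GPUs:

```bash
# Sketch, not a verified fix: --load_model_on_cpu (if supported by your version)
# keeps the source weights in host RAM instead of GPU memory during conversion,
# and --tp_size 2 shards the converted checkpoint for 2-way tensor parallelism.
python convert_checkpoint.py --model_dir ./weights/8B_128K/ \
    --output_dir ./outputs/checkpoints/8B_128K/ \
    --dtype float16 \
    --int8_kv_cache \
    --rotary_base 50000000 \
    --tp_size 2 \
    --load_model_on_cpu
```

Note that tensor parallelism shards the converted checkpoint, not necessarily the calibration pass used for --int8_kv_cache, so this alone may not resolve the OOM.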

nv-guomingz commented 5 months ago

Hi @1649759610, could you please share your OOM logs with us?