Hi @byshiue,
I want to quantize a Llama model with a long sequence length (120K+), but an OOM error is raised. I hoped to work around the OOM by using multiple GPUs when quantizing the model in convert_checkpoint.py, yet I hit the same OOM error, and I am sure the total memory of these GPUs is enough for quantizing the model.
So I checked GPU utilization and found that only one GPU is working; the other GPUs stay idle, with memory usage around 0%.
Does TRT-LLM currently not support quantizing a model with multiple GPUs? How can I solve this OOM error when quantizing the model?
Any help will be appreciated. The script is as follows:
```bash
python convert_checkpoint.py --model_dir ./weights/8B_128K/ \
    --output_dir ./outputs/checkpoints/8B_128K/ \
    --dtype float16 \
    --int8_kv_cache \
    --rotary_base 50000000
```
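For reference, this is the kind of multi-GPU loading I was expecting: sharding the Hugging Face checkpoint across all visible GPUs so the calibration forward passes do not have to fit on one device. The sketch below is only an illustration of that idea using transformers with Accelerate's `device_map="auto"`; it is my assumption, not necessarily how convert_checkpoint.py loads the model internally.

```python
# Sketch only (my assumption of how multi-GPU loading could look, not
# necessarily what convert_checkpoint.py does internally): shard the HF
# checkpoint across every visible GPU so a single device does not have to
# hold the full model during INT8 KV-cache calibration.
# Requires `pip install accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "./weights/8B_128K/"  # same path passed to convert_checkpoint.py

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,  # matches --dtype float16 above
    device_map="auto",          # Accelerate spreads layers across all GPUs
)

# If sharding works, the device map should list more than one cuda device.
print(model.hf_device_map)
```

With a layout like this, nvidia-smi should show nonzero memory usage on every GPU, instead of only one GPU doing all the work as I observed.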