System Info
- AWS p5 (4 x 80GB H100 GPUs)
- TensorRT-LLM v0.11.0
Who can help?
@byshiue @Tracin
Information

Tasks

- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
```shell
python ./quantize.py --model_dir ./Meta-Llama-3-70B-Instruct \
    --dtype bfloat16 \
    --output_dir ./Meta-Llama-3-70B-Instruct_fp8 \
    --calib_size 1024 \
    --calib_dataset /home/triton-server/calibration \
    --tp_size 4 \
    --qformat fp8

trtllm-build --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 \
    --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine_fmha \
    --gemm_plugin auto \
    --workers 1 \
    --use_paged_context_fmha enable \
    --use_fp8_context_fmha enable \
    --max_batch_size 16
```
Expected behavior
Engine is created successfully.
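With a successful build, I would then expect to be able to run the engine with the example runner, roughly like this (a sketch, not what I actually ran: paths are taken from the commands above, the prompt and output length are placeholders, and `tp_size 4` implies four MPI ranks):

```shell
# Hypothetical sanity check, assuming the build had succeeded. run.py is the
# example runner from the examples/ folder; the prompt is a placeholder, and
# four ranks are needed because the checkpoint was built with --tp_size 4.
mpirun -n 4 --allow-run-as-root \
    python3 ../run.py \
        --engine_dir ./Meta-Llama-3-70B-Instruct_fp8_engine_fmha \
        --tokenizer_dir ./Meta-Llama-3-70B-Instruct \
        --input_text "Hello, my name is" \
        --max_output_len 32
```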
actual behavior
The engine build fails with an error.
additional notes
The engine build runs fine when I don't include `--use_paged_context_fmha enable --use_fp8_context_fmha enable` when running `trtllm-build` (see the working command below).
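Concretely, this variant of the build command completes without errors; it is the same build minus the two FMHA flags (the output directory is renamed here only to keep the two builds separate):

```shell
# Same build as in the reproduction, with --use_paged_context_fmha and
# --use_fp8_context_fmha dropped; this one succeeds.
trtllm-build --checkpoint_dir ./Meta-Llama-3-70B-Instruct_fp8 \
    --output_dir ./Meta-Llama-3-70B-Instruct_fp8_engine \
    --gemm_plugin auto \
    --workers 1 \
    --max_batch_size 16
```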