NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Do NVIDIA L20 GPUs support FP8 quantization? #1914

Open jinweida opened 2 weeks ago

jinweida commented 2 weeks ago

System Info

CPU architecture: x86_64
Host RAM: 1 TB
GPU: 2x L20 SXM
Container: manually built from Dockerfile.trt_llm_backend with TRT 9.3
TensorRT-LLM version: 0.12.0.dev2024070200
Driver Version: 550.54.15
CUDA Version: 12.4
OS: Ubuntu 22.04


[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: Fp8 FMHA cannot be enabled on pre-Hopper Arch.

CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --use_fp8_context_fmha enable \
    --gemm_plugin bfloat16

ERROR:

(build error screenshot)

Who can help?

No response

Reproduction

CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --use_fp8_context_fmha enable \
    --gemm_plugin bfloat16

QiJune commented 2 weeks ago

@Tracin could you please have a look? Thanks

geraldstanje1 commented 1 week ago

Why do you use the gemm_plugin with bfloat16 and not fp8?

They also mention disabling it for FP8 builds (see the sketch below): https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html#gemm-plugin

Also, where did you see that the NVIDIA L20 GPU supports FP8?
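
On the GEMM plugin point: a minimal sketch of what "disabling it" would look like for this FP8 build, assuming the checkpoint and output paths from the commands above, and that leaving out --gemm_plugin keeps the plugin disabled (the default in the versions I have used). The --use_fp8_context_fmha flag is also dropped here, since it is what triggers the assertion shown above:

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/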

Tracin commented 1 week ago

@jinweida It looks like FP8 FMHA is not supported on L20; please remove --use_fp8_context_fmha enable from your command.
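
For reference, with only that flag removed, the build command from the report would look roughly like this:

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/ \
    --gemm_plugin bfloat16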

geraldstanje1 commented 1 week ago

@Tracin Where did you see that the NVIDIA L20 GPU supports FP8? He also uses --qformat fp8.

spec: (L20 datasheet screenshot)

Edit: just saw that it does list FP8...

jinweida commented 1 week ago

@Tracin The dealer says the L20 GPU supports FP8.

Tracin commented 1 week ago

@jinweida Yeah, I mean you can still use FP8 GEMM on L20 if you remove --use_fp8_context_fmha enable. FP8 FMHA is a new feature and does not cover L20 for now.

jinweida commented 1 week ago

How do I accelerate FP8 on L20? @Tracin

Tracin commented 1 week ago

How do I accelerate FP8 on L20? @Tracin

If you mean accelerating an LLM on L20 with FP8 GEMM, you are already doing it the correct way.

python ../quantization/quantize.py \
    --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
    --dtype bfloat16 \
    --qformat fp8 \
    --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
    --output_dir ./quantized_fp8 \
    --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build --checkpoint_dir ./quantized_fp8/ \
    --output_dir ./quantized_fp8-1-gpu/

byshiue commented 5 days ago

FP8 FMHA support on SM89 (L20) is ongoing, so for now you can only enable FP8 GEMM on L20.
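
For context on the architecture gating: L20 is Ada (compute capability 8.9, i.e. SM89), while the current FP8 FMHA path requires Hopper (9.0, SM90). A quick way to confirm what your driver reports, assuming it is new enough to expose the compute_cap query field:

# Prints GPU name and compute capability; expect 8.9 for L20 (Ada),
# 9.0 for Hopper parts such as H100.
nvidia-smi --query-gpu=name,compute_cap --format=csv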