Open · jinweida opened this issue 2 weeks ago
@Tracin could you please have a look? Thanks
Why do you use --gemm_plugin with bfloat16 and not fp8?
Also, the docs mention you should disable it: https://nvidia.github.io/TensorRT-LLM/performance/perf-best-practices.html#gemm-plugin
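For reference, a build without the GEMM plugin would look roughly like this (a minimal sketch reusing the paths from this issue; as far as I know, the plugin stays disabled in this TensorRT-LLM version when --gemm_plugin is simply omitted):

```bash
# Sketch: build without the GEMM plugin, per the perf-best-practices guide.
# Paths are the ones used elsewhere in this issue.
CUDA_VISIBLE_DEVICES=0,1 trtllm-build \
  --checkpoint_dir ./quantized_fp8/ \
  --output_dir ./quantized_fp8-1-gpu/
```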
Also, where did you see that the NVIDIA L20 GPU supports fp8?
@jinweida Looks like FP8 FMHA cannot be supported on L20. Please remove --use_fp8_context_fmha enable from your command.
@Tracin Where did you see that the NVIDIA L20 GPU supports fp8? He also uses --qformat fp8.
spec:
Edit: just saw there is fp8...
@Tracin The dealer says the L20 GPU supports fp8.
@jinweida Yeah, I mean you can still use FP8 GEMM on L20 if you remove --use_fp8_context_fmha enable from your command. FP8 FMHA is a new feature and does not cover L20 for now.
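Concretely, that would be the build command from the reproduction with only the FMHA flag dropped (a sketch; same checkpoint and output paths as below):

```bash
# Same build as in the reproduction, minus --use_fp8_context_fmha enable,
# so only FP8 GEMM is used on L20 (SM89).
CUDA_VISIBLE_DEVICES=0,1 trtllm-build \
  --checkpoint_dir ./quantized_fp8/ \
  --output_dir ./quantized_fp8-1-gpu/ \
  --gemm_plugin bfloat16
```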
How do I accelerate fp8 with L20? @Tracin
> How do I accelerate fp8 with L20? @Tracin
If you mean accelerating an LLM on L20 with FP8 GEMM, you are already doing it the correct way:
```bash
python ../quantization/quantize.py \
  --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
  --dtype bfloat16 \
  --qformat fp8 \
  --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
  --output_dir ./quantized_fp8 \
  --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build \
  --checkpoint_dir ./quantized_fp8/ \
  --output_dir ./quantized_fp8-1-gpu/
```
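A quick way to sanity-check the resulting engine is the example runner shipped in the TensorRT-LLM examples folder (a sketch; the tokenizer path is assumed to be the original ModelScope model directory):

```bash
# Smoke-test the built FP8 engine with the standard example runner.
# --tokenizer_dir is assumed to point at the original model checkout.
python ../run.py \
  --engine_dir ./quantized_fp8-1-gpu/ \
  --tokenizer_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
  --input_text "What is FP8?" \
  --max_output_len 64
```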
FP8 FMHA on SM89 (L20) is ongoing work, so you can only enable FP8 GEMM on L20 for now.
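To confirm which architecture a GPU reports (FP8 FMHA currently requires SM90/Hopper, while L20 is SM89/Ada), recent nvidia-smi builds can print the compute capability directly:

```bash
# 8.9 = Ada (L20), 9.0 = Hopper (H100/H200).
nvidia-smi --query-gpu=name,compute_cap --format=csv
```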
System Info
- CPU architecture: x86_64
- Host RAM: 1 TB
- GPU: 2x L20 SXM
- Container: manually built with TRT 9.3 (Dockerfile.trt_llm_backend)
- TensorRT-LLM version: 0.12.0.dev2024070200
- Driver Version: 550.54.15
- CUDA Version: 12.4
- OS: Ubuntu 22.04
```bash
CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
  --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
  --dtype bfloat16 \
  --qformat fp8 \
  --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
  --output_dir ./quantized_fp8 \
  --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build \
  --checkpoint_dir ./quantized_fp8/ \
  --output_dir ./quantized_fp8-1-gpu/ \
  --use_fp8_context_fmha enable \
  --gemm_plugin bfloat16
```

ERROR:

```
[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled on pre-Hopper Arch.
```
Who can help?
No response
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
```bash
CUDA_VISIBLE_DEVICES=0,1 python ../quantization/quantize.py \
  --model_dir /nvme0/hub/modelscope/baichuan-inc/Baichuan2-7B-Chat \
  --dtype bfloat16 \
  --qformat fp8 \
  --calib_dataset /nvme0/ai/fp8/TensorRT-LLM/cnn_dailymail \
  --output_dir ./quantized_fp8 \
  --calib_size 256

CUDA_VISIBLE_DEVICES=0,1 trtllm-build \
  --checkpoint_dir ./quantized_fp8/ \
  --output_dir ./quantized_fp8-1-gpu/ \
  --use_fp8_context_fmha enable \
  --gemm_plugin bfloat16
```