NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

FP8 FMHA cannot be enabled on Pre-Hopper Arch in L40? #1864

Closed activezhao closed 4 days ago

activezhao commented 4 days ago

System Info

CPU x86_64

GPU NVIDIA L40

TensorRT-LLM branch: v0.10.0

CUDA: NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2

Who can help?

@Tracin

Reproduction

I want to use KV cache reuse and chunked context, so I use the following commands:

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-6.7b-online-v2.1 \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir /data/trt-v10-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
                                   --calib_size 512 \
                                   --tp_size 2

# Build TensorRT-LLM engines from the quantized checkpoint
trtllm-build --checkpoint_dir /data/trt-v10-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
             --output_dir /data/trt-v10-engines-deepseek6.7b-online-v2.1-2gpu-fp8-bz32/2-gpu \
             --max_input_len 8192 \
             --max_output_len 1024 \
             --gemm_plugin float16 \
             --strongly_typed \
             --gpt_attention_plugin float16 \
             --max_batch_size 8 \
             --paged_kv_cache enable \
             --max_num_tokens 128 \
             --use_paged_context_fmha enable \
             --use_fp8_context_fmha enable \
             --workers 2

Expected behavior

The commands should run successfully.

Actual behavior

I got the following error:

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled on Pre-Hopper Arch. (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:462)
1       0x7fdc83c232c3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x572c3) [0x7fdc83c232c3]
2       0x7fdc83c237e0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x577e0) [0x7fdc83c237e0]
3       0x7fdc83ca4f60 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(int, int, int, int, int, int, int, float, tensorrt_llm::kernels::PositionEmbeddingType, int, float, tensorrt_llm::kernels::RotaryScalingType, float, float, int, int, int, bool, tensorrt_llm::kernels::ContextFMHAType, bool, bool, int, bool, tensorrt_llm::kernels::AttentionMaskType, bool, int, nvinfer1::DataType, int, bool, bool, int, bool, bool, bool, bool, bool, bool) + 256
4       0x7fdc83ca5bc2 tensorrt_llm::plugins::GPTAttentionPluginCreator::createPlugin(char const*, nvinfer1::PluginFieldCollection const*) + 3090
5       0x7fde1515c62a /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x15c62a) [0x7fde1515c62a]
6       0x7fde150457a3 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x457a3) [0x7fde150457a3]
7       0x556000c3f10e /usr/bin/python3(+0x15a10e) [0x556000c3f10e]

I am using an L40, so FP8 FMHA cannot be enabled on Ada?

Additional notes

I hope there is a way to solve this.

Thanks.
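
For reference, the assertion message points at the GPT attention plugin rejecting FP8 FMHA on pre-Hopper architectures; the L40 is Ada and reports compute capability 8.9, while Hopper is 9.0. A minimal check of the build machine's architecture (this is only a sketch; the compute_cap query assumes a reasonably recent driver):

# Report the GPU name and compute capability; Hopper is 9.0, L40 (Ada) is 8.9.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Alternatively, via the PyTorch bundled with TensorRT-LLM:
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"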

nv-guomingz commented 4 days ago

Yes, we only enable FP8 FMHA for Hopper (SM90) at the moment. cc @PerkzZheng for visibility

activezhao commented 4 days ago

> Yes, we only enable FP8 FMHA for Hopper (SM90) at the moment. cc @PerkzZheng for visibility

@nv-guomingz OK, got it.

Is there any plan to support the Ada architecture?

We really want to use the KV cache reuse feature.

Thanks so much!

PerkzZheng commented 4 days ago

@activezhao yes, this is on our roadmap, but there is no concrete date. We will update here if we have any progress. Note that there are potential accuracy concerns with FP8 FMHA, so I would suggest trying it on Hopper first.
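
For anyone hitting the same assertion on Ada, a possible workaround sketch is to keep paged context FMHA (which is what KV cache reuse and chunked context rely on) but drop the FP8 context FMHA flag. This is untested and assumes the FP8 KV cache is still accepted without FP8 context FMHA; if the build still fails, re-running quantize.py without --kv_cache_dtype fp8 would be the next thing to try.

# Same build as above, but without --use_fp8_context_fmha (untested on L40 / Ada).
trtllm-build --checkpoint_dir /data/trt-v10-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
             --output_dir /data/trt-v10-engines-deepseek6.7b-online-v2.1-2gpu-fp8-bz32/2-gpu \
             --max_input_len 8192 \
             --max_output_len 1024 \
             --gemm_plugin float16 \
             --strongly_typed \
             --gpt_attention_plugin float16 \
             --max_batch_size 8 \
             --paged_kv_cache enable \
             --max_num_tokens 128 \
             --use_paged_context_fmha enable \
             --workers 2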

nv-guomingz commented 4 days ago

Thanks @PerkzZheng. @activezhao, could we close this ticket now?

activezhao commented 4 days ago

> @activezhao yes, this is on our roadmap, but there is no concrete date. We will update here if we have any progress. Note that there are potential accuracy concerns with FP8 FMHA, so I would suggest trying it on Hopper first.

@PerkzZheng OK, thanks.

activezhao commented 4 days ago

> Thanks @PerkzZheng. @activezhao, could we close this ticket now?

@nv-guomingz Of course, please close it.

Thanks.