NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

FP8 FMHA cannot be enabled on Pre-Hopper Arch in L40? #1864

Closed activezhao closed 4 days ago

activezhao commented 4 days ago

System Info

CPU x86_64

GPU NVIDIA L40

TensorRT-LLM branch: v0.10.0

CUDA: NVIDIA-SMI 535.161.07 Driver Version: 535.161.07 CUDA Version: 12.2

Who can help?

@Tracin

Reproduction

I want to use KV cache reuse and chunked context, so I use the following commands:

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-6.7b-online-v2.1 \
                                   --dtype float16 \
                                   --qformat fp8 \
                                   --kv_cache_dtype fp8 \
                                   --output_dir /data/trt-v10-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
                                   --calib_size 512 \
                                   --tp_size 2

# Build TensorRT-LLM engines from the quantized checkpoint
trtllm-build --checkpoint_dir /data/trt-v10-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
             --output_dir /data/trt-v10-engines-deepseek6.7b-online-v2.1-2gpu-fp8-bz32/2-gpu \
             --max_input_len 8192 \
             --max_output_len 1024 \
             --gemm_plugin float16 \
             --strongly_typed \
             --gpt_attention_plugin float16 \
             --max_batch_size 8 \
             --paged_kv_cache enable \
             --max_num_tokens 128 \
             --use_paged_context_fmha enable \
             --use_fp8_context_fmha enable \
             --workers 2

Expected behavior

The commands should run successfully.

Actual behavior

I got the following error:

[TensorRT-LLM][ERROR] tensorrt_llm::common::TllmException: [TensorRT-LLM][ERROR] Assertion failed: FP8 FMHA cannot be enabled on Pre-Hopper Arch. (/app/tensorrt_llm/cpp/tensorrt_llm/plugins/gptAttentionCommon/gptAttentionCommon.cpp:462)
1       0x7fdc83c232c3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x572c3) [0x7fdc83c232c3]
2       0x7fdc83c237e0 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x577e0) [0x7fdc83c237e0]
3       0x7fdc83ca4f60 tensorrt_llm::plugins::GPTAttentionPlugin::GPTAttentionPlugin(int, int, int, int, int, int, int, float, tensorrt_llm::kernels::PositionEmbeddingType, int, float, tensorrt_llm::kernels::RotaryScalingType, float, float, int, int, int, bool, tensorrt_llm::kernels::ContextFMHAType, bool, bool, int, bool, tensorrt_llm::kernels::AttentionMaskType, bool, int, nvinfer1::DataType, int, bool, bool, int, bool, bool, bool, bool, bool, bool) + 256
4       0x7fdc83ca5bc2 tensorrt_llm::plugins::GPTAttentionPluginCreator::createPlugin(char const*, nvinfer1::PluginFieldCollection const*) + 3090
5       0x7fde1515c62a /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x15c62a) [0x7fde1515c62a]
6       0x7fde150457a3 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x457a3) [0x7fde150457a3]
7       0x556000c3f10e /usr/bin/python3(+0x15a10e) [0x556000c3f10e]

I am using an L40, so FP8 FMHA cannot be enabled on Ada?

Additional notes

I hope there is a way to solve this.

Thanks.
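
For reference, the assertion message points at the GPT attention plugin rejecting FP8 FMHA on pre-Hopper architectures; the L40 is Ada and reports compute capability 8.9, while Hopper is 9.0. A minimal check of the build machine's architecture (this is only a sketch; the compute_cap query assumes a reasonably recent driver):

# Report the GPU name and compute capability; Hopper is 9.0, L40 (Ada) is 8.9.
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Alternatively, via the PyTorch bundled with TensorRT-LLM:
python3 -c "import torch; print(torch.cuda.get_device_capability(0))"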

nv-guomingz commented 4 days ago

Yes, we only enable FP8 FMHA for Hopper (SM90) at the moment. cc @PerkzZheng for visibility

activezhao commented 4 days ago

> Yes, we only enable FP8 FMHA for Hopper (SM90) at the moment. cc @PerkzZheng for visibility

@nv-guomingz OK, got it.

Is there any plan to support the Ada architecture?

We really want to use the KV cache reuse feature.

Thanks so much!

PerkzZheng commented 4 days ago

@activezhao yes, this is on our roadmap, but there is no concrete date. We will update here if we have any progress. Note that there are potential accuracy concerns with FP8 FMHA, so I would suggest trying it on Hopper first.
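
For anyone hitting the same assertion on Ada, a possible workaround sketch is to keep paged context FMHA (which is what KV cache reuse and chunked context rely on) but drop the FP8 context FMHA flag. This is untested and assumes the FP8 KV cache is still accepted without FP8 context FMHA; if the build still fails, re-running quantize.py without --kv_cache_dtype fp8 would be the next thing to try.

# Same build as above, but without --use_fp8_context_fmha (untested on L40 / Ada).
trtllm-build --checkpoint_dir /data/trt-v10-deepseek6.7b-online-v2.1-2gpu-fp8-bz32 \
             --output_dir /data/trt-v10-engines-deepseek6.7b-online-v2.1-2gpu-fp8-bz32/2-gpu \
             --max_input_len 8192 \
             --max_output_len 1024 \
             --gemm_plugin float16 \
             --strongly_typed \
             --gpt_attention_plugin float16 \
             --max_batch_size 8 \
             --paged_kv_cache enable \
             --max_num_tokens 128 \
             --use_paged_context_fmha enable \
             --workers 2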

nv-guomingz commented 4 days ago

Thanks @PerkzZheng. @activezhao, could we close this ticket now?

activezhao commented 4 days ago

> @activezhao yes, this is on our roadmap, but there is no concrete date. We will update here if we have any progress. Note that there are potential accuracy concerns with FP8 FMHA, so I would suggest trying it on Hopper first.

@PerkzZheng OK, thanks.

activezhao commented 4 days ago

> Thanks @PerkzZheng. @activezhao, could we close this ticket now?

@nv-guomingz Of course, please close it.

Thanks.