NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

attention fp8 compute type #1921

Closed · enozhu closed this issue 1 week ago

enozhu commented 1 month ago

When we use the FP8 data type, we found that the FFN GEMM and the attention projection support real FP8 compute (this is supported on H20 and L20), but the Q@K^T and softmax·V multiplies in attention don't support FP8 compute; the FP8 values first need to be dequantized to FP16/BF16. Why?

QiJune commented 1 month ago

@Tracin Could you please have a look? Thanks

Tracin commented 1 month ago

@enozhu Because we do not implement FP8 FMHA for architectures before Hopper. So I think H20, which is a Hopper GPU, can support attention with FP8 computation.

unbelievable3513 commented 1 month ago

When enabling use_fp8_context_fmha, the context phase is supposed to perform the Q@K^T operation in FP8, but it seems that QKV is not being properly quantized to FP8. Instead, a scale of 1.0 is used to forcibly cast FP16/BF16 to FP8 in the function applyBiasRopeUpdateKVCache via convert_to_fp8:

if (params.quantized_fp8_output)
{
    // use 1.0f scale currently for qkv input of FP8 FMHA.
    mmha::convert_to_fp8(quantized_q_ptr, q);
}

Am I missing some information, or is this done intentionally, and wouldn't this cause precision issues, or is the numerical range clamped during training? @Tracin
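For context on what that cast does numerically: with a scale of 1.0, the conversion amounts to a round-to-nearest, saturating cast into FP8, assuming the FMHA path uses the usual E4M3 format (largest finite value 448). Below is a minimal host-side sketch of that behavior; it is an illustration only, not the TensorRT-LLM kernel, and fake_quant_e4m3 is a hypothetical stand-in for the hardware conversion.

#include <algorithm>
#include <cmath>
#include <cstdio>

// Hypothetical model of a round-to-nearest, saturating cast to OCP FP8 E4M3
// (1 sign, 4 exponent, 3 mantissa bits; max normal 448, min normal 2^-6,
// subnormal spacing 2^-9). Illustration only.
float fake_quant_e4m3(float x)
{
    if (x == 0.0f || !std::isfinite(x)) return x;
    float ax = std::min(std::fabs(x), 448.0f);     // saturate to the E4M3 max
    float step;
    if (ax < std::ldexp(1.0f, -6))                 // subnormal range
    {
        step = std::ldexp(1.0f, -9);               // fixed spacing of 2^-9
    }
    else
    {
        int e;
        std::frexp(ax, &e);                        // ax = m * 2^e, m in [0.5, 1)
        step = std::ldexp(1.0f, e - 4);            // ulp with 4 significand bits
    }
    return std::copysign(std::round(ax / step) * step, x);
}

int main()
{
    // With scale 1.0, in-range values are merely rounded, but anything beyond
    // +/-448 saturates; a per-tensor scale would have kept the outlier in range.
    const float xs[] = {0.003f, -0.42f, 1.7f, -12.3f, 96.0f, -447.0f, 1000.0f};
    for (float x : xs)
    {
        std::printf("%10.4f -> %10.4f\n", x, fake_quant_e4m3(x));
    }
    return 0;
}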

Tracin commented 1 month ago

[quoting unbelievable3513's comment above]

Yeah, you are right. We set the scale to 1.0 intentionally for fast conversion, and based on our study this won't hurt model accuracy.
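One possible intuition for why a fixed scale of 1.0 can be harmless (my reading of the reply, not an official study): FP8 rounding error is relative, at most about 2^-4 = 6.25% with E4M3's 4 significand bits, so as long as the Q/K/V values stay well inside E4M3's normal range (roughly 2^-6 to 448), dividing by a per-tensor amax scale before the cast does not change the rounding error at all; the scale only matters for avoiding saturation or underflow. A small sketch under those assumptions, where round_to_p_bits is a hypothetical helper that models only the rounding, not saturation or subnormals:

#include <algorithm>
#include <cmath>
#include <cstdio>

// Hypothetical helper: round x to p significand bits (p = 4 for E4M3),
// ignoring saturation and subnormals to isolate the rounding error alone.
float round_to_p_bits(float x, int p)
{
    if (x == 0.0f) return x;
    int e;
    std::frexp(x, &e);                        // |x| = m * 2^e, m in [0.5, 1)
    float step = std::ldexp(1.0f, e - p);     // ulp at this magnitude
    return std::round(x / step) * step;
}

int main()
{
    // In-range "activations", well inside E4M3's normal range of ~2^-6 .. 448.
    const float xs[] = {0.07f, -0.9f, 3.3f, -17.0f, 42.0f, 130.0f};

    float amax = 0.0f;
    for (float x : xs) amax = std::max(amax, std::fabs(x));
    float scale = amax / 448.0f;              // per-tensor amax-based scale

    float err_unit = 0.0f, err_amax = 0.0f;
    for (float x : xs)
    {
        float q_unit = round_to_p_bits(x, 4);                   // scale = 1.0
        float q_amax = round_to_p_bits(x / scale, 4) * scale;   // amax scaling
        err_unit = std::max(err_unit, std::fabs(q_unit - x) / std::fabs(x));
        err_amax = std::max(err_amax, std::fabs(q_amax - x) / std::fabs(x));
    }
    // Both stay below the 2^-4 = 0.0625 relative-error bound for 4 bits.
    std::printf("max rel err, scale 1.0 : %.4f\n", err_unit);
    std::printf("max rel err, amax scale: %.4f\n", err_amax);
    return 0;
}

Under this toy model, a fixed scale only becomes a problem when values exceed 448 (saturation) or sit far below 2^-6 (loss to subnormals and zero).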

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.

github-actions[bot] commented 1 week ago

This issue was closed because it has been stalled for 15 days with no activity.

wanzhenchn commented 3 days ago

[quoting unbelievable3513's comment and Tracin's reply above]

Could you please share the research on the impact of setting the scale to 1.0 on model accuracy? @Tracin