
Compatibility of Flash Attention 3 FP8 Feature with L40 and A100 GPUs #1048

feifeibear commented 2 months ago

Thanks for open-sourcing FA3, good job! I am wondering about the FP8 feature.

Compatibility: Are the NVIDIA L40 and A100 GPUs compatible with the Flash Attention 3 FP8 feature?

Performance: What are the expected performance gains or trade-offs when using Flash Attention 3 FP8 on these GPUs?

Implementation: Is there any specific implementation or software requirement to enable Flash Attention 3 FP8 on L40 and A100 GPUs?

samsja commented 1 month ago

FA3 seems to be designed for the Hopper architecture (H100), so the A100 would not see a performance boost. FP8 is also not natively supported on the A100.
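
For anyone gating on this at runtime, a minimal sketch along these lines checks the compute capability and falls back to FA2 on non-Hopper GPUs (A100 is sm80, L40 is sm89). The `flash_attn_interface` import path for the FA3 build from the `hopper/` directory is an assumption here.

```python
import torch

def pick_attention_backend():
    """Return an attention function: FA3 only on Hopper (sm90), else FA2.

    A100 is sm80 and L40 is sm89 (Ada), so neither takes the FA3/FP8 path.
    """
    major, _minor = torch.cuda.get_device_capability()
    if major == 9:  # Hopper (H100/H800)
        # Assumed import path for the FA3 interface built from hopper/
        from flash_attn_interface import flash_attn_func
        return "fa3", flash_attn_func
    # Ampere (sm80), Ada (sm89), and older fall back to FA2
    from flash_attn import flash_attn_func
    return "fa2", flash_attn_func

backend, attn = pick_attention_backend()
print(f"Using {backend} on {torch.cuda.get_device_name()}")
```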

songh11 commented 1 month ago

> FA3 seems to be designed for the Hopper architecture (H100), so the A100 would not see a performance boost. FP8 is not natively supported on the A100.

Maybe Flash Attention 3 FP8 will be supported on the 4090?

KyeeHuang commented 1 month ago

Is it possible to apply a warp-specialized software-pipelining scheme on the A100?

tridao commented 1 month ago

It's not commonly done. FA2 is already close to optimal on the A100 (around 70% of max theoretical FLOPS).
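
As a rough sanity check of that utilization figure, one can time FA2's `flash_attn_func` and compare against the A100's ~312 TFLOPS BF16 peak; the shapes and iteration counts below are arbitrary choices, not a reference benchmark.

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 4, 8192, 16, 128
q, k, v = (torch.randn(batch, seqlen, nheads, headdim,
                       device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Forward FLOPs for non-causal attention: two matmuls per head,
# Q @ K^T and P @ V, each 2 * seqlen^2 * headdim FLOPs.
flops = 4 * batch * nheads * seqlen * seqlen * headdim

for _ in range(5):  # warm-up
    flash_attn_func(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    flash_attn_func(q, k, v)
end.record()
torch.cuda.synchronize()

seconds = start.elapsed_time(end) / 1e3  # elapsed_time is in ms
tflops = flops * iters / seconds / 1e12
print(f"Achieved ~{tflops:.0f} TFLOPS (A100 BF16 peak is ~312 TFLOPS)")
```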

KyeeHuang commented 1 month ago

Well, for some other GPUs (such as AMD GPUs or GPUs from other manufacturers) that lack most of the Hopper-architecture features like TMA, WGMMA, or FP8, if I want to optimize or design a specific flash-attention, is it better or easier to follow FA3's warp-specialized method? Or is it not necessary?

tridao commented 1 month ago

Warp specialization will be difficult without the async features. Overlapping GEMM and softmax would still be useful.