Open feifeibear opened 4 months ago
FA3 seems to be designed for the Hopper architecture (H100), so the A100 would not see a performance boost. FP8 is also not natively supported on the A100.
Maybe Flash Attention 3 FP8 will be supported on the 4090?
Is it possible to apply warp-specialized software pipelining scheme on A100?
It's not commonly done. FA2 is already close to optimal on the A100 (about 70% of max theoretical FLOPS).
Well, some other GPUs (such as AMD GPUs or GPUs from other manufacturers) lack most of the Hopper architecture's features, e.g. TMA, WGMMA, or FP8. If I want to optimize or design a flash-attention kernel specifically for them, is it better or easier to follow FA3's warp-specialized design? Or is that unnecessary?
Warp specialization will be difficult without the async features. Overlapping GEMM and softmax would still be useful, though.
Thanks for open-sourcing FA3, good job! I am wondering about the FP8 feature.
Compatibility: Are the NVIDIA L40 and A100 GPUs compatible with the Flash Attention 3 FP8 feature?
Performance: What are the expected performance gains or trade-offs when using Flash Attention 3 FP8 on these GPUs?
Implementation: Is there any specific implementation or software requirement to enable Flash Attention 3 FP8 on L40 and A100 GPUs?
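Since the answers above hinge on which architecture features a GPU has, here is a minimal sketch of a runtime capability check. This helper is hypothetical (not part of the FlashAttention API); it assumes, per this thread, that FA3's FP8 path needs Hopper (sm90), while Ada (sm89: L40, 4090) and Ampere (sm80: A100) fall back to FA2:

```python
# Hypothetical helper, not part of FlashAttention: pick an attention backend
# from a GPU's CUDA compute capability. Assumptions from this thread:
#   - FA3 FP8 relies on Hopper features (TMA, WGMMA), i.e. sm90+.
#   - Ada (sm89) has FP8 tensor cores but lacks TMA/WGMMA.
#   - Ampere (sm80) lacks both, so FA2 is the practical choice there.

COMPUTE_CAPABILITY = {
    "A100": (8, 0),      # Ampere
    "L40": (8, 9),       # Ada Lovelace
    "RTX 4090": (8, 9),  # Ada Lovelace
    "H100": (9, 0),      # Hopper
}

def fa3_fp8_supported(cc):
    """FA3's FP8 kernels target Hopper (sm90) and newer."""
    return cc >= (9, 0)

def recommended_backend(cc):
    """Return a rough backend recommendation for a compute capability."""
    return "FA3 (FP8 capable)" if fa3_fp8_supported(cc) else "FA2"

if __name__ == "__main__":
    for name, cc in COMPUTE_CAPABILITY.items():
        print(f"{name} (sm{cc[0]}{cc[1]}): {recommended_backend(cc)}")
```

In a real PyTorch program you would obtain the tuple at runtime with `torch.cuda.get_device_capability()` instead of the hard-coded table above.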