Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License
14.46k stars 1.36k forks source link

flash-attn3 supported L20? #1068

Open Xiaoyiyong555 opened 4 months ago

Xiaoyiyong555 commented 4 months ago

l20 is modified from the H100 architecture and also has FP8 capability. Does flash-attn3 support it?

tridao commented 4 months ago

idk anything about L20

tridao commented 4 months ago

looks like an Ampere card, not Hopper. So no.

sgxu commented 4 months ago

L20 is Ada Lovelace architecture, is similar to L40. L20 supports FP8, I really hope that FA3 can support FP8 inference on the L20 graphics card

tridao commented 4 months ago

Feel free to work on it if you need it.

Willian-Zhang commented 4 months ago

I think that might be not possible, lemme explain:

Willian-Zhang commented 4 months ago

L40/ L40s and other ada generation GPUs are using CUDA compatibility 8.9 which might lack the capability of what FA3 mainly rely on to improve: wgmma

Detail

After some try and error I found the current implementation of FA3 would crash on cutlass here: https://github.com/NVIDIA/cutlass/blob/56b46e2d13875b46b8f6a03f9f5ac91e2bfdc01a/include/cute/atom/copy_traits_sm90_tma.hpp#L922C24-L934

    CUresult result = cuTensorMapEncodeTiled(
        &tma_desc,
        tma_format,
        tma_dim,
        gmem_address,
        gmem_prob_shape.data(),
        gmem_prob_stride.data() + 1,  // gmem_prob_stride[0] implicitly 1
        smem_box_shape.data(),
        smem_box_stride.data(),
        tma_interleave,
        smem_swizzle,
        tma_l2Promotion,
        tma_oobFill);

    if (result != CUDA_SUCCESS) {
      std::cerr << "TMA Desc Addr:   " << &tma_desc
                << "\nformat         " << tma_format
                << "\ndim            " << tma_dim
                << "\ngmem_address   " << gmem_address
                << "\nglobalDim      " << gmem_prob_shape
                << "\nglobalStrides  " << gmem_prob_stride
                << "\nboxDim         " << smem_box_shape
                << "\nelementStrides " << smem_box_stride
                << "\ninterleave     " << tma_interleave
                << "\nswizzle        " << smem_swizzle
                << "\nl2Promotion    " << tma_l2Promotion
                << "\noobFill        " << tma_oobFill << std::endl;
      std::cerr << "Error: Failed to initialize the TMA descriptor " << result << std::endl;
      assert(false);

which calls cuTensorMapEncodeTiled, which, according to documentation, only works on compatibility 9.0+

image

Also there is no specific arch code on cutlass for sm89.

Willian-Zhang commented 4 months ago

I would suggest to change this issue title to L40 tho, as there aren't any mentioning on Nvidia resources for L20.

tridao commented 4 months ago

Right, Ada architecture doesn't have WGMMA and TMA. FA2 might already be close to optimal for Ada architecture.

Willian-Zhang commented 4 months ago

@tridao are we still expecting to see FP8 support added to FA without WGMMA and TMA tho?

tridao commented 4 months ago

Ofc that's welcome. Depends on whether people want to contribute.