Open Xiaoyiyong555 opened 4 months ago
idk anything about L20
looks like an Ampere card, not Hopper. So no.
The L20 is an Ada Lovelace architecture card, similar to the L40. It supports FP8, and I really hope FA3 can support FP8 inference on the L20.
Feel free to work on it if you need it.
I think that might not be possible, let me explain:
L40/L40S and other Ada-generation GPUs have CUDA compute capability 8.9, which likely lacks the main feature FA3 relies on for its speedup: wgmma.
After some trial and error I found that the current implementation of FA3 crashes in cutlass here: https://github.com/NVIDIA/cutlass/blob/56b46e2d13875b46b8f6a03f9f5ac91e2bfdc01a/include/cute/atom/copy_traits_sm90_tma.hpp#L922C24-L934
CUresult result = cuTensorMapEncodeTiled(
&tma_desc,
tma_format,
tma_dim,
gmem_address,
gmem_prob_shape.data(),
gmem_prob_stride.data() + 1, // gmem_prob_stride[0] implicitly 1
smem_box_shape.data(),
smem_box_stride.data(),
tma_interleave,
smem_swizzle,
tma_l2Promotion,
tma_oobFill);
if (result != CUDA_SUCCESS) {
std::cerr << "TMA Desc Addr: " << &tma_desc
<< "\nformat " << tma_format
<< "\ndim " << tma_dim
<< "\ngmem_address " << gmem_address
<< "\nglobalDim " << gmem_prob_shape
<< "\nglobalStrides " << gmem_prob_stride
<< "\nboxDim " << smem_box_shape
<< "\nelementStrides " << smem_box_stride
<< "\ninterleave " << tma_interleave
<< "\nswizzle " << smem_swizzle
<< "\nl2Promotion " << tma_l2Promotion
<< "\noobFill " << tma_oobFill << std::endl;
std::cerr << "Error: Failed to initialize the TMA descriptor " << result << std::endl;
assert(false);
}
This code calls cuTensorMapEncodeTiled, which, according to the documentation, only works on compute capability 9.0+.
Also, there is no sm89-specific arch code in cutlass.
I would suggest changing this issue title to L40 though, as NVIDIA's resources don't mention the L20 anywhere.
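To make the compute-capability argument concrete, here is a minimal sketch of the kind of gate a caller could apply before attempting the sm90 path. All names (`ComputeCapability`, `is_sm90`, `pick_attention_path`) and the `"fa3"`/`"fa2"` labels are hypothetical and are not FlashAttention or CUTLASS API; the sketch also assumes the FA3 kernels target sm90 only, as the discussion of wgmma and TMA above suggests.

```cpp
#include <string>

// Hypothetical helper types and names, for illustration only.
// None of this is FlashAttention or CUTLASS API.
struct ComputeCapability {
    int major;
    int minor;
};

// True only for sm90 (Hopper). Assumption: the FA3 kernels discussed in
// this thread are built for sm90 because they rely on WGMMA and TMA.
constexpr bool is_sm90(ComputeCapability cc) {
    return cc.major == 9 && cc.minor == 0;
}

// Return a label for the kernel family to try. The labels are placeholders.
inline std::string pick_attention_path(ComputeCapability cc) {
    return is_sm90(cc) ? "fa3" : "fa2";
}

// Compile-time checks for the two cases discussed in the thread.
static_assert(!is_sm90({8, 9}), "Ada (sm89) must not take the sm90 path");
static_assert(is_sm90({9, 0}), "Hopper (sm90) takes the sm90 path");
```

In a real program the major and minor values would typically come from `cudaGetDeviceProperties`, which fills the `major` and `minor` fields of `cudaDeviceProp`. Checking them up front avoids reaching the `cuTensorMapEncodeTiled` call on a device where it is expected to fail.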
Right, Ada architecture doesn't have WGMMA and TMA. FA2 might already be close to optimal for Ada architecture.
@tridao are we still expecting to see FP8 support added to FA without WGMMA and TMA tho?
Ofc that's welcome. Depends on whether people want to contribute.
The L20 is modified from the H100 architecture and also has FP8 capability. Does flash-attn3 support it?