Closed · Vladonai closed this 4 months ago
I was able to get it working on my RTX card and saw a 3x prompt-processing speedup (Llama3-8B model).
But my MX card, using the same config, fails with the following error:
ggml-cuda/fattn.cu:571: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 610. ggml-cuda.cu was compiled for: 610
CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:2414
  cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: ggml-cuda.cu:63: !"CUDA error"
I tried different build parameters, but the error persists. I'm compiling from source with CUDA 12 on Debian.
Flash Attention only works reliably on Turing (RTX 20XX series) and newer cards. Your card is probably too old.
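The "CUDA arch 610" in the error is the compute capability (6.1, i.e. Pascal), which is what MX-series laptop GPUs report and which is below Turing's 7.5. If you want to verify what your devices report, here's a minimal standalone CUDA sketch (not part of llama.cpp, just the plain runtime API):

```cuda
// check_cc.cu — standalone sketch; build with: nvcc check_cc.cu -o check_cc
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // Pascal cards (e.g. MX150/MX250) report 6.1, below Turing's 7.5
        printf("device %d: %s, compute capability %d.%d\n",
               i, prop.name, prop.major, prop.minor);
    }
    return 0;
}
```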
Thanks, I imagined that was the case.
What about CLBlast, Metal, and OpenBLAS? Is Flash Attention CUDA-specific?
CUDA and Metal only. There's a CPU implementation for the rest, but it provides no speedup.
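For reference, it's a runtime switch rather than a build option. A minimal sketch of turning it on through the C API, assuming the llama.cpp API of this period (the model path is a placeholder; on the command line the equivalent is the -fa / --flash-attn flag):

```c
// fa_demo.c — minimal sketch, assuming the llama.cpp C API of this period.
#include <stdbool.h>
#include <stdio.h>
#include "llama.h"

int main(void) {
    llama_backend_init();

    struct llama_model_params mparams = llama_model_default_params();
    // placeholder path — point this at your own GGUF file
    struct llama_model * model = llama_load_model_from_file("model.gguf", mparams);
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn = true; // same switch as the -fa / --flash-attn CLI flag

    struct llama_context * ctx = llama_new_context_with_model(model, cparams);
    // ... evaluate prompts as usual; the flag only changes how attention is computed ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```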
It's unclear which models are compatible with this flag. Are new models needed, or are only specific model architectures supported? I also wonder whether this feature is compatible with the Pascal GPU architecture.