LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

flashattention flag #818

Closed. Vladonai closed this issue 4 months ago

Vladonai commented 5 months ago

It is unclear which models are compatible with this flag. Are new models needed, or are only certain model architectures supported? I also wonder whether this feature works with the Pascal GPU architecture.
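(For reference, the flag is a launch-time option rather than something stored in the model file. Assuming the standard Python launcher and the CUDA backend, a typical invocation looks roughly like

    python koboldcpp.py --model yourmodel.gguf --usecublas --flashattention

Treat this as a sketch; check koboldcpp's --help output for the exact flag names in your version.)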

gustrd commented 5 months ago

I was able to get my RTX card working with it and saw a 3x prompt-processing speedup (Llama3-8B model).

But my MX card, using the same config, failed with the following error:

ggml-cuda/fattn.cu:571: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 610. ggml-cuda.cu was compiled for: 610
CUDA error: unspecified launch failure
  current device: 0, in function ggml_backend_cuda_synchronize at ggml-cuda.cu:2414
  cudaStreamSynchronize(cuda_ctx->stream())
GGML_ASSERT: ggml-cuda.cu:63: !"CUDA error"

I tried different build parameters, but the error persists.

I'm compiling from source with CUDA 12 on Debian.

LostRuins commented 5 months ago

Flash Attention only reliably works on cards from Turing (RTX 20XX series) onwards. Your card is probably too old.
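(Turing corresponds to CUDA compute capability 7.5, while Pascal cards such as the MX series report 6.1, which matches the "610" in the error above. A minimal standalone CUDA sketch, not part of koboldcpp, to check what your card reports:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
            printf("No CUDA devices found\n");
            return 1;
        }
        for (int i = 0; i < count; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            // e.g. 61 for Pascal, 75 for Turing, 86 for Ampere
            int cc = prop.major * 10 + prop.minor;
            printf("Device %d: %s, compute capability %d.%d -> %s\n",
                   i, prop.name, prop.major, prop.minor,
                   cc >= 75 ? "Turing or newer (flash attention should work)"
                            : "pre-Turing (flash attention likely unsupported)");
        }
        return 0;
    }

Compile with nvcc and run it; any device reporting below 7.5 is expected to hit the "no device code compatible" error when the flash attention kernels are used.)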

gustrd commented 5 months ago

> Flash Attention only reliably works on cards from Turing (RTX 20XX series) onwards. Your card is probably too old.

Thanks. I imagined that was the case.

What about CLBlast, Metal, and OpenBLAS? Is FlashAttention CUDA-specific?

LostRuins commented 5 months ago

CUDA and Metal only. There's a CPU implementation for the rest, but it provides no speedup.