LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Add FA32 Flag/Toggle for RTX cards to prefer CUDA instead? #869

Open Sovvv opened 1 month ago

Sovvv commented 1 month ago

Was wondering if it would be possible to get an option, either on the command line or in the launch GUI, that would allow RTX cards to use the alternate implementation of Flash Attention instead of the one relying on tensor cores?

Bringing this up as I have an RTX 2060 6GB that seems to have performance degradation with the standard implementation. No clue if the alternative would fix the problem, but figured I should file a report in case there are others who might benefit from it.
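
For illustration, here's a rough sketch of what such a toggle might look like in a Python launcher along the lines of koboldcpp.py. The `--fa32` flag name and the backend field it sets are made up for the example and are not existing options; only `--flashattention` exists today:

```python
# Minimal sketch of the requested toggle, assuming an argparse-based
# launcher like koboldcpp.py. --fa32 and flash_attn_force_f32 are
# hypothetical names, not part of the actual koboldcpp CLI or backend.
import argparse

parser = argparse.ArgumentParser(description="hypothetical FA toggle sketch")
parser.add_argument("--flashattention", action="store_true",
                    help="enable flash attention (this flag does exist in koboldcpp)")
parser.add_argument("--fa32", action="store_true",
                    help="hypothetical: prefer the non-tensor-core FA kernels")
args = parser.parse_args()

backend_opts = {
    "flash_attn": args.flashattention,
    # Hypothetical field the C++ backend would read to select the
    # alternate (non-tensor-core) kernel path instead of the MMA path.
    "flash_attn_force_f32": args.fa32,
}
print(backend_opts)
```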

Dampfinchen commented 1 month ago

> Was wondering if it would be possible to get an option, either on the command line or in the launch GUI, that would allow RTX cards to use the alternate implementation of Flash Attention instead of the one relying on tensor cores?
>
> Bringing this up as I have an RTX 2060 6GB that seems to have performance degradation with the standard implementation. No clue if the alternative would fix the problem, but figured I should file a report in case there are others who might benefit from it.

I have an RTX 2060 as well and I'm getting higher performance with Flash Attention, provided I'm using full GPU offloading and a context of 8192 tokens with an 8B model.
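
For reference, this is roughly how you can compare throughput between two runs (restart the server with and without `--flashattention` and run it each time). It assumes the default port 5001 and the usual KoboldAI-compatible `/api/v1/generate` payload, so adjust if your setup differs:

```python
# Rough throughput check against koboldcpp's KoboldAI-compatible API.
# Assumes the default port 5001; the tokens/sec figure is approximate
# since it assumes the server generated exactly max_length tokens.
import json
import time
import urllib.request

def time_generation(prompt: str, max_length: int = 200,
                    url: str = "http://localhost:5001/api/v1/generate") -> float:
    payload = json.dumps({"prompt": prompt, "max_length": max_length}).encode()
    req = urllib.request.Request(url, data=payload,
                                 headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    elapsed = time.perf_counter() - start
    return max_length / elapsed  # approximate tokens/sec

if __name__ == "__main__":
    print(f"~{time_generation('Once upon a time'):.1f} tokens/sec")
```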

Perhaps it's reducing performance in partial offload for you? Partial offload FA is pretty slow right now.
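
If you want to rule that out, something like the following launches with every layer on the GPU. The flags themselves (`--model`, `--usecublas`, `--gpulayers`, `--contextsize`, `--flashattention`) are existing koboldcpp options; the model path and layer count here are placeholders for your own setup:

```python
# Sketch: launch koboldcpp with full GPU offload to avoid the slow
# partial-offload FA path. Model path and layer count are placeholders.
import subprocess

subprocess.run([
    "python", "koboldcpp.py",
    "--model", "model.gguf",   # placeholder path to your GGUF model
    "--usecublas",             # CUDA backend
    "--gpulayers", "99",       # any value >= the model's layer count offloads everything
    "--contextsize", "8192",
    "--flashattention",
])
```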