LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Flash Attention #844

Closed: ss4elby closed this issue 1 month ago

ss4elby commented 1 month ago

So I noticed it runs WAY slower than expected, then realized my card isn't set up for that; I'm running ye olde P40, so no tensor cores. But this fellow over at llama.cpp apparently made flash attention work without them: https://github.com/ggerganov/llama.cpp/pull/7188. I assume this is not implemented here yet, any chance it could be?

LostRuins commented 1 month ago

No, it's not implemented yet. I will merge it for the next version.

ss4elby commented 1 month ago

Appreciated, your work is something amazing!

Spacellary commented 1 month ago

Truly a joyous occasion! This looks very promising!

LostRuins commented 1 month ago

Hi, can you check if this works fine for you on the latest version?
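
For anyone else testing this, a minimal launch sketch (not an official command from this thread): the exact flag names (`--usecublas`, `--gpulayers`, `--flashattention`, `--contextsize`) are assumptions and may differ by version, so check `python koboldcpp.py --help` first.

```
# Hypothetical example of enabling flash attention on a CUDA GPU.
# Flag names are assumptions; confirm with `python koboldcpp.py --help`.
python koboldcpp.py --model model.gguf --usecublas --gpulayers 99 --flashattention --contextsize 4096
```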

gustrd commented 1 month ago

I checked with my old MX150 and now it works.

The llama.cpp upgrade that enables CUDA flash attention without tensor cores must have solved it. Prompt processing is faster now (around 2x), but generation is a bit slower (around 20%). Still a good tradeoff in the end.
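
As a rough way to reproduce that kind of before/after comparison (not from the thread): llama.cpp ships a `llama-bench` tool that can toggle flash attention. A sketch, assuming your build's `-fa` flag accepts a comma-separated 0,1 sweep and that `model.gguf` is your model path:

```
# Hypothetical benchmark sketch; model path and flag support are assumptions.
# -fa 0,1 runs the same test with flash attention off and then on,
# -p measures prompt processing tokens, -n measures generated tokens.
./llama-bench -m model.gguf -p 512 -n 128 -fa 0,1
```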

ss4elby commented 1 month ago

It seems to work fine, holy hell it's quick too. Thank you!