Closed ss4elby closed 1 month ago
No, it's not implemented yet. I will merge it for the next version
Much appreciated, your work is amazing!
Truly a joyous occasion! This looks very promising!
Hi, can you check if this works fine for you on the latest version?
I checked with my old MX150 and now it works.
The llama.cpp upgrade to CUDA without tensor cores must have solved it. Prompt processing is faster now (around 2x), but generation is a bit slower (around 20%). Still a good tradeoff overall.
It seems to work fine, and holy hell it's quick too. Thank you!
So I noticed it runs WAY slow, then realized my card was not set up for that; I am running ye oldie P40, so no tensor cores. But this fellow over at flash attention apparently made it possible to work without them: https://github.com/ggerganov/llama.cpp/pull/7188 I assume this is not implemented here yet. Any chance?
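For anyone else checking whether flash attention helps or hurts on a card without tensor cores, a rough way to compare is llama.cpp's bundled `llama-bench` tool (the binary name and model path below are assumptions; adjust for your build):

```shell
# Sketch, not a definitive recipe: benchmark with flash attention
# disabled (0) and enabled (1) to compare prompt-processing (pp)
# and token-generation (tg) throughput on your GPU.
# "model.gguf" is a placeholder path.
./llama-bench -m model.gguf -fa 0,1
```

Comparing the pp/tg rows between the two runs shows whether the tradeoff mentioned above (faster prompt processing, slightly slower generation) holds on your hardware.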