LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Remove some old Cuda implementation from KCPP and realign on LCPP #954

Closed · Nexesenex closed this 3 days ago

Nexesenex commented 3 days ago

This once-beneficial implementation is now deprecated and harmful, to lowvram mode at the very least. After extensive benchmarking in Frankenstein KCPP v1.68Y, I confirmed that lowvram now works as intended in the many cases where it had stopped working (notably with Flash Attention, which postdates these fixes). PP is back to normal (no longer a tenth or less of the speed without lowvram, notably on dual GPU), TG doesn't drop and even improves a bit, and there are no more coherence faults (which can happen with MMQ) or crashes.

Remaining problem: with lowvram, TG with FA is still 30-35% slower than TG without FA, while PP, on the other hand, is 10 times faster (lowvram with FA vs. lowvram without FA). Go figure!

LostRuins commented 3 days ago

I'm not sure why you would observe an improvement using the non-deallocation approach. Are you sure you are benchmarking this correctly? It should not have any impact on lowvram mode at all; it's only used when allocating memory on the GPU.

The way memory is allocated should be very similar to how upstream does it, but with the added advantage that when the pool has an available slot but cannot find a buffer that fits, it will resize the largest one to save memory (when not using mmq). Upstream, by contrast, will just attempt to allocate a new, larger buffer, and you end up with multiple buffers that are too small to be used but are left in the pool.
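A minimal sketch of that pooling strategy in plain C++, for readers following along; the names (`pool_alloc`, `device_alloc`, etc.) are hypothetical, and `device_alloc`/`device_free` stand in for `cudaMalloc`/`cudaFree` so the example is self-contained. This is not the actual KCPP or upstream code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct FreeBuffer { void* ptr; size_t size; };

static std::vector<FreeBuffer> free_pool;   // buffers returned for reuse

// Stand-ins for cudaMalloc/cudaFree so the sketch compiles on its own.
static void* device_alloc(size_t size) { return std::malloc(size); }
static void  device_free(void* ptr)    { std::free(ptr); }

// KCPP-style strategy: reuse the smallest free buffer that fits; if none
// fits but the pool has free buffers, "resize" the largest one (free it and
// allocate at the requested size) rather than stranding it and adding yet
// another, bigger buffer.
void* pool_alloc(size_t size) {
    size_t best_i = SIZE_MAX, best_size = SIZE_MAX;
    size_t big_i  = SIZE_MAX, big_size  = 0;
    for (size_t i = 0; i < free_pool.size(); ++i) {
        size_t s = free_pool[i].size;
        if (s >= size && s < best_size) { best_i = i; best_size = s; }
        if (s > big_size)               { big_i  = i; big_size  = s; }
    }
    size_t take = (best_i != SIZE_MAX) ? best_i : big_i;
    if (take != SIZE_MAX) {
        FreeBuffer buf = free_pool[take];
        free_pool.erase(free_pool.begin() + take);
        if (buf.size >= size) return buf.ptr;   // found a fit: plain reuse
        device_free(buf.ptr);                   // no fit: resize the largest
        return device_alloc(size);
    }
    return device_alloc(size);                  // pool empty: fresh allocation
}

void pool_free(void* ptr, size_t size) {
    free_pool.push_back({ptr, size});           // keep the buffer for later reuse
}
```

The contrast with the upstream behaviour described above is the `device_free` plus `device_alloc` branch: upstream would skip it and simply allocate a new buffer, leaving the too-small free buffers sitting in the pool.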

Can you verify again that this change alone provides a speedup with no increase in GPU memory usage? Try both with and without mmq.
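For the memory-usage half of that check, one option besides watching nvidia-smi is a quick spot check with the standard CUDA runtime API; the helper below is illustrative and not part of KCPP, but `cudaMemGetInfo` is the real CUDA call:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print current device memory usage, e.g. before and after a benchmark run.
void print_vram_usage(const char* label) {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
        printf("%s: %.1f MiB used of %.1f MiB\n", label,
               (total_bytes - free_bytes) / (1024.0 * 1024.0),
               total_bytes / (1024.0 * 1024.0));
    }
}
```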

Nexesenex commented 3 days ago

Indeed, that doesn't make any sense.

Edit: I retested fully, and I had indeed messed up my benchmark. One test ran with 1 thread, the other with 6 threads. That, not this PR, explains the bad results before and the good ones after. I apologize for the mistake. PR closed.