LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Remove some old Cuda implementation from KCPP and realign on LCPP #954

Closed · Nexesenex closed this 3 days ago

Nexesenex commented 3 days ago

This once-beneficial implementation is now deprecated and harmful, to lowvram mode at the very least. After extensive benchmarking in Frankenstein KCPP v1.68Y, I confirmed that lowvram now works as intended in the many cases where it had stopped working (notably with Flash Attention, which postdates these fixes). PP is back to normal (no longer a tenth or less of the speed without lowvram, notably on dual GPU), TG doesn't drop and even improves a bit, and there are no more coherence faults (which can happen with MMQ) or crashes.

Remaining problem: with lowvram, TG with FA is still 30-35% slower than TG without FA, while PP, on the other hand, is 10 times faster (lowvram with FA vs. lowvram without FA). Go figure!

LostRuins commented 3 days ago

I'm not sure why you would observe an improvement using the non-deallocation approach. Are you sure you are benchmarking this correctly? It should not have any impact on lowvram mode at all; it's only used when allocating memory on the GPU.

The way memory is allocated should be very similar to how upstream does it, but with the added advantage that when the pool has an available slot but cannot find a buffer that fits, it will resize the largest one to save memory (when not using mmq). Upstream, by contrast, will just attempt to allocate a new, larger buffer, and you end up with multiple buffers that are too small to be used but are left in the pool.
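A minimal sketch of that pooling strategy in plain C++, for readers following along; the names (`pool_alloc`, `device_alloc`, etc.) are hypothetical, and `device_alloc`/`device_free` stand in for `cudaMalloc`/`cudaFree` so the example is self-contained. This is not the actual KCPP or upstream code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

struct FreeBuffer { void* ptr; size_t size; };

static std::vector<FreeBuffer> free_pool;   // buffers returned for reuse

// Stand-ins for cudaMalloc/cudaFree so the sketch compiles on its own.
static void* device_alloc(size_t size) { return std::malloc(size); }
static void  device_free(void* ptr)    { std::free(ptr); }

// KCPP-style strategy: reuse the smallest free buffer that fits; if none
// fits but the pool has free buffers, "resize" the largest one (free it and
// allocate at the requested size) rather than stranding it and adding yet
// another, bigger buffer.
void* pool_alloc(size_t size) {
    size_t best_i = SIZE_MAX, best_size = SIZE_MAX;
    size_t big_i  = SIZE_MAX, big_size  = 0;
    for (size_t i = 0; i < free_pool.size(); ++i) {
        size_t s = free_pool[i].size;
        if (s >= size && s < best_size) { best_i = i; best_size = s; }
        if (s > big_size)               { big_i  = i; big_size  = s; }
    }
    size_t take = (best_i != SIZE_MAX) ? best_i : big_i;
    if (take != SIZE_MAX) {
        FreeBuffer buf = free_pool[take];
        free_pool.erase(free_pool.begin() + take);
        if (buf.size >= size) return buf.ptr;   // found a fit: plain reuse
        device_free(buf.ptr);                   // no fit: resize the largest
        return device_alloc(size);
    }
    return device_alloc(size);                  // pool empty: fresh allocation
}

void pool_free(void* ptr, size_t size) {
    free_pool.push_back({ptr, size});           // keep the buffer for later reuse
}
```

The contrast with the upstream behaviour described above is the `device_free` plus `device_alloc` branch: upstream would skip it and simply allocate a new buffer, leaving the too-small free buffers sitting in the pool.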

Can you verify again that this change alone provides a speedup with no increase in GPU memory usage? Try both with and without mmq.
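For the memory-usage half of that check, one option besides watching nvidia-smi is a quick spot check with the standard CUDA runtime API; the helper below is illustrative and not part of KCPP, but `cudaMemGetInfo` is the real CUDA call:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print current device memory usage, e.g. before and after a benchmark run.
void print_vram_usage(const char* label) {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) == cudaSuccess) {
        printf("%s: %.1f MiB used of %.1f MiB\n", label,
               (total_bytes - free_bytes) / (1024.0 * 1024.0),
               total_bytes / (1024.0 * 1024.0));
    }
}
```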

Nexesenex commented 3 days ago

Indeed, that doesn't make any sense.

Edit: I retested fully, and I had indeed messed up my benchmark. One test ran with 1 thread, the other with 6 threads. That, not this PR, explains the bad results before and the good ones after. I apologize for the mistake. PR closed.