LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Out of VRAM even with zero GPU layers due to BLAS for llama-405 and mistral-large. #1082


AphidGit commented 3 weeks ago

The BLAS memory usage for mistral-large-2 and llama-405B, when using their full 128K context, is too large for any affordable single GPU to handle.

For example, in the most extreme case, llama-405B at full context comes in at 95.16GB, so the model will not run in koboldcpp on any GPU on the planet. With a 24GB GPU, you'd be restricted to on the order of 25-30K context.

What are the possibilities of dealing with this? Could multi-GPU support for BLAS processing be implemented? Long context is one of the cases where local LLM generation is superior to using services, since caching lets the prompt be processed only once rather than once with every reply. (A case of Schlemiel the painter: local is O(N), while remote is O(N^2), because the old prompts and replies are reprocessed with every new prompt.)
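
As a rough back-of-the-envelope illustration (not koboldcpp code; the prompt and reply sizes below are made up for the example), here is how the total number of tokens processed over a conversation compares with and without prompt caching:

```python
# Toy comparison of total tokens processed over a conversation.
# PROMPT_TOKENS and REPLY_TOKENS are arbitrary example values.
PROMPT_TOKENS = 8000   # hypothetical initial prompt / story context
REPLY_TOKENS = 300     # hypothetical tokens added per exchange
TURNS = 50

# With caching (local): only the new tokens are processed each turn.
cached = PROMPT_TOKENS + TURNS * REPLY_TOKENS

# Without caching (typical remote service): the whole growing prompt
# is reprocessed on every turn.
uncached = sum(PROMPT_TOKENS + turn * REPLY_TOKENS for turn in range(TURNS))

print(f"cached:   {cached:,} tokens processed")    # 23,000  -> O(N)
print(f"uncached: {uncached:,} tokens processed")  # 767,500 -> O(N^2)
```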

LostRuins commented 3 weeks ago

You could try using a smaller quant, or reducing the BLAS batch size.
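
Roughly speaking (this is a simplified estimate, not the exact buffers koboldcpp/llama.cpp allocate), the temporary attention-score buffer used during prompt processing grows with both the context length and the batch size, so shrinking the batch shrinks that buffer. The model dimensions below are assumptions loosely in the range of a 405B-class model:

```python
# Simplified estimate only -- not the exact allocation koboldcpp makes.
# Attention scores during prompt processing are roughly of shape
# (n_heads, n_batch, n_ctx) in f32, so scratch memory scales with both
# context length and batch size.
N_HEADS = 128     # assumed attention head count
N_CTX = 131072    # 128K context
BYTES_F32 = 4

def attn_scores_gib(n_batch: int) -> float:
    """Approximate size of the attention-score scratch buffer in GiB."""
    return N_HEADS * n_batch * N_CTX * BYTES_F32 / 1024**3

for n_batch in (2048, 512, 128, 32):
    print(f"batch {n_batch:>4}: ~{attn_scores_gib(n_batch):.1f} GiB")
# batch 2048: ~128.0 GiB
# batch  512: ~32.0 GiB
# batch  128: ~8.0 GiB
# batch   32: ~2.0 GiB
```

Under that assumption, dropping the batch size trades slower prompt processing for a much smaller buffer; the KV cache itself is a separate, additional cost that still scales with the context length.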