BLAS (prompt-processing) memory usage for mistral-large-2 and llama-405B, when using their full 128K context, is too large for any affordable single GPU to handle.
For example, in the most extreme case, llama-405B at full context needs a 95.16GB BLAS buffer, which no GPU on the market can hold, so the model will not run in koboldcpp at that context size. With a 24GB GPU, you'd be restricted to roughly 25-30K context.
What are the options for dealing with this? Could multi-GPU support for BLAS processing be implemented? Long context is one of the cases where local LLM generation is superior to hosted services: prompt caching means each prompt only has to be processed once, rather than re-processed with every reply. (A case of Schlemiel the painter: local is O(N) total work, while remote is O(N^2), since all previous prompts and replies are reprocessed with every new prompt.)
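To make the O(N) vs. O(N^2) claim concrete, here is a minimal sketch (not koboldcpp code; the function name and turn sizes are illustrative) counting total prompt tokens processed over a conversation, with and without prompt caching:

```python
# Illustrative sketch: total prompt tokens processed over a conversation.
# Assumes each turn appends `turn_tokens` new tokens to the running context.

def tokens_processed(turns: int, turn_tokens: int, cached: bool) -> int:
    total = 0
    context = 0
    for _ in range(turns):
        context += turn_tokens
        if cached:
            # Local with a prompt cache: only the new tokens are processed.
            total += turn_tokens
        else:
            # Remote without caching: the whole history is reprocessed each turn.
            total += context
    return total

# 100 turns of 500 tokens each:
local = tokens_processed(100, 500, cached=True)    # 50,000 tokens  -> O(N)
remote = tokens_processed(100, 500, cached=False)  # 2,525,000 tokens -> O(N^2)
```

With caching, work grows linearly with conversation length; without it, the same 100-turn conversation costs about 50x more prompt processing, and the gap widens as the context fills.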