
server: process prompt fairly across slots #6607

Open · phymbert opened this issue 2 months ago

phymbert commented 2 months ago

Context

At the moment we implement a FIFO approach to batching prompt tokens, so if a large prompt is being processed, it blocks all other slots until it completes.

Proposal: implement fair batching of prompt processing across all pending slots.
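For illustration, here is a minimal C++ sketch of one possible fair scheme: split the per-iteration batch budget round-robin across every slot that still has prompt tokens pending. The `server_slot` layout, `prompt_remaining` queue, and `fill_batch_fairly` helper are simplified, hypothetical stand-ins, not the server's actual types:

```cpp
// Minimal sketch of round-robin prompt batching; illustrative names only,
// not llama.cpp's actual server code.
#include <algorithm>
#include <cstdint>
#include <deque>
#include <vector>

struct server_slot {
    int32_t id = 0;
    std::deque<int32_t> prompt_remaining; // prompt tokens not yet submitted
};

// Instead of letting one slot drain the whole batch FIFO-style, give each
// slot with pending prompt tokens an equal share of the token budget.
static void fill_batch_fairly(std::vector<server_slot> & slots,
                              std::vector<int32_t> & batch,
                              int32_t n_batch) {
    int32_t n_pending = 0;
    for (const auto & s : slots) {
        if (!s.prompt_remaining.empty()) {
            n_pending++;
        }
    }
    if (n_pending == 0) {
        return;
    }

    const int32_t per_slot = std::max<int32_t>(1, n_batch / n_pending);

    for (auto & s : slots) {
        for (int32_t i = 0; i < per_slot; i++) {
            if (s.prompt_remaining.empty() || (int32_t) batch.size() >= n_batch) {
                break;
            }
            batch.push_back(s.prompt_remaining.front());
            s.prompt_remaining.pop_front();
        }
    }
}
```

With, say, a batch budget of 512 tokens and four slots waiting on prompts, each slot would contribute up to 128 prompt tokens per iteration, so a single long prompt no longer monopolizes the batch.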

pudepiedj commented 2 months ago

@phymbert What would be the side effects (or other objections/snags) of adding a SLOT_STATE_RESERVED status to the two present slot states, SLOT_STATE_IDLE and SLOT_STATE_PROCESSING, so that some slots could be kept in reserve for new prompts or running chats and new requests wouldn't bump them? It struck me while I was playing with my slot graphics that this might be desirable, and now it has emerged as an issue. What do you think? A rough sketch of what I mean is below.
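Roughly like this (simplified types; `pick_slot` and `is_new_request` are hypothetical helpers, not the real server code):

```cpp
// Sketch of the suggested extra slot state, not a patch against the server.
#include <vector>

enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_RESERVED,   // proposed: held back for brand-new requests
    SLOT_STATE_PROCESSING,
};

struct server_slot {
    int id = 0;
    slot_state state = SLOT_STATE_IDLE;
};

// Reserved slots are only handed out to new requests, so a burst of long
// prompts occupying the regular slots cannot crowd out fresh arrivals.
static server_slot * pick_slot(std::vector<server_slot> & slots, bool is_new_request) {
    for (auto & s : slots) {
        if (s.state == SLOT_STATE_IDLE) {
            return &s;
        }
    }
    if (is_new_request) {
        for (auto & s : slots) {
            if (s.state == SLOT_STATE_RESERVED) {
                return &s;
            }
        }
    }
    return nullptr; // request has to wait in the deferred queue
}
```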