Open Mushoz opened 1 week ago
For 1, I haven’t tried the np switch, can you point me to some documentation that describes what it does?
There is a PR open to support 2 here - https://github.com/codelion/optillm/pull/83 but it will not be the same as sampling multiple responses in the same request using the ‘n’ parameter like the OpenAI client can do. Some API providers also do not support it, e.g Claude doesn’t support it but Gemini does. Once the pr is merged I can try benchmarking again.
Sure! The documentation can be read here: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md
Though I think I misunderstood the OptiLLM warning initially. Does it mean that for chatgpt a single request will have a parameter that signifies multiple responses are wanted? In that case, the -np switch is not going to make a difference. The -np switch simply allows the backend to process multiple requests in parallel, but it doesn't allow a single request to request multiple responses.
But that doesn't mean the -np switch is useless though. Instead of a single request requesting multiple responses, OptiLLM could send out multiple requests asynchronously with the same prompt. Those multiple requests would then be able to be processed in parallel. That way, OptiLLM still obtains multiple responses, although not as elegant perhaps as the chatgpt API that supports multiple responses for a single request natively.
It's good to hear that 2. will be supported as a nice stopgap solution!
Yes, the n parameter allows extracting multiple responses from the same request during decoding. This is not the same as sending multiple requests. On top of that many providers will do caching so we will need to change the seed with every request or use some other hack to avoid that. That’s the reason why it won’t work the same as using the n parameter.
To be fair though, the PR to support 2 will run into the exact same issues with caching. I still think it's worthwhile to have the option, since caches can be disabled in some cases (when running locally). But doing 2 through multiple requests done in parallel will be much faster than doing the required requests sequentially.
Caches could potentially also be invalidated to include a nonce at the beginning of the system prompt, but that does indeed sound a little bit hacky.
In the README the following warning can be read:
"Note that the Anthropic API, llama-server (and ollama) currently does not support sampling multiple responses from a model, which limits the available approaches"
I have two questions regarding this warning: