codelion / optillm

Optimizing inference proxy for LLMs
Apache License 2.0
1.64k stars 130 forks source link

Warning: Do not support sampling multiple responses #99

Open Mushoz opened 1 week ago

Mushoz commented 1 week ago

In the README the following warning can be read:

"Note that the Anthropic API, llama-server (and ollama) currently does not support sampling multiple responses from a model, which limits the available approaches"

I have two questions regarding this warning:

  1. Is this accurate? Llama-server allows for the -np switch which will allow for decoding parallel requests. Should this allow the MOA approach to work for example?
  2. Would it be possible to do these requests sequentially instead of in parallel? I understand that this won't be ideal for speed, but it's better than not working at all.
codelion commented 1 week ago

For 1, I haven’t tried the np switch, can you point me to some documentation that describes what it does?

There is a PR open to support 2 here - https://github.com/codelion/optillm/pull/83 but it will not be the same as sampling multiple responses in the same request using the ‘n’ parameter like the OpenAI client can do. Some API providers also do not support it, e.g Claude doesn’t support it but Gemini does. Once the pr is merged I can try benchmarking again.

Mushoz commented 1 week ago

Sure! The documentation can be read here: https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md

Though I think I misunderstood the OptiLLM warning initially. Does it mean that for chatgpt a single request will have a parameter that signifies multiple responses are wanted? In that case, the -np switch is not going to make a difference. The -np switch simply allows the backend to process multiple requests in parallel, but it doesn't allow a single request to request multiple responses.

But that doesn't mean the -np switch is useless though. Instead of a single request requesting multiple responses, OptiLLM could send out multiple requests asynchronously with the same prompt. Those multiple requests would then be able to be processed in parallel. That way, OptiLLM still obtains multiple responses, although not as elegant perhaps as the chatgpt API that supports multiple responses for a single request natively.

It's good to hear that 2. will be supported as a nice stopgap solution!

codelion commented 1 week ago

Yes, the n parameter allows extracting multiple responses from the same request during decoding. This is not the same as sending multiple requests. On top of that many providers will do caching so we will need to change the seed with every request or use some other hack to avoid that. That’s the reason why it won’t work the same as using the n parameter.

Mushoz commented 1 week ago

To be fair though, the PR to support 2 will run into the exact same issues with caching. I still think it's worthwhile to have the option, since caches can be disabled in some cases (when running locally). But doing 2 through multiple requests done in parallel will be much faster than doing the required requests sequentially.

Caches could potentially also be invalidated to include a nonce at the beginning of the system prompt, but that does indeed sound a little bit hacky.