codelion / optillm

Optimizing inference proxy for LLMs
Apache License 2.0

Implement cot decoding with llama.cpp #65

Closed · codelion closed this 1 week ago

henryclw commented 1 month ago

Hi there, I'm trying to do the same thing as well. Currently I'm working with llama-cpp-python's low-level API (https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#low-level-api). Hope this is helpful to you as well. Cheers.
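
In case it helps, even the high-level `Llama` wrapper can return per-token alternatives via `logprobs`. A rough, untested sketch (double-check the parameter names against your installed version):

```python
# Rough sketch, untested: get the top-k alternatives for each generated token
# via llama-cpp-python's high-level API (logprobs needs logits_all=True).
from llama_cpp import Llama

llm = Llama(model_path="model.gguf", logits_all=True)  # path is a placeholder

out = llm.create_completion(
    prompt="Q: What is 2 + 3?\nA:",
    max_tokens=16,
    temperature=0.0,
    logprobs=5,  # return the 5 most likely tokens at each step
)

# choices[0]["logprobs"]["top_logprobs"] is a list (one entry per generated
# token) of {token_string: logprob} dicts -- the per-step distribution we need.
for step in out["choices"][0]["logprobs"]["top_logprobs"]:
    print(step)
```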

codelion commented 1 month ago

Hey, that is great. It would be lovely if you could submit a PR; you can implement it as a plugin in optillm.

henryclw commented 1 month ago

@codelion Hi, I've run into some difficulties while trying to push this forward. Many of the methods in this repo are in fact custom sampling strategies, and they require:

  1. when predicting, the model should return the candidate next tokens together with their probabilities, rather than a single token already sampled from the distribution.
  2. the ability to evaluate pre-existing context, returning the probability of that given context.

However, I checked both the low-level API in llama-cpp-python and vLLM, and it is hard to expose these two abilities as clean functions/plugins, even though the probabilities are available there. Another option, which I admit is a strange and ugly one, is to use the existing llama.cpp HTTP API with no extra configuration. It also works with Ollama, and I know many people use Ollama because it is easier to set up than llama.cpp. The idea is to work through the HTTP API, in particular the n_probs and grammar parameters. That way we can force the model to give us the distribution (only over the most likely tokens, not the full vocabulary) as well as the ability to evaluate pre-existing context. It is an ugly method, but it could serve more people and is less likely to suffer dependency breaks.
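
Something like this is what I have in mind, a rough, untested sketch against the server's `/completion` endpoint (the exact response fields such as `completion_probabilities` vary between server versions, so treat them as placeholders):

```python
# Untested sketch: ask a running llama.cpp server for top-k probabilities
# per generated token. The response field names ("completion_probabilities",
# and the per-token entries inside it) depend on the server version, so the
# parsing below will likely need adjusting.
import requests

SERVER = "http://localhost:8080"  # default llama.cpp server address

resp = requests.post(
    f"{SERVER}/completion",
    json={
        "prompt": "Q: What is 2 + 3?\nA:",
        "n_predict": 16,
        "temperature": 0.0,
        "n_probs": 5,  # return the 5 most likely tokens at each step
    },
).json()

# Each generated position comes with a small top-k distribution, which is
# enough to branch over alternatives and to score the model's confidence.
for pos in resp.get("completion_probabilities", []):
    print(pos)
```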

I'm writing to ask for advice and to discuss which approach we should take. Thank you.

codelion commented 1 month ago

The best way to implement it may be in C++ directly here - https://github.com/ggerganov/llama.cpp/blob/master/src/llama-sampling.cpp
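
Whatever language it ends up in, the core scoring step of CoT decoding is small: branch over the top-k first tokens, greedily decode each branch, and score each branch by the average gap between the top-1 and top-2 token probabilities. A rough sketch of that scoring step (based on my reading of the paper, not existing optillm or llama.cpp code):

```python
# Sketch of the CoT-decoding confidence score (not existing code): for each
# generated token, take the gap between the two most likely candidates and
# average that gap over the answer span.
def cot_confidence(top2_probs_per_token):
    """top2_probs_per_token: list of (p_top1, p_top2) pairs, one per token."""
    if not top2_probs_per_token:
        return 0.0
    gaps = [p1 - p2 for p1, p2 in top2_probs_per_token]
    return sum(gaps) / len(gaps)

# The branch (one per top-k first token) with the highest score is the one
# whose answer the model is most confident about.
```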

codelion commented 1 week ago

We now have cot_decoding directly implemented in optillm via the local inference server - https://github.com/codelion/optillm?tab=readme-ov-file#local-inference-server
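
A minimal usage sketch (see the README linked above for the exact setup; the model name, port, and API key below are placeholders):

```python
# Example call against optillm's local inference server; model name, port,
# and API key here are placeholders -- check the README for your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any HF model served locally
    messages=[{"role": "user", "content": "Q: What is 2 + 3?"}],
    extra_body={"decoding": "cot_decoding"},  # select CoT decoding
)
print(response.choices[0].message.content)
```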

Closing this ticket, as this work needs to be implemented in llama.cpp itself; there is a discussion open in their repo here - https://github.com/ggerganov/llama.cpp/discussions/9620