Closed: shamanez closed this issue 1 month ago
cot_decoding is a sampling technique, so it cannot be implemented purely via an API. As you can see in the implementation, we load the model and access the token logits during the forward pass. The current implementation is in PyTorch; you can try it in the free Google Colab - https://colab.research.google.com/drive/1SpuUb8d9xAoTh32M-9wJsB50AOH54EaH?usp=sharing
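Roughly, the idea is to branch on the top-k candidates for the first token, greedily decode each branch, and keep the branch whose tokens have the highest confidence (the gap between the top-1 and top-2 probabilities). Here is a simplified sketch of that idea, not the exact repo code; the model name is just an example:

```python
# Simplified sketch of the cot_decoding idea (illustrative only, not the repo code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # example model only
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def cot_decode(prompt: str, k: int = 5, max_new_tokens: int = 256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]   # logits from the forward pass
    top_k_ids = torch.topk(next_token_logits, k).indices    # k alternative first tokens

    best_confidence, best_text = -1.0, ""
    for token_id in top_k_ids:
        branch = torch.cat([inputs.input_ids, token_id.view(1, 1)], dim=-1)
        out = model.generate(
            branch,
            attention_mask=torch.ones_like(branch),
            max_new_tokens=max_new_tokens,
            do_sample=False,                  # greedy decoding within each branch
            output_scores=True,
            return_dict_in_generate=True,
        )
        # Confidence = average gap between the top-1 and top-2 probabilities per step.
        gaps = []
        for step_scores in out.scores:
            top2 = torch.topk(torch.softmax(step_scores[0].float(), dim=-1), 2).values
            gaps.append((top2[0] - top2[1]).item())
        confidence = sum(gaps) / max(len(gaps), 1)
        if confidence > best_confidence:
            best_confidence = confidence
            best_text = tokenizer.decode(
                out.sequences[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
            )
    return best_text, best_confidence
```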
Yeah, exactly. I've already tried the notebook and it's amazing. There are also some other sampling methods that are getting popular, for example entropy-based methods.
What I like about the amazing OptILLM is that we can plug this into the API and it's super production-friendly. My question is: do you have any thoughts on integrating sampling methods?
Yes, I am open to adding more sampling techniques. The challenge is that PyTorch implementations of such techniques may not be that efficient, so someone will need to implement them for different backends like vLLM, llama.cpp, etc. I had a look at the entropy work; once their repo is a bit more established, I can implement the basic idea in optillm as well, similar to cot_decoding.
You can actually integrate any technique via a plugin, btw: just implement a run function in a Python script, define a slug, and put it in the /plugins folder. You can then use it from optillm like the existing techniques. I have implemented the privacy, memory, and code execution plugins as examples in that folder.
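For example, a minimal plugin might look roughly like this (a sketch only; the slug, file name, and post-processing are made up, and the run signature follows the bundled plugins):

```python
# Hypothetical file: optillm/plugins/uppercase_plugin.py (illustrative sketch only)
SLUG = "uppercase"

def run(system_prompt: str, initial_query: str, client, model: str):
    # Call the underlying model through the OpenAI-compatible client passed in by the proxy.
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query},
        ],
    )
    # Trivial post-processing step standing in for your actual technique.
    text = response.choices[0].message.content.upper()
    completion_tokens = response.usage.completion_tokens
    return text, completion_tokens
```

Drop it into the /plugins folder and then select it from the client like the other techniques.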
Legend!!!!
Could you please share one example of how to use a plugin here? :) Sorry if I bother you with dumb questions.
I added a short note on that on the wiki - https://github.com/codelion/optillm/wiki/Build-your-own-plugin
Let me know if it is not clear or you run into issues, and I can expand on that.
Thanks a lot
@codelion I implemented a plugin, but AFAIU we then need to load the model locally, right?
What does your plugin do? If it is loading a model, then yes; otherwise you can just call the underlying model with the client that is passed into the run method. We show how to do that in the memory plugin - https://github.com/codelion/optillm/blob/193ab3c4d54f5f2e2c47525293bd7827b609675f/optillm/plugins/memory_plugin.py#L126
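Something along these lines (a rough sketch of the pattern, not the memory plugin's actual code):

```python
# Inside a plugin's run(): call the proxied model via the client that optillm
# passes in, instead of loading any model locally (sketch only).
def run(system_prompt: str, initial_query: str, client, model: str):
    completion_tokens = 0

    # Example: a first call to extract key facts, a second call to answer with them.
    extract = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Extract the key facts from the text."},
            {"role": "user", "content": initial_query},
        ],
    )
    completion_tokens += extract.usage.completion_tokens
    facts = extract.choices[0].message.content

    answer = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Facts:\n{facts}\n\nQuestion:\n{initial_query}"},
        ],
    )
    completion_tokens += answer.usage.completion_tokens
    return answer.choices[0].message.content, completion_tokens
```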
I need to use the model to get the logits. So basically I need to use model.generate()
Ideally, this is what I want to do:
```python
# Assuming that `model` is a dictionary containing both 'model' and 'tokenizer'
proxy_model = model.get('model')
proxy_tokenizer = model.get('tokenizer')
## rest of the sampling code
```
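Roughly, the full plugin I have in mind would look like this (just a sketch; the model name and slug are placeholders):

```python
# Rough sketch of the plugin I have in mind (model name and slug are placeholders).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SLUG = "logit_sampling"

# Loaded once at import time so we don't reload on every request.
MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # example only
proxy_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
proxy_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)

def run(system_prompt: str, initial_query: str, client, model: str):
    prompt = f"{system_prompt}\n\n{initial_query}"
    inputs = proxy_tokenizer(prompt, return_tensors="pt").to(proxy_model.device)
    out = proxy_model.generate(
        **inputs,
        max_new_tokens=256,
        output_scores=True,              # per-step logits for the sampling logic
        return_dict_in_generate=True,
    )
    ## rest of the sampling code would use out.scores here
    text = proxy_tokenizer.decode(
        out.sequences[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
    )
    return text, len(out.scores)  # number of generated tokens as the token count
```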
@codelion :)
Yes, this will require loading the model. If you run it on a machine with a GPU, it may still be fast enough, depending on how big your model is.
Yeah, I understand. I first thought we could do a workaround and load the model where we host the server endpoint, but it seems like that's a bit difficult, right?
So, as you suggested, the best thing would be to come up with some other API that returns the logits.
Is my understanding correct?
It's not about the API. Doing what you are suggesting requires interfacing with the model during the forward pass, so we cannot do it with closed models like gpt-4o or Gemini that are only available via an API. We can implement it in PyTorch for open-weight models like llama3.2, but then, yes, we will need to do inference within the proxy, so it will be slow unless you run it on a machine with a GPU.
Yes, I understand. We can't do this with closed models since they don't expose logits. In my case, I use a custom llama3.1 model, but my hosted model is still wrapped with the OpenAI API. So what I will do is run the proxy on a GPU. Thanks a lot, and your proposed method is working.
Is there any reason why this is still not included?
I am talking about this method - https://github.com/codelion/optillm/blob/main/optillm/cot_decoding.py
I can see this method is not available for the proxy. Is it because of the OpenAI client?
Do you suggest any workaround for this?