codelion / optillm

Optimizing inference proxy for LLMs
Apache License 2.0

I can see the cot_decode method has been implemented, but we can't use it with the proxy. #59

Closed shamanez closed 1 month ago

shamanez commented 1 month ago

Is there any reason why this is still not included?

I am talking about this method - https://github.com/codelion/optillm/blob/main/optillm/cot_decoding.py

I can see this method is not available via the proxy. Is it because of the OpenAI client?

Do you suggest any workaround for this?

codelion commented 1 month ago

cot_decoding is a sampling technique; it cannot be implemented via the API. You can see in the implementation that we load the model and access the token logits during the forward pass. The current implementation is in PyTorch; you can try it in the free Google Colab - https://colab.research.google.com/drive/1SpuUb8d9xAoTh32M-9wJsB50AOH54EaH?usp=sharing
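For context, here is a minimal sketch of the kind of logit access this requires, using Hugging Face transformers. The model name and prompt are purely illustrative, and this is not optillm's actual cot_decoding implementation; it only shows why an OpenAI-style chat API is not enough.

```python
# Minimal sketch (not optillm's implementation): sampling techniques like
# cot_decoding need the raw logits from the model's forward pass, which an
# OpenAI-style chat API does not expose.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("The capital of France is", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Logits for the next token, shape (vocab_size,). CoT-decoding-style methods
# roughly branch over the top-k candidates for the first token and score each
# path's confidence from distributions like this one.
next_token_logits = outputs.logits[0, -1, :]
top_k = torch.topk(next_token_logits, k=10)
print(tokenizer.convert_ids_to_tokens(top_k.indices.tolist()))
```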

shamanez commented 1 month ago

Yeah, exactly. I've already tried the notebook and it's amazing. Also, there are some other sampling methods that are getting popular. For example: Entropy-based methods.

What I like about the amazing OptILLM is that we can plug this into the API and it's super production-friendly. My question is: do you have any thoughts on integrating sampling methods?

codelion commented 1 month ago

Yes, I am open to adding more sampling techniques. The challenge is that PyTorch implementations of such techniques may not be the most efficient, so someone will need to implement them for different backends like vLLM, llama.cpp, etc. I had a look at the entropy work; once their repo is a bit more established, I can implement the basic idea in optillm as well, similar to cot decoding.

You can actually integrate any technique via a plugin, by the way: just implement a run function in a Python script, define a slug, and put it in the /plugins folder. You can then use it from optillm like the existing techniques. I have implemented the privacy, memory, and code execution plugins as examples in that folder.
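For reference, a minimal plugin skeleton might look like the sketch below. The SLUG constant and the run() signature and return value are assumptions based on this thread; mirror one of the existing plugins (privacy, memory, code execution) and the wiki for the exact interface.

```python
# optillm/plugins/example_plugin.py -- hypothetical skeleton, not a real plugin.
# The SLUG name and run() signature here are assumptions; check an existing
# plugin such as memory_plugin.py for the real interface.

SLUG = "example"

def run(system_prompt: str, initial_query: str, client, model: str):
    # A real plugin would do its work here: call the underlying model via
    # `client`, or run its own processing, and return the final text plus
    # a completion-token count.
    response_text = f"[{SLUG}] {initial_query}"
    completion_tokens = 0
    return response_text, completion_tokens
```

Once the file sits in the /plugins folder, the plugin can be selected by its slug like the built-in techniques.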

shamanez commented 1 month ago

> You can actually integrate any technique via a plugin, by the way: just implement a run function in a Python script, define a slug, and put it in the /plugins folder. You can then use it from optillm like the existing techniques. I have implemented the privacy, memory, and code execution plugins as examples in that folder.

Legend!!!!

Could you please share an example of how to use a plugin here? :) Sorry if I bother you with dumb questions.

codelion commented 1 month ago

I added a short note about that on the wiki - https://github.com/codelion/optillm/wiki/Build-your-own-plugin

Let me know if it is not clear or you run into issues, and I can expand on that.

shamanez commented 1 month ago

Thanks a lot

shamanez commented 1 month ago

@codelion I implemented a plugin, but AFAIU, we then need to load the model locally, right?

codelion commented 1 month ago

What does your plugin do? If it is loading a model, then yes; otherwise you can just call the underlying model with the client that is passed into the run method. We show how to do that in the memory plugin - https://github.com/codelion/optillm/blob/193ab3c4d54f5f2e2c47525293bd7827b609675f/optillm/plugins/memory_plugin.py#L126
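A hedged sketch of that second path, delegating generation to the proxied model through the client that optillm passes into run (loosely modeled on the memory plugin linked above; the exact signature is an assumption):

```python
# Hypothetical run() that does not load any model itself: it forwards the
# request to the proxied model via the OpenAI-compatible client optillm
# passes in, then returns the text and completion-token count.
def run(system_prompt: str, initial_query: str, client, model: str):
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": initial_query},
        ],
    )
    return response.choices[0].message.content, response.usage.completion_tokens
```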

shamanez commented 1 month ago

I need to use the model to get the logits. So basically I need to use model.generate()

Ideally, this is what I want to do:

        # Assuming `model` is a dictionary containing both 'model' and 'tokenizer'
        proxy_model = model.get('model')
        proxy_tokenizer = model.get('tokenizer')

        # ... rest of the sampling code

@codelion :)
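Continuing that sketch, the per-step logits can come from generate() itself. This assumes Hugging Face transformers and the hypothetical proxy_model / proxy_tokenizer names from the snippet above:

```python
# Hypothetical continuation of the sketch above: get per-step scores
# (the logits for each generated token) out of generate().
inputs = proxy_tokenizer("Why is the sky blue?", return_tensors="pt")
outputs = proxy_model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
    return_dict_in_generate=True,
    output_scores=True,  # one (batch, vocab_size) tensor per generated step
)
step_scores = outputs.scores       # tuple of logit tensors, one per new token
generated_ids = outputs.sequences  # prompt + generated token ids
```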

codelion commented 1 month ago

Yes, this will require loading the model. If you run it on a machine with a GPU, it may still be fast enough, depending on how big your model is.

shamanez commented 1 month ago


Yeah, I understand. I first thought we could do a workaround to load the model where we have the server endpoint. But it seems like it's a bit difficult, right?

So, as you suggested, the best thing would be to come up with some other API that returns the logits.

Is my understanding correct?

codelion commented 1 month ago

It’s not about the API. To do what you are suggesting requires interfacing with the model during the forward pass, so we cannot do it with closed models like gpt-4o or Gemini that are only available via an API. We can implement it using PyTorch for open-weight models like llama3.2, but then, yes, we will need to do inference within the proxy. So it will be slow unless you run it on a machine with a GPU.

shamanez commented 1 month ago

Yes, I understand. We can't do this with closed models since they don't give us logits. In my case, I use a custom llama3.1 model, but my hosted model is still wrapped with the OpenAI API, so what I would do is run the proxy on a GPU. Thanks a lot; your proposed method is working.