Yiannis128 opened this issue 1 year ago
@Yiannis128 I'm the maintainer of LiteLLM. We allow you to increase your throughput by load balancing between multiple deployments (Azure, OpenAI). I'd love your feedback, especially if this does not solve your problem.
Here's how to use it. Docs: https://docs.litellm.ai/docs/routing
import os

from litellm import Router

model_list = [{  # list of model deployments
    "model_name": "gpt-3.5-turbo",  # model alias
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-v-2",  # actual model name
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "azure/chatgpt-functioncalling",
        "api_key": os.getenv("AZURE_API_KEY"),
        "api_version": os.getenv("AZURE_API_VERSION"),
        "api_base": os.getenv("AZURE_API_BASE")
    }
}, {
    "model_name": "gpt-3.5-turbo",
    "litellm_params": {  # params for litellm completion/embedding call
        "model": "vllm/TheBloke/Marcoroni-70B-v1-AWQ",
        "api_key": os.getenv("OPENAI_API_KEY"),
    }
}]

router = Router(model_list=model_list)

# openai.ChatCompletion.create replacement
response = router.completion(model="gpt-3.5-turbo",
                             messages=[{"role": "user", "content": "Hey, how's it going?"}])
print(response)
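Note that all three deployments above share the model_name alias "gpt-3.5-turbo", so the Router treats them as one pool and distributes requests across the two Azure deployments and the vLLM-hosted model.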
Hi, thanks for the suggestion. Before I look at this, I would like to ask: do you have a Hugging Face model uploaded? I already have Hugging Face model support.
I will still look at it if you don't, but if you do it will be much easier to implement.
Yes, we support Hugging Face LLMs - are you trying to load balance between Hugging Face endpoints?
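For reference, a Hugging Face text-generation-inference endpoint could be added as another deployment in the same model_list, roughly along these lines (a sketch only; the repo id, endpoint URL, and environment variable name below are placeholders):

import os

from litellm import Router

# Sketch: a TGI-hosted Hugging Face model registered under the same alias,
# so the Router can load balance it with the other deployments.
hf_model_list = [{
    "model_name": "gpt-3.5-turbo",  # reuse the shared alias
    "litellm_params": {
        "model": "huggingface/HuggingFaceH4/zephyr-7b-beta",  # huggingface/<repo-id> (placeholder)
        "api_base": "https://my-tgi-endpoint.example.com",    # your TGI / Inference Endpoint URL (placeholder)
        "api_key": os.getenv("HUGGINGFACE_API_KEY"),
    },
}]

router = Router(model_list=hf_model_list)
response = router.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hey, how's it going?"}],
)
print(response)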
No, I only ask because I already have an interface for adding text-generation-inference-compatible models through Hugging Face. So it's great that you support it! Is this an alternative to LangChain? Could you tell me what advantages it offers?
LangChain had some limitations when I implemented it (not sure about now), so I'm weighing the cost/benefit of switching :)
When ESBMC produces an output (counterexample) that is too large, the prompt exceeds the token limit the LLM can accept. Currently, since switching to LangChain, we don't measure the token count or check whether the limit has been exceeded. When the error occurs, LangChain simply reports it; a sketch of a possible pre-flight check follows, then the example output.
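A possible pre-flight check would be to count the prompt tokens before calling the LLM. A minimal sketch, assuming tiktoken is used for counting and an illustrative 4096-token context window (both the limit and the helper name are placeholders, not what the project currently does):

import tiktoken

CONTEXT_WINDOW = 4096       # illustrative limit; depends on the model
RESERVED_FOR_REPLY = 512    # leave room for the LLM's answer

def fits_in_context(messages: list[dict], model: str = "gpt-3.5-turbo") -> bool:
    """Return True if the prompt (e.g. the ESBMC counterexample) fits in the context window."""
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    return prompt_tokens + RESERVED_FOR_REPLY <= CONTEXT_WINDOW

messages = [{"role": "user", "content": "<counterexample text here>"}]
if not fits_in_context(messages):
    # e.g. truncate or summarise the counterexample before sending it
    raise ValueError("Counterexample is too large for the model's context window")

This does not account for the extra tokens LangChain's prompt templates add, so a real check would need some additional margin.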
Example (From the FormAI dataset FormAI_92991.c):