BerriAI / litellm

Python SDK, Proxy Server to call 100+ LLM APIs using the OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: reuse vertex_ai client #1938

Closed · dumbPy closed this 3 months ago

dumbPy commented 7 months ago

The Feature

Vertex AI seems to have a per-call overhead (probably auth related?), so the client needs to be reused for faster responses.

Below is the experiment I did, comparing generation with a fresh client on every call (as litellm does) against reusing the model (and the client inside it) with langchain.

Note that the time is higher with litellm since the input is not exactly the same (a difference in role handling, probably?), but the mean and standard deviation are consistent.

from langchain_community.chat_models.vertexai import ChatVertexAI
from litellm import completion

with litellm

%%timeit -n 1 -r 10
completion(model="gemini-pro", messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}])

4.64 s ± 1.05 s per loop (mean ± std. dev. of 10 runs, 1 loop each)

with langchain (fresh client)

%%timeit -n 1 -r 10
ChatVertexAI(model_name="gemini-pro").invoke("write code for saying hi from LiteLLM")

2.44 s ± 102 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

with langchain (reuse model)

model = ChatVertexAI(model_name="gemini-pro")
%%timeit -n 1 -r 1
model.invoke("write code for saying hi from LiteLLM")

2.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

%%timeit -n 1 -r 10
model.invoke("write code for saying hi from LiteLLM")

1.41 s ± 311 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
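For context, the same fresh-vs-reused comparison can be sketched against the Vertex AI SDK directly (illustration only, not part of the benchmark above; the project and location values are placeholders):

import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="my-gcp-project", location="us-central1")  # placeholder project/location

def fresh_call(prompt):
    # Builds a new GenerativeModel (and underlying client) on every call,
    # paying the auth/channel setup cost each time.
    return GenerativeModel("gemini-pro").generate_content(prompt).text

model = GenerativeModel("gemini-pro")

def reused_call(prompt):
    # Reuses the same model/client, so only the request itself is timed.
    return model.generate_content(prompt).text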

Motivation, pitch

Vertex AI seems to have a per-call overhead (probably auth related?), so the client needs to be reused for faster responses.

Twitter / LinkedIn details

https://www.linkedin.com/in/sufiyanadhikari/

krrishdholakia commented 7 months ago

Hey @dumbPy, it's weird that you were seeing that time difference for the client. Translation shouldn't be adding 2s.

I'll investigate this on our end as well. And yes, I do agree - we can definitely reuse the Vertex AI client.

dumbPy commented 7 months ago

@krrishdholakia The outputs for the same input are different: with litellm I am consistently getting longer output, and hence the time is higher. That might be because of differences in prompt translation and temperature, and can be looked at separately.

The only thing I am pointing out here is the difference between the fresh-client and reused-client timings in langchain: with a fresh client it's 2.44 s ± 102 ms, while with a reused client it drops to 1.41 s ± 311 ms.

arnaud-secondlayer commented 6 months ago

Any progress on this issue? Do you have an architecture in mind to perform client caching? Would be happy to help once the architecture is decided.

krrishdholakia commented 6 months ago

Hey @arnaud-secondlayer, the place to add this would be in set_client in the router, where we already do this for the OpenAI / Azure clients - https://github.com/BerriAI/litellm/blob/5edb703d781a9a7a2d9ba98205669eb9d95a1680/litellm/router.py#L1740
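For a rough sense of the shape of the change (a sketch only, not litellm's actual set_client code; all names and values below are hypothetical placeholders), the idea is to cache one Vertex AI client per deployment key so repeated calls skip the auth/channel setup:

from functools import lru_cache

import vertexai
from vertexai.generative_models import GenerativeModel

@lru_cache(maxsize=None)
def get_vertex_model(model_name, project, location):
    # vertexai.init is global state; a real router implementation would scope
    # credentials per deployment rather than relying on this sketch's shortcut.
    vertexai.init(project=project, location=location)
    return GenerativeModel(model_name)

def vertex_completion(prompt, model_name="gemini-pro",
                      project="my-gcp-project", location="us-central1"):
    model = get_vertex_model(model_name, project, location)  # reused after the first call
    return model.generate_content(prompt).text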

Happy to do a quick call this week to talk through this, if that helps - https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat

ishaan-jaff commented 3 months ago

This is fixed in 1.40.2 - we now cache Vertex AI clients. @arnaud-secondlayer @dumbPy
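For anyone who wants to sanity-check the change locally, a rough timing loop like the one below (assumptions: the same gemini-pro call as above, litellm >= 1.40.2, timings will vary) should show the per-call overhead dropping after the first request, since the Vertex AI client is reused:

import time
from litellm import completion

for i in range(3):
    start = time.perf_counter()
    # Same call as in the original benchmark; the first iteration creates the
    # client, later iterations should reuse the cached one.
    completion(model="gemini-pro",
               messages=[{"role": "user", "content": "write code for saying hi from LiteLLM"}])
    print(f"call {i}: {time.perf_counter() - start:.2f}s")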