Closed dumbPy closed 3 months ago
Hey @dumbPy it's weird you were seeing the time diff for the client. Translation shouldn't be adding 2s.
I'll investigate this on our end as well. And yes - I do agree - we can definitely reuse the Vertex AI client.
@krrishdholakia The outputs for the same input are different. In litellm I am consistently getting longer output, hence the higher time. That might be due to different prompt translation and temperature, and can be looked at separately.
The only thing I am pointing out here is the difference between fresh-client and reused-client time in langchain. With a fresh client it's 2.44 s ± 102 ms,
while with a reused client it drops to 1.41 s ± 311 ms.
Any progress on this issue? Do you have an architecture in mind to perform client caching? Would be happy to help once the architecture is decided.
hey @arnaud-secondlayer, the place to add this would be in set_client
in router, where we already do this for the openai / azure clients - https://github.com/BerriAI/litellm/blob/5edb703d781a9a7a2d9ba98205669eb9d95a1680/litellm/router.py#L1740
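The caching pattern being suggested could look roughly like this - build the client once per deployment and look it up on later requests, keyed on whatever determines which client you need. This is only an illustrative sketch, not litellm's actual `set_client` code: the names `get_vertex_client`, `_cache_key`, and the `factory` callback are all hypothetical.

```python
# Hypothetical sketch of client caching, mirroring the idea behind
# router.set_client for OpenAI/Azure: create the client once, reuse it
# on every subsequent request with the same project/location/credentials.
import hashlib
import json

_client_cache: dict = {}

def _cache_key(project: str, location: str, credentials: dict) -> str:
    # Key on everything that changes which client we need.
    raw = json.dumps(
        {"project": project, "location": location, "credentials": credentials},
        sort_keys=True,
    )
    return hashlib.sha256(raw.encode()).hexdigest()

def get_vertex_client(project, location, credentials, factory):
    """Return a cached client, creating it via `factory` on first use."""
    key = _cache_key(project, location, credentials)
    if key not in _client_cache:
        # Only pay the (auth-related?) construction cost once per key.
        _client_cache[key] = factory(project, location, credentials)
    return _client_cache[key]
```

Same key returns the same client object; a different project/location/credentials combination triggers one fresh construction.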
Happy to do a quick call this week to talk through this, if that helps - https://calendly.com/d/4mp-gd3-k5k/litellm-1-1-onboarding-chat
this is fixed on 1.40.2 - we cache vertex ai clients @arnaud-secondlayer @dumbPy
The Feature
Vertex AI seems to have per-request overhead (probably auth related?), so the client needs to be reused for faster responses.
Below is an experiment I ran, regenerating with a fresh client (as litellm does) versus reusing the model (and its client internally) in langchain.
Note that the time is higher in litellm since the input is not exactly the same (a difference in roles, probably?), but the mean and standard deviation are consistent.
with litellm
4.64 s ± 1.05 s per loop (mean ± std. dev. of 10 runs, 1 loop each)
with langchain (fresh client)
2.44 s ± 102 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
with langchain (reuse model)
2.46 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
1.41 s ± 311 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
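The fresh-vs-reused comparison above can be reproduced in miniature with a fake client whose constructor simulates the per-request setup cost. This is purely illustrative - `FakeClient` and the 10 ms sleep are stand-ins, not the real Vertex AI client or its actual overhead.

```python
# Minimal repro of the benchmark: constructing a client per call vs
# reusing one client. The sleep in __init__ simulates auth/setup cost.
import time
import timeit

class FakeClient:
    def __init__(self):
        time.sleep(0.01)  # stand-in for auth / channel setup overhead
    def generate(self):
        return "ok"

def fresh_call():
    # New client on every request (litellm behavior before client caching).
    return FakeClient().generate()

_reused = FakeClient()  # constructed once, outside the timed loop
def reused_call():
    # Reuse the same client (langchain-style / cached-client behavior).
    return _reused.generate()

fresh = timeit.timeit(fresh_call, number=20)
reused = timeit.timeit(reused_call, number=20)
```

With the simulated overhead, `fresh` carries the construction cost on every iteration while `reused` pays it only once up front, which is the same shape as the 2.44 s vs 1.41 s numbers above.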
Motivation, pitch
Vertex AI seems to have per-request overhead (probably auth related?), so the client needs to be reused for faster responses.
Twitter / LinkedIn details
https://www.linkedin.com/in/sufiyanadhikari/