BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: Improve startup times #2677

Open CGamesPlay opened 7 months ago

CGamesPlay commented 7 months ago

The Feature

Litellm has a few heavy initialization functions that are not used 100% of the time but add considerable startup time to the package. It would be great to modify these to be lazy-loaded. Two specific resources I've noticed: the tiktoken encoding loaded with tiktoken.get_encoding("cl100k_base"), and the model cost map fetched over the network by get_model_cost_map(url=model_cost_map_url).

Motivation, pitch

Faster startup times will 1000x developer productivity while revolutionizing new use cases.

Twitter / LinkedIn details

No response

krrishdholakia commented 7 months ago

Hey @CGamesPlay, we have both cached as well. What would a good lazy loading implementation here look like?

CGamesPlay commented 7 months ago

Yes, they are cached... at startup. This loading process takes time, though, which is what would be great to improve. To see what I mean: time python -c 'import litellm'. Any program that uses litellm at all takes at least this amount of time to start up, even if it never calls a single litellm function (e.g. mytool --help has to pay this startup cost).

The fixes should be pretty simple, I think, but I haven't tested these.

# Instead of loading the encoding eagerly at import time
encoding = tiktoken.get_encoding("cl100k_base")

# Use a cached accessor that only loads on first call
import functools

import tiktoken

@functools.cache
def get_encoding():
    return tiktoken.get_encoding("cl100k_base")


# Instead of fetching the cost map eagerly at import time
model_cost = get_model_cost_map(url=model_cost_map_url)

# Use a cached accessor that only fetches on first call
@functools.cache
def get_model_cost():
    # In reality, we probably want to refactor the existing method to use a default parameter.
    # Fun fact: functools.cache methods have a property `__wrapped__` which is the uncached function.
    return get_model_cost_map(url=model_cost_map_url)
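
For illustration, call sites would then look roughly like this (a sketch only; the specific call sites and the "gpt-4" key are assumptions for the example, not litellm's actual code):

# The first call pays the loading cost; subsequent calls return the cached objects.
tokens = get_encoding().encode("hello world")
gpt4_pricing = get_model_cost().get("gpt-4")
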
CGamesPlay commented 7 months ago

I took a deeper look at this and tried to create a PR, but failed. model_cost is a public export. It's possible to make it lazy-loaded at use by subclassing dict and only loading once a key is accessed, except a bunch of other public exports also expect the model cost map to be loaded at startup, notably lists, which can't be lazy-loaded the same way. As a workaround, I could set LITELLM_LOCAL_MODEL_COST_MAP=True before I import litellm.
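
For reference, here is a minimal sketch of that subclassing-dict idea (illustrative only, not litellm's actual code; get_model_cost_map and model_cost_map_url are the existing helpers referenced above, and only the overridden methods trigger the load, which is exactly why the other startup-time consumers mentioned above still break):

class LazyModelCost(dict):
    """Defers fetching the model cost map until the dict is actually read."""

    def __init__(self):
        super().__init__()
        self._loaded = False

    def _ensure_loaded(self):
        if not self._loaded:
            self._loaded = True
            self.update(get_model_cost_map(url=model_cost_map_url))

    def __getitem__(self, key):
        self._ensure_loaded()
        return super().__getitem__(key)

    def __contains__(self, key):
        self._ensure_loaded()
        return super().__contains__(key)

    def __iter__(self):
        self._ensure_loaded()
        return super().__iter__()


# Same public name and type as before, but no network call happens at import time.
model_cost = LazyModelCost()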

I've also discovered that litellm always calls dotenv.load_dotenv() with the default configuration, meaning my application cannot disable this behavior if, say, the user requested a different env file be loaded.
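
For context, the problematic call is roughly the following (a simplified illustration of python-dotenv's default behavior, not litellm's exact code):

from dotenv import load_dotenv

# With no arguments, python-dotenv looks for a nearby ".env" file (typically relative
# to where the program is running) and writes any values it finds into os.environ,
# so a CLI invoked from an arbitrary directory can pick up an unrelated .env file.
load_dotenv()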

Considering all these, I unfortunately think litellm is unsuitable for my use case, which is a CLI application that leverages LLMs, so I'll be migrating away from it. Hopefully in the next major version (since these changes would require a major version bump) litellm will be less demanding of its execution environment (not modify os.environ, not make network requests on startup, do less unnecessary loading).

krrishdholakia commented 7 months ago

Hey @CGamesPlay, reviewing the feedback:

not make network requests on startup

I believe LITELLM_LOCAL_MODEL_COST_MAP should solve this. Let me know if not?

not modify os.environ

Can you explain this a bit - how are we modifying the os.environ?

do less unnecessary loading

What would this look like?

CGamesPlay commented 7 months ago

Sure thing. As I mentioned, my use case is a CLI utility. People will be using this from all over their system, in any directory. That said:

No ill will towards you guys, and I would love to see a unified LLM interface in python to allow me to easily swap out model providers, but as it stands litellm won't be that for my project.

thiswillbeyourgithub commented 5 months ago

I too am building a CLI utility and noticed that litellm was the main culprit for my slow startup times, including in a utility that is supposed to quickly start listening to your voice.

I am surprised to hear that it's in good part caused by the cost-loading code, as that seems to me like something that should not run on every startup but only when needed (lazy loading).

It's possible to make it lazy-loaded at use by subclassing dict and only loading once a key is accessed,

That was my first thought as well.

except a bunch of other public exports also expect the model cost map to be loaded at startup, notably lists, which can't be lazy-loaded the same way.

Ouch. That does not seem super easy to change, but it does seem worth it to me. I mean, litellm is a complete outlier compared to other libs:

❯ time python
python  0,09s user 0,06s system 8% cpu 1,669 total
❯ time python -c "import langchain ; import langchain_core ; import langchain_community"
python -c "import langchain"  0,14s user 0,06s system 100% cpu 0,193 total
❯ time python -c "import numpy"
python -c "import numpy"  0,65s user 0,61s system 326% cpu 0,388 total
❯ time python -c 'import openai'
python -c 'import openai'  1,48s user 0,18s system 99% cpu 1,669 total
❯ time python -c "import torch"
python -c "import torch"  2,73s user 0,88s system 124% cpu 2,891 total
❯ time python -c "import torchaudio"
python -c "import torchaudio"  3,66s user 1,32s system 141% cpu 3,521 total
❯ time python -c "import litellm"
python -c "import litellm"  4,27s user 0,96s system 116% cpu 4,473 total

So Litellm is 27 times slower to import than langchain, 7 times slower than numpy, 2.8 times slower than openai, 1.6 times slower than torch, and 1.2 times slower than torchaudio.

I picked these libs pretty randomly, but that certainly seems unnecessary. I can understand why that would be a deal breaker for @CGamesPlay and for many corporations.

I am still very fond of litellm and use it in probably at least a dozen of my repos, but I think this should be a pretty high priority. I didn't look at the code, but intuitively, lazy loading of the costs when the dict is accessed seems like the way to go. I don't know why some other parts of the code would need the cost dict at startup.

krrishdholakia commented 5 months ago

hey @thiswillbeyourgithub can you check if this is resolved by just using the locally cached dictionary -

export LITELLM_LOCAL_MODEL_COST_MAP="True"

https://docs.litellm.ai/docs/completion/token_usage#9-register_model
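
For anyone setting this from Python rather than the shell, a small sketch (the one assumption is that the variable has to be set before litellm is first imported, since the cost map is loaded at import time):

import os

# Must be set before the import below; litellm reads it at import time.
os.environ["LITELLM_LOCAL_MODEL_COST_MAP"] = "True"

import litellm  # uses the locally bundled cost map instead of fetching the hosted copy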

thiswillbeyourgithub commented 5 months ago

Hi, I'm sorry I somehow missed your message.

The answer is no, both are still slow:

❯ time python -c "import litellm"
python -c "import litellm"  4,49s user 1,00s system 112% cpu 4,897 total
❯ export LITELLM_LOCAL_MODEL_COST_MAP="True"
❯ time python -c "import litellm"
python -c "import litellm"  4,48s user 1,02s system 118% cpu 4,655 total

litellm version: 1.39.6

lucasgadams commented 2 weeks ago

Any update here? It is pretty bad form to be making network requests at package import time.

krrishdholakia commented 2 weeks ago

@lucasgadams this can be disabled with an env var (LITELLM_LOCAL_MODEL_COST_MAP) -

[Screenshot of the relevant docs, 2024-10-18]

The hosted copy is used to avoid requiring continuous version upgrades as new models come out.