BerriAI / litellm

Python SDK, Proxy Server (LLM Gateway) to call 100+ LLM APIs in OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Bug]: It seems that a memory leak is occurring in Async Streaming Completion (acompletion). #6404

Open tyler-liner opened 22 hours ago

tyler-liner commented 22 hours ago

What happened?

fastapi==0.115.2 langfuse==2.45.0 litellm==1.50.1

Hello,

I am in the process of developing an LLM-related application using FastAPI + LiteLLM + Langfuse. However, I have noticed that memory usage increases continuously as more requests are processed. Upon further investigation, I observed that the memory consumption grows with each call to LiteLLM's acompletion.

Since the application code involves multiple complex packages, I am attaching a minimal reproducible example that demonstrates the increase in memory usage.

import litellm
from fastapi import FastAPI
from memory_profiler import profile
from pydantic import BaseModel

app = FastAPI()

class ExampleRequest(BaseModel):
    query: str

@app.post("/debug")
async def debug(body: ExampleRequest) -> str:
    return await main_logic(body.query)

@profile
async def main_logic(query) -> str:
    stream = await litellm.acompletion(
        model="gpt-4o-mini",
        api_key="<api_key>",  # placeholder
        messages=[{"role": "user", "content": query}],
        stream=True,
    )
    result = ""
    async for chunk in stream:
        result += chunk.choices[0].delta.content or ""
    return result
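
For context, the app above can be served with uvicorn; the exact invocation below is an assumption (the snippet is assumed to be saved as app.py):

# Assumed entry point: equivalent to running `uvicorn app:app` from a terminal.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app:app", host="127.0.0.1", port=8000)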

By running the server as shown above and continuously sending requests, the memory usage increases linearly.

When observing with memory_profiler, it is noticeable that the memory usage increases by 0-0.1 MiB at the stream = await litellm.acompletion line and by 0.4-0.5 MiB at the async for chunk in stream line with every request. This can be reproduced by continuously sending requests containing the same query at 1-second intervals.
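
For reference, a request loop along these lines is enough to reproduce the growth (a minimal sketch, not the exact script used; the host/port, query text, and choice of the requests library are assumptions, and any HTTP client sending the same request once per second behaves the same way):

import time

import requests  # any HTTP client works; requests is assumed here

while True:
    requests.post(
        "http://127.0.0.1:8000/debug",          # assumed uvicorn default port
        json={"query": "Hello, how are you?"},  # same query on every request
        timeout=60,
    )
    time.sleep(1)  # 1-second interval between requests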

Is there an appropriate solution for this? About 0.5 MiB of memory keeps accumulating per request.

Thank you.


ishaan-jaff commented 19 hours ago

hi @tyler-liner

tyler-liner commented 10 hours ago

hi @ishaan-jaff

In the actual program built using LiteLLM, I am using a router, and the same issue occurs. The code attached in the issue description is a minimal reproducible example where the memory leak occurs.

  • what does your memory profiler show as increasing your memory usage? typically memory profilers show the block of code allocating memory

When observing with memory_profiler, it is noticeable that the memory usage increases by 0-0.1 MiB at the stream = await litellm.acompletion line and by 0.4-0.5 MiB at the async for chunk in stream line with every request. Below is the output from the profiler. Even though the requests were sent at intervals, the memory usage continues to accumulate.

INFO:     127.0.0.1:54455 - "POST /debug HTTP/1.1" 200 OK
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    45    414.2 MiB    414.2 MiB           1   @profile
    46                                         async def main_logic(query) -> str:
    47    414.3 MiB      0.1 MiB          15       stream = await router.acompletion(
    48    414.2 MiB      0.0 MiB           1           model="gpt-4o-mini",
    49    414.2 MiB      0.0 MiB           1           api_key=config.openai_api_keys[-1],
    50    414.2 MiB      0.0 MiB           1           messages=[{"role": "user", "content": query}],
    51    414.2 MiB      0.0 MiB           1           stream=True,
    52                                             )
    53    414.3 MiB      0.0 MiB           1       result = ""
    54    415.3 MiB      1.0 MiB         491       async for chunk in stream: 
    55    415.1 MiB      0.0 MiB         428           result += chunk.choices[0].delta.content or ""
    56
    57    415.3 MiB      0.0 MiB           1       return result

INFO:     127.0.0.1:54455 - "POST /debug HTTP/1.1" 200 OK
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    45    415.3 MiB    415.3 MiB           1   @profile
    46                                         async def main_logic(query) -> str:
    47    415.3 MiB      0.0 MiB           8       stream = await router.acompletion(
    48    415.3 MiB      0.0 MiB           1           model="gpt-4o-mini",
    49    415.3 MiB      0.0 MiB           1           api_key=config.openai_api_keys[-1],
    50    415.3 MiB      0.0 MiB           1           messages=[{"role": "user", "content": query}],
    51    415.3 MiB      0.0 MiB           1           stream=True,
    52                                             )
    53    415.3 MiB      0.0 MiB           1       result = ""
    54    415.6 MiB      0.4 MiB         476       async for chunk in stream:
    55    415.5 MiB      0.0 MiB         411           result += chunk.choices[0].delta.content or ""
    56
    57    415.6 MiB      0.0 MiB           1       return result
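
To narrow down which call sites actually retain memory across requests, a tracemalloc snapshot diff can complement memory_profiler's per-line view. The sketch below is not from the original thread; it reuses the model, placeholder API key, and streaming loop from the minimal example above:

import asyncio
import tracemalloc

import litellm

async def probe_one_request(query: str) -> None:
    # Compare heap allocations before and after a single streamed completion.
    before = tracemalloc.take_snapshot()

    stream = await litellm.acompletion(
        model="gpt-4o-mini",
        api_key="<api_key>",  # placeholder, as in the example above
        messages=[{"role": "user", "content": query}],
        stream=True,
    )
    async for chunk in stream:
        _ = chunk.choices[0].delta.content or ""

    after = tracemalloc.take_snapshot()
    # Call sites whose retained allocations grew the most during this request.
    for stat in after.compare_to(before, "traceback")[:10]:
        print(stat)
        for frame_line in stat.traceback.format():
            print("   ", frame_line)

async def main() -> None:
    tracemalloc.start(25)  # keep up to 25 frames per allocation traceback
    for _ in range(5):     # repeat to see which call sites keep growing
        await probe_one_request("Hello, how are you?")

if __name__ == "__main__":
    asyncio.run(main())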

  • does the memory leak occur without langfuse on?

Yes, the memory leak occurs even when using only the code above, without Langfuse.

FYI, the graph below shows the memory usage when one user repeatedly sends the same request to the server running the code above. (In the test, requests were sent only after receiving a response, so no concurrent requests were being processed.)

[Graph: server memory usage over repeated requests, increasing steadily]