BerriAI / litellm

Python SDK, Proxy Server to call 100+ LLM APIs using the OpenAI format - [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, Replicate, Groq]
https://docs.litellm.ai/docs/

[Feature]: S3 Scratch Bucket + 303 Redirect for responses #3684

Closed. Manouchehri closed this issue 3 months ago.

Manouchehri commented 4 months ago

The Feature

Instead of returning the response body directly to the user, upload it to a fast S3-compatible scratch bucket (like GCS or R2) and return a presigned URL to the client via a 303 redirect. This would only work for non-streaming responses.

Following redirects has been supported in OpenAI's Python client since https://github.com/openai/openai-python/pull/1100, and fetch in JavaScript follows redirects by default.
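
For illustration, here is a minimal sketch of the intended server-side flow (not LiteLLM code: the bucket name, key scheme, and call_upstream stub are made up, and boto3 is assumed for the S3-compatible client). The PoC below only exercises the redirect-following half of this.

import json
import uuid

import boto3
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()
s3 = boto3.client("s3")  # or any S3-compatible endpoint (R2, GCS interop)
BUCKET = "litellm-scratch"  # hypothetical scratch bucket

async def call_upstream(request: dict) -> dict:
    # Placeholder: in LiteLLM this would be the real (non-streaming) completion call.
    return {"id": "chatcmpl-example", "object": "chat.completion", "choices": []}

@app.post("/chat/completions")
async def chat_completions(request: dict):
    response_body = await call_upstream(request)
    key = f"responses/{uuid.uuid4()}.json"
    # Write the completed response to the scratch bucket.
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=json.dumps(response_body).encode(),
        ContentType="application/json",
    )
    # Presign a short-lived GET URL; the client fetches it immediately.
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=300,
    )
    # 303 tells the client to re-issue the request as a GET against the presigned URL.
    return RedirectResponse(url=url, status_code=303)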

PoC:

from fastapi import FastAPI
from fastapi.responses import RedirectResponse

app = FastAPI()

# Always answer with a 303 so the client's follow-up GET can be observed at webhook.site.
@app.post("/chat/completions")
async def redirect_to_webhook():
    return RedirectResponse(url="https://webhook.site/removed-removed-removed-removed-removed", status_code=303)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="localhost", port=8000)

Then, running this client against the PoC server:

#!/usr/bin/env python3.11
# -*- coding: utf-8 -*-
# Author: David Manouchehri

import asyncio
import openai
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)

c_handler = logging.StreamHandler()
logger.addHandler(c_handler)

client = openai.AsyncOpenAI(
    api_key="FAKE",
    base_url="http://localhost:8000",
)

async def main():
    response = await client.chat.completions.create(
        model="gemini-1.5-pro-preview-0409",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What’s in this image?"
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                        }
                    }
                ]
            }
        ],
        temperature=0.0,
    )

    logger.info(response)

if __name__ == "__main__":
    asyncio.run(main())

the redirect is followed and webhook.site receives a GET request with these headers:

connection: close
x-stainless-async: async:asyncio
x-stainless-runtime-version: 3.11.9
x-stainless-runtime: CPython
x-stainless-arch: arm64
x-stainless-os: MacOS
x-stainless-package-version: 1.28.0
x-stainless-lang: python
user-agent: AsyncOpenAI/Python 1.28.0
content-type: application/json
accept: application/json
accept-encoding: gzip, deflate, br
host: webhook.site
content-length: 
Content-Type: application/json

Note to self: do not presign the GET URL with the authorization header.
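
Since the captured GET above carries no Authorization header, the presigned URL has to be valid with nothing but its query-string signature. A quick sanity check (a sketch assuming boto3, configured AWS credentials, and a hypothetical bucket/key; with SigV4 the X-Amz-SignedHeaders parameter should list only host):

from urllib.parse import parse_qs, urlsplit

import boto3
from botocore.config import Config

# Hypothetical bucket and key, purely for illustration.
s3 = boto3.client("s3", config=Config(signature_version="s3v4"))
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "litellm-scratch", "Key": "responses/example.json"},
    ExpiresIn=300,
)
print(parse_qs(urlsplit(url).query).get("X-Amz-SignedHeaders"))  # typically ['host']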

Motivation, pitch

For large responses:

  1. This might reduce the load on LiteLLM.
  2. With a slow client, LiteLLM would not need to keep the connection open, e.g. making scaling on serverless platforms more efficient.

Twitter / LinkedIn details

https://www.linkedin.com/in/davidmanouchehri/

Manouchehri commented 3 months ago

Not worth it.