bricks-cloud / BricksLLM

🔒 Enterprise-grade API gateway that helps you monitor and impose cost or rate limits per API key. Get fine-grained access control and monitoring per user, application, or environment. Supports OpenAI, Azure OpenAI, Anthropic, vLLM, and open-source LLMs.
https://trybricks.ai/
MIT License

Investigate issues related to the streaming mode #23

Closed. spikelu2016 closed this issue 10 months ago

spikelu2016 commented 10 months ago
I noticed that [streaming mode](https://platform.openai.com/docs/api-reference/streaming) for chat completions is much less fluent when going through the proxy compared to the normal OpenAI API. The normal API delivers many chunks per second (roughly 5-10) to the client, while the proxy seems to update the response only about once per second, without regard for individual chunks. Would it be complicated to fix that?

(I only took a short look at the [implementation](https://github.com/bricks-cloud/BricksLLM/blob/325f1d88315411e75ac9aadf7c96b468b37eb66e/internal/server/web/proxy.go#L770-L826); maybe the buffer size is too large, or synchronous cost estimation takes too much time? Of course I don't have deeper knowledge of your codebase, even though it appears nice to read :slight_smile: )
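
For illustration only, here is a minimal Go sketch of the kind of per-chunk forwarding being asked for: each SSE line from the upstream response is written to the client and flushed immediately, instead of being batched by buffering or delayed by synchronous cost estimation. This is not BricksLLM's actual handler; the function name `streamProxy`, the headers, and the buffer sizes are assumptions.

```go
package main

import (
	"bufio"
	"io"
	"net/http"
)

// streamProxy forwards an upstream SSE body to the client line by line,
// flushing after every write so individual chunks reach the client as soon
// as they arrive rather than once per second. (Hypothetical sketch.)
func streamProxy(w http.ResponseWriter, upstream io.Reader) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")

	scanner := bufio.NewScanner(upstream)
	// Allow for large SSE lines; the default 64 KiB token limit can be too small.
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)

	for scanner.Scan() {
		// Forward the raw SSE line (blank separator lines included) right away.
		// Any per-chunk accounting, such as cost estimation, could be done on a
		// copy of the line in a separate goroutine so it never blocks this loop.
		if _, err := io.WriteString(w, scanner.Text()+"\n"); err != nil {
			return
		}
		flusher.Flush()
	}
}
```

Whether the latency actually comes from buffer size or from synchronous cost estimation would need profiling of the linked proxy code; the sketch only shows the flush-per-chunk pattern that would keep the stream fluent.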