bentoml / BentoML

The easiest way to serve AI/ML models in production - Build Model Inference Service, LLM APIs, Multi-model Inference Graph/Pipelines, LLM/RAG apps, and more!
https://bentoml.com
Apache License 2.0

bug: Concurrent requests with the streaming feature produce parallel calls to the runner #4624

Open Hubert-Bonisseur opened 3 months ago

Hubert-Bonisseur commented 3 months ago

Describe the bug

To enable streaming in BentoML, the Runnable method must return an AsyncGenerator. As a result, calling this method returns immediately, regardless of whether the computation producing the outputs is still running. The Runnable method is therefore always considered complete, so BentoML immediately dispatches every incoming request to the runner, even while a previous generator is still producing output. Consequently, there is no bound on the runner's memory footprint.
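For context, here is a minimal sketch of the pattern described above (class and method names are illustrative, not from the original report): the Runnable method is an async generator, so calling it returns right away, and BentoML may dispatch the next request before the previous stream has finished.

    import asyncio
    from typing import AsyncGenerator

    import bentoml


    class StreamRunnable(bentoml.Runnable):
        SUPPORTED_RESOURCES = ("cpu",)
        SUPPORTS_CPU_MULTI_THREADING = True

        @bentoml.Runnable.method(batchable=False)
        async def predict(self, prompt: str) -> AsyncGenerator[str, None]:
            # Calling predict() only creates the generator; the loop below runs
            # lazily as the caller iterates, so nothing here prevents BentoML
            # from scheduling another request on this runner in the meantime.
            for token in prompt.split():
                await asyncio.sleep(0.1)  # stand-in for real computation
                yield token + " "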

To reproduce

No response

Expected behavior

The service should wait for the first AsyncGenerator to complete before requesting a new one.

A simple fix to this issue is to add a lock at the start of the runnable method:

    import asyncio
    from typing import AsyncGenerator

    def __init__(self):
        # Use an asyncio.Lock so that waiting requests suspend instead of blocking the event loop
        self.predict_lock = asyncio.Lock()

    async def predict(self, input) -> AsyncGenerator[str, None]:
        # Hold the lock for the lifetime of the generator so only one request is computed at a time
        async with self.predict_lock:
            # compute and yield whatever
            yield ""

I think this locking mechanism should either be implemented on the BentoML side, or its necessity should be made clear in the documentation.

Environment

bentoml==1.1.4

xianml commented 1 month ago

Runners can definitely run in parallel; in the vLLM case in particular, we rely on batching to improve performance, so we cannot make such an assumption in BentoML. BTW, if you want to control the concurrency, you can specify max_concurrency via the @bentoml.service decorator.
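For reference, a sketch of that approach, assuming the newer @bentoml.service API (BentoML 1.2+) where traffic settings are passed to the decorator; the service name and the cap value here are illustrative:

    from typing import AsyncGenerator

    import bentoml


    # max_concurrency caps how many requests the service processes in parallel;
    # setting it to 1 serializes requests to the streaming endpoint.
    @bentoml.service(traffic={"max_concurrency": 1})
    class StreamService:

        @bentoml.api
        async def generate(self, prompt: str) -> AsyncGenerator[str, None]:
            for token in prompt.split():
                yield token + " "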

Of course, you can implement such a locking mechanism in your own Bento. Hope that answers your question.