
bug: bentoml task runs progressively slower #5050

Closed: rlleshi closed this 1 day ago

rlleshi commented 3 weeks ago

Describe the bug

Task definition:

    @bentoml.task(
        batchable=True,
        batch_dim=(0, 0),
        max_batch_size=15,
        max_latency_ms=1000)

Normally a batch of 15 image frames runs in about 0.9s when executed synchronously. From the docs I discovered that bentoml offers background tasks out of the box, so I went for that instead of executing the sub-service asynchronously (it would be the only async sub-service of the main service).

However, the execution gets progressively slower after each batch. Not only that, it seems to actually block the rest of the synchronous execution (at least for some time). Latency-wise, it starts off executing in milliseconds (as it should), but by the time we get to the final batch these are the execution times:

2024-10-29T08:20:18+0100 [INFO] [service:Service4:1] _ (scheme=http,method=GET,path=/postprocess/status,type=application/vnd.bentoml+pickle,length=50506324) (status=200,type=application/json,length=136) 9169.608ms (trace=b48c2f58c7a620a03df685f7863429f8,span=74a96aa3f968054f,sampled=0,service.name=Service4)
2024-10-29T08:20:20+0100 [INFO] [service:Service4:1] _ (scheme=http,method=GET,path=/postprocess/get,type=application/vnd.bentoml+pickle,length=50506324) (status=200,type=application/json,length=165888126) 921.565ms (trace=b48c2f58c7a620a03df685f7863429f8,span=5fda59432b2011ac,sampled=0,service.name=Service4)
2024-10-29T08:20:20+0100 [INFO] [service:Service4:1] Task(postprocess) ad42439e754744f5872db94e88aa8350 is submitted (trace=b48c2f58c7a620a03df685f7863429f8,span=f84919adaacaef22,sampled=0,service.name=Service4)
2024-10-29T08:20:20+0100 [INFO] [service:Service4:1] _ (scheme=http,method=POST,path=/postprocess/submit,type=application/vnd.bentoml+pickle,length=50506324) (status=200,type=application/json,length=69) 432.818ms (trace=b48c2f58c7a620a03df685f7863429f8,span=f84919adaacaef22,sampled=0,service.name=Service4)

That is more than 10 seconds for a batch that should take ~0.9s to process, i.e. more than an order of magnitude slower! As a result, the entire execution time more than doubles.

My main service looks something like this:

1. iterate over all the frames of the image in batches
2. execute service 1, takes about 0.9s
3. execute service 2, takes about 0.1s
4. execute service 3, takes about 1.5s
5. execute service 4 (this is the task), which should take about 0.9s

So by the time the next task starts executing, at least 2.5s of execution time has passed (excluding overhead), which should be more than enough for the previous task to have finished.
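In code, the loop is roughly this (a simplified sketch; batches, step, and the intermediate variables are placeholders rather than my real code):

# Simplified sketch of the per-batch loop (all names are placeholders):
self.task = None
for batch_frames in batches(frames, batch_size=15):
    out1 = self.service1.step(batch_frames)  # sync, ~0.9s
    out2 = self.service2.step(out1)          # sync, ~0.1s
    out3 = self.service3.step(out2)          # sync, ~1.5s
    # collect the previous batch's background task before submitting the next
    if self.task:
        results = self.task.get()
    self.task = self.service4.postprocess.submit(out3)  # the task, ~0.9s of work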

In more detail, the task submission in the main service looks like this:

self.task = self.service4.postprocess.submit([
    BatchInput(image=frame, ...)
    for frame, ... in zip(batch_frames, ...)
])

and then I access it later (this happens after services 1-3 have finished executing):

if self.task:
    status = self.task.get_status()
    if status.value == 'failure':
        # do sth...
    else:
        results = self.task.get()
        # do sth...

Am I missing something? Also, other than the above-linked documentation page, is there more documentation I could get on tasks in bentoml?

To reproduce

No response

Expected behavior

The task should execute in the background and not block the main flow, so in the end it should speed up the overall execution instead of slowing it down.

Environment

bentoml: 1.3.3
python: 3.9.0
platform: ubuntu 22.04, 6.5.0-45-generic

aarnphm commented 3 weeks ago

Can you provide your service definition here? Obviously, remove anything sensitive.

But this seems strange.

rlleshi commented 3 weeks ago

Hey, thanks for the quick follow-up.

Sure:

@bentoml.service(
    traffic={
        'timeout': 10,
        'concurrency': 15,
    },
    metrics={
        'enabled': True,
    },
    workers=2,
)
class Service4(BentoWrapper):

    def __init__(self) -> None:
        super().__init__()
        # do stuff

    @bentoml.task(
        batchable=True,
        batch_dim=(0, 0),
        max_batch_size=15,
        max_latency_ms=1000)
    def postprocess(self, inputs: list[BatchInput]) -> torch.Tensor:
        ...

Also updated the issue with some more info regarding the task execution from the main service.

aarnphm commented 3 weeks ago

What does this BentoWrapper class do?

rlleshi commented 3 weeks ago

Ah, I guess the name could be improved there. It's just a base class with common functionality shared by the services.

frostming commented 3 weeks ago

When you implement batch endpoints with sync methods, please note: don't call the endpoint from within another sync API method, because there is a default thread limit of 1.

Try increasing the limit by setting threads=N in the @service() decorator to see if it improves performance.
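For example, applied to a service definition like yours (just a sketch; 8 is an arbitrary value to experiment with):

@bentoml.service(
    traffic={'timeout': 10, 'concurrency': 15},
    workers=2,
    threads=8,  # raise the per-worker thread limit from the default of 1
)
class Service4(BentoWrapper):
    ...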

rlleshi commented 2 weeks ago

@frostming ah, I see. Then perhaps the documentation here should be updated accordingly (namely, that this is not recommended). I wanted to speed up the processing of my distributed service, and tasks seemed like a way to do it in bentoml (apart from async services).

Regarding increasing the number of threads, would you mind pointing me to the documentation where this is described? I can't find anything of substance here or through a general search of your documentation.

rlleshi commented 2 weeks ago

@frostming @aarnphm kind reminder

If you could provide some documentation on threads, that would be great (you said a worker uses one by default; what happens if we assign more than one thread, and why isn't that recommended?).

Also, if I cannot use tasks, what else would you recommend in this case? Again, I have a composite service that orchestrates up to 4 sub-services. If I manage to run the fourth sub-service asynchronously, I can subtract its execution time from the overall service execution time, which would non-trivially speed up the overall service.

AFAIK bento offers tasks & async services for this purpose. I refrained from using async services because the main service is sync and the other 3 services run in a sync fashion. It's only the fourth service that should run async. Hence I went for tasks, but now it seems that even bento tasks aren't quite appropriate.

If I go for the async approach, can I simply use asyncio.run() for the fourth service from the main service? I'm guessing not? One way, as described in the docs here, would be to make the main service async, call sub-services 1-3 as async after converting them (even though they are really sync), and finally call the fourth async sub-service as normal.
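To make that concrete, I imagine something like this inside the main service class (only a sketch of the idea; I'm assuming the to_async conversion on dependent services is the right mechanism, and batches/step are placeholder names):

import asyncio

@bentoml.api
async def run(self, frames: list) -> list:
    results, pending = [], None
    for batch in batches(frames, 15):  # placeholder batching helper
        out = await self.service1.to_async.step(batch)  # sub-services 1-3,
        out = await self.service2.to_async.step(out)    # awaited in order
        out = await self.service3.to_async.step(out)
        if pending is not None:
            results.append(await pending)  # collect the previous batch's result
        # start service 4 without awaiting it, so it overlaps with the next batch
        pending = asyncio.create_task(self.service4.to_async.postprocess(out))
    if pending is not None:
        results.append(await pending)
    return results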

Otherwise, any other ideas?

rlleshi commented 3 days ago

@frostming @aarnphm another kind reminder :)

frostming commented 2 days ago

Regarding increasing the number of threads, would you mind pointing me to the documentation where this is described? I can't find anything of substance here or through a general search of your documentation.

Lacking docs for that part

The threads limit only applies to the sync endpoint, and you shouldn't do concurrent actions inside it. Just make it an async method and it will work.
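i.e. roughly this (a sketch; run and its signature stand in for your actual endpoint, and the body stays unchanged):

# In the main service, change the endpoint that submits the task:
@bentoml.api
async def run(self, frames: list) -> list:  # async def instead of def
    ...  # existing logic unchanged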

rlleshi commented 1 day ago

I wanted to refrain from doing that since most of my services work synchronously, but I guess there's no other choice. Thanks for the follow-up anyway!