aws-samples / aws-stepfunctions-examples

AWS Step Functions is an orchestration service for reliably executing multi-step processes using visual workflows. This repository includes detailed examples that will help you unlock the power of serverless workflow.
MIT No Attribution

[Feature request] Rate/warmup limit on concurrency semaphore example #5

Closed: athewsey closed this issue 2 years ago

athewsey commented 3 years ago

Hi all and thanks for the interesting DDB semaphore concurrency control example!

I'm looking at some potential use-cases (particularly relating to Amazon Textract) where there is a concurrency limit I'd like to enforce but also a rate limit on filling up the available concurrency.

For example, let's say a process can support a maximum concurrency of 100 but can only start executions at approximately 1-2 per second, in an environment where the base state machine could sometimes be triggered in bulk with batches much larger than 100.

Looking at the sample, it seems like maybe this could be worked in... Perhaps if the conditional DDB lock acquisition update was extended to also enforce a constraint based on the current $$.State.EnteredTime and the stored most recent lock acquired time?

From a quick skim of the docs I'm thinking it should be possible to do a simple "time-since-last-grant" check, but probably not possible to go for a full token bucket kind of scheme within the constraints of DynamoDB conditional update expressions. Would appreciate your thoughts!
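To make the "time-since-last-grant" idea concrete, here is a minimal sketch of what the extended conditional update might look like. It is not the sample's actual code: the attribute names (`currentlockcount`, `lastgranttime`), the key name `LockName`, and the helper `build_acquire_request` are all assumptions for illustration. Since DynamoDB condition expressions cannot do arithmetic, the sketch assumes the caller precomputes a cutoff timestamp (now minus the minimum interval) and relies on fixed-width UTC ISO 8601 strings sorting chronologically. It also assumes the lock item already exists with its counter initialised.

```python
from datetime import datetime, timedelta, timezone


def iso_ms(ts: datetime) -> str:
    """Format a timezone-aware datetime as fixed-width UTC ISO 8601 with
    milliseconds, the same shape as $$.State.EnteredTime."""
    ts = ts.astimezone(timezone.utc)
    return ts.strftime("%Y-%m-%dT%H:%M:%S.") + f"{ts.microsecond // 1000:03d}Z"


def build_acquire_request(table, lock_name, execution_id, now, limit, min_interval_s):
    """Build UpdateItem parameters that grant a lock only if (a) the
    concurrency limit is not reached, (b) this execution does not already
    hold a lock, and (c) the last grant is at least min_interval_s old.

    All attribute names here are hypothetical.
    """
    cutoff = now - timedelta(seconds=min_interval_s)
    return {
        "TableName": table,
        "Key": {"LockName": {"S": lock_name}},
        "UpdateExpression": (
            "SET #count = #count + :one, #owner = :now, #last = :now"
        ),
        "ConditionExpression": (
            "#count < :limit AND attribute_not_exists(#owner) "
            "AND (attribute_not_exists(#last) OR #last <= :cutoff)"
        ),
        "ExpressionAttributeNames": {
            "#count": "currentlockcount",  # assumed counter attribute
            "#owner": execution_id,        # one attribute per lock holder
            "#last": "lastgranttime",      # assumed "most recent grant" attribute
        },
        "ExpressionAttributeValues": {
            ":one": {"N": "1"},
            ":limit": {"N": str(limit)},
            ":now": {"S": iso_ms(now)},
            ":cutoff": {"S": iso_ms(cutoff)},
        },
    }
```

With boto3, the returned dict could be passed as `client.update_item(**request)`; a `ConditionalCheckFailedException` would then mean "no grant yet, wait and retry". This only enforces a minimum gap between grants, so it caps the start rate at one per `min_interval_s` rather than implementing a token bucket.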

JustinCallison commented 2 years ago

@athewsey My first thought here is that the best way to handle rate limits is with throttling and retry. If you were using this example as a wrapper for starting Textract jobs, then the workflow you embed in the middle of the example would just handle throttles from Textract and retry with exponential back-off. Step Functions makes that easy with the Retry field on a state. This optimistic approach to rate limits is generally simpler and well supported by AWS services, so I wouldn't recommend trying to proactively manage and limit the rate. Concurrency is more challenging to control, which is why I put this example together.
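As a rough illustration of the retry approach described above, the sketch below builds a Task state with a Retry field as a Python dict and computes the wait schedule it implies. The `Retry` keys (`ErrorEquals`, `IntervalSeconds`, `BackoffRate`, `MaxAttempts`) are standard Amazon States Language fields; the error names, the resource ARN, the state name `Done`, and the helper `retry_delays` are assumptions for illustration and would need checking against the actual integration.

```python
import json

# One retrier: wait 2s before the first retry, doubling each time,
# for at most 5 retries.
THROTTLE_RETRY = {
    # Assumed error names; the real ones depend on how Textract is invoked.
    "ErrorEquals": [
        "Textract.ThrottlingException",
        "Textract.ProvisionedThroughputExceededException",
    ],
    "IntervalSeconds": 2,
    "BackoffRate": 2.0,
    "MaxAttempts": 5,
}

# Hypothetical Task state wrapping the job start call.
START_JOB_STATE = {
    "Type": "Task",
    "Resource": "arn:aws:states:::aws-sdk:textract:startDocumentAnalysis",
    "Retry": [THROTTLE_RETRY],
    "Next": "Done",
}


def retry_delays(retrier):
    """Return the wait in seconds before each retry attempt:
    IntervalSeconds * BackoffRate ** n for n = 0 .. MaxAttempts - 1."""
    return [
        retrier["IntervalSeconds"] * retrier["BackoffRate"] ** n
        for n in range(retrier["MaxAttempts"])
    ]


if __name__ == "__main__":
    print(json.dumps(START_JOB_STATE, indent=2))
    print(retry_delays(THROTTLE_RETRY))
```

With these values the retries would wait 2, 4, 8, 16 and 32 seconds, so a throttled start is retried for roughly a minute before the state fails.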

athewsey commented 2 years ago

Thanks for the response @JustinCallison!

For Textract in particular we were facing some added challenges because it's a multi-step process of:

I think actually you're right that it should be possible to manage these processes well enough with downstream retries and back-off. But at the time, we had some concerns about how that would behave for Textract and the callback pattern in particular.

...So we went ahead and implemented a version of the pattern here in the Textract sample that can either implement the vanilla pattern or (if you provide a warmup_tps_limit) add the warmup TPS constraint.

Sharing here in case it's helpful for anybody to find a (Python) CDK implementation of the semaphore pattern, and would welcome any feedback/issues you folks have on it!