awslabs / aws-solutions-constructs

The AWS Solutions Constructs Library is an open-source extension of the AWS Cloud Development Kit (AWS CDK) that provides multi-service, well-architected patterns for quickly defining solutions
https://docs.aws.amazon.com/solutions/latest/constructs/
Apache License 2.0
1.23k stars 246 forks

New Pattern: add Stepfunction Callback with the Task Token (Service Integration Pattern) #248

Open knihit opened 3 years ago

knihit commented 3 years ago

Step Functions has a service integration pattern called 'Callback with Task Token' (https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html). It lets a slower backend respond in its own time instead of being throttled and failing.

Use Case

With some AI services such as Rekognition and Comprehend, the TPS soft limits may not be able to handle bursts or high volumes of streaming data. Combining a Step Functions workflow with retries and data streams such as MSK or Kinesis does reduce throttling issues, but it does not solve the problem of a Step Functions task that has a slow backend.

Proposed Solution

Using the callback pattern makes the Step Functions task wait for a response (through a task token) before proceeding to the next state. The proposal is to use an SQS queue with a Lambda consumer that calls the service and returns the response. Lambda's concurrency settings and batching of messages from the queue provide good control over throttling issues.
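To make the waiting step concrete, here is a minimal sketch of the Amazon States Language task state such a pattern might render to. The function name `build_callback_state`, the `input`/`taskToken` message layout, and the timeout value are assumptions for illustration, not part of any existing construct. The `.waitForTaskToken` resource suffix and the `$$.Task.Token` context path are the documented Step Functions mechanism for this pattern.

```python
import json


def build_callback_state(queue_url: str, next_state: str, timeout_seconds: int = 3600) -> dict:
    """Build an ASL Task state that enqueues work on SQS and waits for a callback.

    Hypothetical helper: the name and the message layout are illustrative only.
    """
    return {
        "Type": "Task",
        # The .waitForTaskToken suffix pauses the execution until
        # SendTaskSuccess or SendTaskFailure is called with the token.
        "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
        "Parameters": {
            "QueueUrl": queue_url,
            "MessageBody": {
                # Forward the state input and the task token to the consumer.
                "input.$": "$",
                "taskToken.$": "$$.Task.Token",
            },
        },
        # Fail the state if no callback arrives in time, so executions do not hang.
        "TimeoutSeconds": timeout_seconds,
        "Next": next_state,
    }


if __name__ == "__main__":
    # Placeholder queue URL and state name, for illustration only.
    state = build_callback_state("https://example.invalid/my-queue", "ProcessResult")
    print(json.dumps(state, indent=2))
```

Setting a timeout is a design choice in this sketch: without one, an execution whose message is lost would stay paused until the state machine's own limits apply.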

As part of this construct, the proposal is to create a state machine fragment (https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-stepfunctions.StateMachineFragment.html) that can be chained into a state machine definition.

Other

https://zaccharles.medium.com/async-callbacks-with-aws-step-functions-and-task-tokens-9df97fe8973c
https://docs.aws.amazon.com/step-functions/latest/dg/connect-to-resource.html
https://docs.aws.amazon.com/cdk/api/latest/docs/@aws-cdk_aws-stepfunctions.StateMachineFragment.html


This is a :rocket: Feature Request

biffgaut commented 3 years ago

While this is not an area we can focus on in the near future, we'll add it to the backlog because the use case is something we'd be interested in for the future. Here's my understanding after further chats with Nihit. Much of this is a restatement of what Nihit says above, which I didn't fully grok the first time:

So the use case is to have the state machine put the task in an SQS queue then pause the state machine to wait for notification that the async task has finished. A throttled Lambda function removes the task from the queue when concurrency allows, executes the task, then calls BACK to the state machine with the completion status of the task so the state machine can continue.
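The consumer side of that flow could look roughly like the sketch below. It assumes each SQS message body is JSON with hypothetical `taskToken` and `input` fields, and it takes the Step Functions client and the backend call as parameters so the logic can be exercised without AWS. In a real Lambda the client would presumably be `boto3.client("stepfunctions")`, whose `send_task_success` and `send_task_failure` methods accept the keyword arguments used here. The function name `handle_sqs_batch` is illustrative.

```python
import json
from typing import Any, Callable


def handle_sqs_batch(event: dict, sfn_client: Any, call_backend: Callable[[dict], dict]) -> int:
    """Process a batch of SQS records and report each outcome to Step Functions.

    Returns the number of tasks reported as successful.
    Assumes each record body is JSON with 'taskToken' and 'input' keys.
    """
    succeeded = 0
    for record in event.get("Records", []):
        message = json.loads(record["body"])
        token = message["taskToken"]
        try:
            # Call the slow or rate-limited service (for example an AI API).
            result = call_backend(message["input"])
        except Exception as exc:
            # Tell the paused state machine that the task failed so its own
            # Retry/Catch rules can decide what happens next.
            sfn_client.send_task_failure(
                taskToken=token,
                error=type(exc).__name__,
                cause=str(exc),
            )
            continue
        # Resume the paused execution with the backend's response as output.
        sfn_client.send_task_success(taskToken=token, output=json.dumps(result))
        succeeded += 1
    return succeeded
```

Reporting failures back to the state machine, rather than re-raising so that SQS redelivers the message, is one possible choice; which is preferable depends on where retries should live.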

We're still trying to determine if this entire use case is a Solutions Construct, or something that can be built with multiple Solutions Constructs. It does illustrate that we have no constructs of the form stepfunctions-awsservice, such as stepfunctions-lambda or stepfunctions-sqs. Nihit will be entering a new issue to start addressing that area.

We're going to start with creating the smaller constructs that could create this use case, and add this to our backlog to consider in the future when/if we start creating more targeted, less granular Solutions Constructs.