Automated Failure Recovery for Asynchronous Consumers

Is your feature request related to a problem? Please describe.

Per the 2023 roadmap, AWS deployments that rely on asynchronous Lambda triggers (e.g., Amazon S3) are at risk of data loss because of retry limits: Lambda retries a failed asynchronous invocation at most twice, and after that the event is lost. We can fix this by adding support for failure destinations and by automating recovery so that failed events get additional retry attempts.

Describe the solution you'd like

Add support for the following:

Failure destinations in our Terraform configurations
Terraform module(s) that implement automated recovery via a queue (SQS is recommended, but Lambda destinations support SNS, EventBridge, and Lambda as well)
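
As a rough illustration of the two items above, the sketch below attaches an SQS failure destination to an existing asynchronous consumer and feeds that queue to a recovery function. It is a minimal sketch, not a proposed module layout: every variable and resource name (`consumer_function_name`, `consumer_role_name`, `recovery_function_arn`, `failures`) is hypothetical, and the recovery function is assumed to exist and to have permission to read from the queue.

```hcl
# Hypothetical inputs: an existing async consumer, its execution role,
# and a separate function that re-submits failed events.
variable "consumer_function_name" { type = string }
variable "consumer_role_name" { type = string }
variable "recovery_function_arn" { type = string }

# Queue that holds events the consumer could not process.
resource "aws_sqs_queue" "failures" {
  name                      = "consumer-failures"
  message_retention_seconds = 1209600 # 14 days, the SQS maximum
}

# Send events to the queue once Lambda's built-in async retries are exhausted.
resource "aws_lambda_function_event_invoke_config" "consumer" {
  function_name                = var.consumer_function_name
  maximum_retry_attempts       = 2     # allowed range is 0-2
  maximum_event_age_in_seconds = 21600 # 6 hours, the maximum

  destination_config {
    on_failure {
      destination = aws_sqs_queue.failures.arn
    }
  }
}

# The consumer's execution role must be allowed to write to the queue.
data "aws_iam_policy_document" "send_failures" {
  statement {
    actions   = ["sqs:SendMessage"]
    resources = [aws_sqs_queue.failures.arn]
  }
}

resource "aws_iam_role_policy" "send_failures" {
  name   = "send-failures"
  role   = var.consumer_role_name
  policy = data.aws_iam_policy_document.send_failures.json
}

# Automated recovery: the (assumed) recovery function polls the failure queue.
resource "aws_lambda_event_source_mapping" "recovery" {
  event_source_arn = aws_sqs_queue.failures.arn
  function_name    = var.recovery_function_arn
  batch_size       = 10
}
```

Note that a failure destination receives an invocation record that wraps the original event (under `requestPayload`), so a recovery function would need to unwrap it before re-submitting. This is likely where the code changes mentioned below come in.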

This may require some code changes, but most changes should be made via infrastructure as code.

Describe alternatives you've considered

We recommend that users use Kinesis as intermediary storage for all data pipelines. With a Kinesis trigger, Lambda retries a failed batch until the records expire from the stream (24 hours by default). An example of this design is here.
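
For comparison, the Kinesis alternative might be wired roughly as follows. This is a sketch under assumptions, not the linked example: `stream_arn` and `consumer_function_arn` are hypothetical inputs.

```hcl
# Hypothetical inputs: an existing Kinesis stream and consumer function.
variable "stream_arn" { type = string }
variable "consumer_function_arn" { type = string }

resource "aws_lambda_event_source_mapping" "kinesis" {
  event_source_arn  = var.stream_arn
  function_name     = var.consumer_function_arn
  starting_position = "TRIM_HORIZON"

  # -1 (the default) retries a failed batch until its records expire
  # from the stream, which is the behavior described above.
  maximum_retry_attempts = -1
}
```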

Additional context

N/A

brexhq / substation #57