aws-powertools / powertools-lambda

MIT No Attribution

Add exponential back-off with jitter in SQS Batch Processor in case of temporary errors #22

Open aagrawal207 opened 3 years ago

aagrawal207 commented 3 years ago

Is your feature request related to a problem? Please describe. We use the SQS Batch Processing utility from Powertools. When a message fails, it stays in the queue until it has been received enough times to be moved to the DLQ. The gap between processing attempts for a failed message is equal to or greater than the queue's configured visibilityTimeout. We can override this visibilityTimeout for individual messages when they fail, during exception handling.
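For illustration, a minimal sketch of that per-message override, assuming a boto3 SQS client and a record shaped like the ones Lambda delivers from SQS. The helper name `retry_later` and its parameters are hypothetical and not part of Powertools.

```python
# SQS does not accept a visibility timeout above 12 hours.
MAX_VISIBILITY_SECONDS = 43200


def retry_later(sqs_client, queue_url, record, delay_seconds):
    """Hide one failed message for delay_seconds before SQS redelivers it.

    sqs_client is expected to behave like boto3.client("sqs");
    record is a single entry from event["Records"] in the Lambda SQS event.
    """
    # Clamp the requested delay to the range SQS accepts.
    timeout = max(0, min(int(delay_seconds), MAX_VISIBILITY_SECONDS))
    sqs_client.change_message_visibility(
        QueueUrl=queue_url,
        ReceiptHandle=record["receiptHandle"],
        VisibilityTimeout=timeout,
    )
    return timeout
```

The client is passed in rather than created inside the helper so that it can be swapped for a stub in tests.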

Describe the solution you'd like I would like the SQS Batch Processor utility to provide this functionality by default so that we don't need to write it ourselves. This also ties in nicely with another feature request of mine: awslabs/aws-lambda-powertools-roadmap#21.

Describe alternatives you've considered Writing my own utility code using the references I found on the internet like this: https://ivan-site.com/2018/06/exponential-backoff-in-sqs/.

Additional context This is a pretty common use case.

pankajagrawal16 commented 3 years ago

Hi @as2d3, thanks for opening the issue. It does sound like an interesting use case to support in the batch processing utility.

@heitorlessa Thoughts?

heitorlessa commented 3 years ago

Do I understand correctly that you'll want to set a different wait time per individual message that failed? e.g. set jitter on different failed messages

Then differentiate between temporal vs permanent failures to do this differently?

When you say "it's a pretty common use case", could you expand on that with examples?

At first glance, I see why you'd want this, but I'd wait for more customers to ask for it. As we don't fully control the Poller, messages could be delivered more than once, and SQS could throttle depending on how many messages are being set with a different visibility timeout. I'll have more questions about this approach once I have a clearer head.

I'd also want us to have a chat with the Lambda team - This is the responsibility of the SQS Consumer (Lambda Poller). Like Kinesis, they could accept additional context in the response that could technically do what this utility does too.

@pankajagrawal16 let's find some time to sync next week over this and have a word with the Lambda team.

Thanks a lot for raising it!!!

MartinMitro commented 1 year ago

Hey, @heitorlessa ,

I can add an example: we are calling a third-party service that is often down, sometimes for a few hours. Once it is down, it does not make sense to call it every minute. Instead, we would exponentially increase the visibility timeout of the message, up to a configured maximum (let's say half an hour).
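To make that concrete, here is a rough sketch of a capped exponential delay with jitter. The 60-second base and 1800-second cap only mirror the "every minute" and "half an hour" figures above, the function name is hypothetical, and the "equal jitter" variant (half fixed, half random) is one choice among several.

```python
import random


def backoff_seconds(receive_count, base_seconds=60, cap_seconds=1800, rng=random):
    """Return a jittered delay in whole seconds for the given attempt number.

    receive_count is how many times the message has been received so far,
    e.g. the ApproximateReceiveCount attribute of an SQS message.
    """
    attempt = max(int(receive_count), 1)
    # Double the delay on every attempt, but never exceed the cap.
    ceiling = min(cap_seconds, base_seconds * 2 ** (attempt - 1))
    # Equal jitter: keep half of the delay fixed and randomize the rest,
    # so retries spread out without ever collapsing to a near-zero wait.
    half = ceiling // 2
    return half + rng.randint(0, ceiling - half)
```

With these defaults the upper bound grows as 60, 120, 240, 480, 960 seconds and then stays at 1800 from the sixth receive onwards.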

"Then differentiate between temporal vs permanent failures to do this differently?"

Yes, we still want to differentiate between permanent failures (remove from queue) and temporal failures (retry with back-off).
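One way this split could look in a handler, as a sketch only: the exception classes and `process_batch` are hypothetical names, and the return value assumes the Lambda event source mapping has `ReportBatchItemFailures` enabled, so that only the listed messages return to the queue.

```python
class PermanentError(Exception):
    """The message can never succeed (for example, a malformed payload)."""


class TransientError(Exception):
    """The message may succeed later (for example, a dependency is down)."""


def process_batch(records, handler):
    """Run handler on each SQS record and report only retryable failures."""
    retry_ids = []
    for record in records:
        try:
            handler(record)
        except PermanentError:
            # Not reported as a failure, so the message is deleted with the
            # rest of the batch instead of being retried.
            continue
        except TransientError:
            # Reported as a failure, so the message stays in the queue.
            retry_ids.append(record["messageId"])
    return {"batchItemFailures": [{"itemIdentifier": mid} for mid in retry_ids]}
```

Any other exception type propagates and fails the whole invocation, which keeps unexpected errors visible rather than silently dropping messages.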