aws-powertools / powertools-lambda

MIT No Attribution
73 stars 4 forks source link

Ability to distinguish Temporary and Permanent Exceptions in SQS Batch Processing #21

Open aagrawal207 opened 3 years ago

aagrawal207 commented 3 years ago

Is your feature request related to a problem? Please describe. While processing SQS messages in batch, I want to distinguish between Temporary and Permanent Exceptions so that in case of temporary failures, message return to the main queue and in case of permanent failure, it moves to either DLQ or deleted completely as per the requirement.

Describe the solution you'd like Simple conversion of exception thrown from SqsMessageHandler implementation from their respective exception to TemporaryFailureException or PermanentFailureException with some defined logic. Then handler logic defined as well for those exceptions.

Describe alternatives you've considered Writing my own alternative exception handler which I am still figuring out.

Additional context This seems to me a fairly common use-case and should be a well-understood problem.

pankajagrawal16 commented 3 years ago

Hi @as2d3 Thanks for opening the feature request. I am wondering what would the UX be like? If I understand correctly, Intention here is to predict which of the exceptions thrown while processing an event is temporary or permanent?

Since API don't control what is the logic inside implementation of SqsMessageHandler, it's difficult to predict for the library which ones are permanent/temporary.

Could you please elaborate a bit on this? With some examples?

aagrawal207 commented 3 years ago

Hi @pankajagrawal16, currently SQS Batch processing throws only SQSBatchProcessingException. Basically, I am two new exceptions instead of this old one, something like SQSBatchProcessingTemporaryException and SQSBatchProcessingPermanentException.

So, currently I have implemented my own exceptions like these:

public class NotificationValidationException extends SQSBatchProcessingException {
    public NotificationValidationException(final String message, final Throwable cause) {
        super(message, cause);
    }
}

and

public class MyServiceRetryableException extends SQSBatchProcessingException {
    public NotificationValidationException(final String message, final Throwable cause) {
        super(message, cause);
    }
}

My SQS processor throws NotificationValidationException and MyServiceRetryableException instead of SQSBatchProcessingException. My batch processing Lambda Handler catches SQSBatchProcessingException like this:

public String handleRequest(final SQSEvent sqsEvent, final Context context) {
        try {
            LOGGER.debug("Processing SQS message batch with SQSEvent: {}", sqsEvent);
            awsLambdaPowertoolsSqsUtilsWrapper.batchProcessor(sqsEvent, notificationProcessor);
        } catch (SQSBatchProcessingException sqsBatchProcessingException) {
            final List<SQSEvent.SQSMessage> failures = sqsBatchProcessingException.getFailures();
            LOGGER.error("(Partial) batch of SQS messages failed: {}", failures);
            throw sqsBatchProcessingException;
        }

        return "{\"statusCode\": 200}";
}

The issue is that after the handler processes the batch, all the failed messages move back to main queue. Instead the logic could be like this:

public String handleRequest(final SQSEvent sqsEvent, final Context context) {
        try {
            LOGGER.debug("Processing SQS message batch with SQSEvent: {}", sqsEvent);
            awsLambdaPowertoolsSqsUtilsWrapper.batchProcessor(sqsEvent, notificationProcessor);
        } catch (SQSBatchProcessingTemporaryException sqsBatchProcessingTemporaryException) {
            final List<SQSEvent.SQSMessage> failures = sqsBatchProcessingTemporaryException.getFailures();
            LOGGER.error("(Partial) batch of SQS messages failed, will be moved back to main queue: {}", failures);
            throw sqsBatchProcessingTemporaryException;
        } catch (SQSBatchProcessingPermanentException sqsBatchProcessingPermanentException) {
            final List<SQSEvent.SQSMessage> failures = sqsBatchProcessingPermanentException.getFailures();
            LOGGER.error("(Partial) batch of SQS messages failed, will be moved to DLQ directly: {}", failures);
            throw sqsBatchProcessingPermanentException;
        }

        return "{\"statusCode\": 200}";
}

Here, the ask is basically, if the message schema validation type exception is thrown, then there is no reason to move it back to the main queue and process it again later. We would like to directly move these to DLQ or delete it.

pankajagrawal16 commented 3 years ago

Thanks for the response @as2d3. I understand what is the intent of the feature request now.

That being said, it will be too much assumption on behalf of library to classify which of the errors/exceptions should be retried and which one should not be. That's why we provide generic implementation with SQSBatchProcessingException wrapping all the details of exception which is thrown by user provided implemenation of SqsMessageHandler.

So in your use case, the SQSBatchProcessingException still has all the underlying exception for you to decide what should be done with the SqsEvent next. It might be, you will like to move some to DLQ or want them to be retried.

I guess one way to support this use case better will be to accept from user which exception types are classified as "Move to DLQ exception" and then the utility can take care of moving those messages to DLQ instead of letting them being retried.

aagrawal207 commented 3 years ago

I just wanted to present an example, design could be something else. What I really want from the library is to provide an easy way to move the failed message directly to DLQ instead of moving it back to the main queue. I think this is a very useful use-case and (imho) is a must feature of an SQS consumer library.

Library does not need to define temporary and permanent exceptions I would say, but there should be a simple utility to move the message to DLQ or delete it completely instead which can be used in exception handling at implementation level.

pankajagrawal16 commented 3 years ago

I just wanted to present an example, design could be something else. What I really want from the library is to provide an easy way to move the failed message directly to DLQ instead of moving it back to the main queue. I think this is a very useful use-case and (imho) is a must feature of an SQS consumer library.

Library does not need to define temporary and permanent exceptions I would say, but there should be a simple utility to move the message to DLQ or delete it completely instead which can be used in exception handling at implementation level.

Rite, I definitely agree with that. It will be an excellent feature to be supported as part of this utility.

pankajagrawal16 commented 3 years ago

@heitorlessa I brought this issue to roadmap now, Since i believe this might be an interesting use case to support in both python and java version as both has SQS batch processor utility.

Idea here is to provide an easy way where user can configure utility to mark certain failures while processing a SQS message in the batch to be moved to a DLQ.

Initial thought was maybe we could have api accept a list of user supplied exception, which when thrown while processing the message, utility can move those messages to DLQ instead.

Let me know your thoughts?

heitorlessa commented 3 years ago

Just moved to backlog as we should do it. @cakepietoast and I discussed in the original RFC but didn't come around to do it.

That said, @pankajagrawal16 do you wanna create a RFC so we can discuss some implementation details? We can link to this issue, then tackle that until TS and C# work on their core implementations.

This will help @saragerion @sliedig @t1agob to chime in as Exceptions don't translate well in TS, for example.

pankajagrawal16 commented 3 years ago

@heitorlessa Sounds good. I will do that sometime today. I have some ideas on how this can look from UX perspective in Java.

heitorlessa commented 3 years ago

Perfect. Gimme a ping so I don't miss :D I can add the Python bit later

On Tue, 27 Jul 2021 at 09:10, Pankaj Agrawal @.***> wrote:

@heitorlessa https://github.com/heitorlessa Sounds good. I will do that sometime today. I have some ideas on how this can look from UX perspective in Java.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/awslabs/aws-lambda-powertools-roadmap/issues/21#issuecomment-887268939, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZPQBCBTMVZPJBV5YKYAYTTZZLVRANCNFSM4736MPDQ .

pankajagrawal16 commented 3 years ago

@heitorlessa Here you go https://github.com/awslabs/aws-lambda-powertools-roadmap/issues/29

heitorlessa commented 3 years ago

cc @saragerion @cakepietoast as I added all_runtimes tag

pankajagrawal16 commented 3 years ago

@as2d3 this is now available for Java https://awslabs.github.io/aws-lambda-powertools-java/utilities/batch/#move-non-retryable-messages-to-a-dead-letter-queue.

https://github.com/awslabs/aws-lambda-powertools-java/releases/tag/v1.7.3

Try it out and let us know in case of any issues.

aagrawal207 commented 2 years ago

Hi @pankajagrawal16,

Thanks for implementing the feature request so quickly. I tried the change recently and have few pointers:

  1. When an SQS message comes under the nonRetryableException list, the exception for it is consumed without any trace for it even when suppressException is false. I think there should be a way to know which of the messages moved to DLQ and why. I understand the intention is to safely exit the function if the messages are moved to DLQ or deleted but then we might want to give that data to upstream which is currently not possible.
  2. In Java implementation, use of generic and varargs combination (Class<? extends Exception>... nonRetryableExceptions) is generally not preferred. I understand that this might be done to keep methods backward compatible but a better way would have been to introduce a list of exceptions instead.
  3. The list of arguments and return type might keep increasing as more and more feature requests like this will come in future. Instead of including more and more arguments, preferred way would be to introduce a single request and response object and put the attributes in that. Currently, there are too many boolean requests which make the method very ugly and difficult to use. Also, if you see the new 2.0 version AwsJavaSdk, they also make a single request and response object instead of take all the attributes of the request individually.

My usecase: I am invoking my Lambda function directly from my integration tests for both positive happy paths and negative paths as well. For the negative paths, I configured the messages to be completely deleted which is happening as per the requirement. This issue I am facing is that the exception for which the message is deleted is not reported back to the user, and so I am unable to validate the same in my integration tests. The only way the user can validate it is by checking logs manually for the deleted message.

I would love to know your thoughts on this.