awslabs / aws-dotnet-messaging

An AWS-native framework that simplifies the development of .NET message processing applications that use AWS services, such as SQS, SNS, and EventBridge.
https://awslabs.github.io/aws-dotnet-messaging/
Apache License 2.0
76 stars 15 forks source link

ILambdaMessaging: Reset message visibility timeout when handler returns failed status #150

Open brendonparker opened 2 months ago

brendonparker commented 2 months ago

Describe the feature

Not sure if this a feature or a bug.

It appears there is some handling for resetting the visibility timeout when using the SQS Poller. However, the same feature doesn't exist when using the ILambdaMessaging.ProcessLambdaEventWithBatchResponseAsync.

Use Case

When the handler returns a Failed status, the message visibility should be reset, so that it can be immediately re-attempted and doesn't block that message group from processing the next message.

Proposed Solution

No response

Other Information

Using FIFO queue

Acknowledgements

AWS.Messaging (or related) package versions

AWS.Messaging.Lambda 0.9.1

Targeted .NET Platform

.NET 8

Operating System and version

AmazonLinux/Lambda

brendonparker commented 2 months ago

Actually, it looks like this behavior is already called out as a potential TODO here:

awslabs/aws-dotnet-messaging/docs/docs/design/message-visibility-timeout-handling.md

Having some configurability around this would be nice. I can probably work around it in the short term, by manually changing the visibility back to zero.

brendonparker commented 2 months ago

This is my current work-around. Rolling my own handling of this, Would be nice to be handled by "the framework".

public async Task<SQSBatchResponse> HandleAsync(SQSEvent evnt, ILambdaContext context)
{
    using var scope = Host.Services.CreateScope();
    var lambdaMessaging = scope.ServiceProvider.GetRequiredService<ILambdaMessaging>();
    var batchResponse = await lambdaMessaging.ProcessLambdaEventWithBatchResponseAsync(evnt, context);
    await ChangeMessageVisibilityForFailures(evnt, scope, batchResponse);
    return batchResponse;
}

private static async Task ChangeMessageVisibilityForFailures(SQSEvent evnt, IServiceScope scope, SQSBatchResponse batchResponse)
{
    var failureCount = batchResponse.BatchItemFailures.Count;
    if (failureCount == 0) return;

    using var sqs = scope.Resolve<IAmazonSQS>();
    var lookup = evnt.Records.ToDictionary(x => x.MessageId);

    if (failureCount == 1)
    {
        await sqs.ChangeMessageVisibilityAsync(new()
        {
            QueueUrl = AwsMessagingExtensions.MainQueueUrl,
            ReceiptHandle = lookup[batchResponse.BatchItemFailures[0].ItemIdentifier].ReceiptHandle,
            VisibilityTimeout = 0
        });
        return;
    }

    await sqs.ChangeMessageVisibilityBatchAsync(new()
    {
        QueueUrl = AwsMessagingExtensions.MainQueueUrl,
        Entries = batchResponse.BatchItemFailures
            .Select(x => new ChangeMessageVisibilityBatchRequestEntry
            {
                Id = x.ItemIdentifier,
                ReceiptHandle = lookup[x.ItemIdentifier].ReceiptHandle,
                VisibilityTimeout = 0
            }).
            ToList()
    });
}
ashishdhingra commented 2 months ago

Needs review with the team.

normj commented 2 months ago

I wouldn't by default reset visibility to 0 if the handler failed. The failure that would be successful in a retry is often due to a transit dependency issue that might need sometime to recover. But I can see use cases where you do want to retry right away especially in a FIFO environment so you aren't blocking other messages in the group like you said.

brendonparker commented 2 months ago

Fair point that a default of 0 may not be best. But some OOTB configurability around what that number is would fit the bill.

May not be best practice, but I've typically setup my default visibility timeout to mirror my lambda timeout (15 minutes), with a MaxReceiveCount to 1 or 2 and reset visibility to zero to retry effectively immediately. Otherwise with the 15 minute timeout you get some pretty big backups in the queue.

brendonparker commented 2 months ago

Happy to take a stab at a PR if you're open to it. But don't want to spend time on it if you'd like to leave it out.

normj commented 2 months ago

I'm happy to take a PR for an opt in feature to turn this on.