[SQS] Exceeding/Changing visibility timeout may lead to messages being processed twice in FIFO queues

otbe commented 3 years ago

Type: Bug

Component: SQS

Describe the bug Im not quite sure if this is a bug report, a feature request or just a misunderstanding on my side. So please bear with me 😇

We use a fifo queue to process "commands" that affect a 3rd party system. We use a message group id that groups commands that should be processed sequentially. One day we saw an increased latency in processing these commands and we saw exceptions like the following in our logs

com.amazonaws.services.sqs.model.AmazonSQSException: Value <receiptHandle> for parameter ReceiptHandle is invalid. Reason: The receipt handle has expired.
...
com.amazonaws.services.sqs.AmazonSQSClient.executeDeleteMessage(AmazonSQSClient.java:920)
...

The reason for this is simple, we have hit our visibility timeout. Every message (of the same receiveMessage call) after the first exception also printed the same exception.

That made me think about the implementation of the MessageGroupExecutor here. Theres https://github.com/awspring/spring-cloud-aws/pull/192 that fixes the case where we should stop processing messages whenever an exception was raised but I think theres another issue.

Imagine we call receiveMessage with the default message count of 10 and (worst case scenario) we receive 10 messages for the same message group. As soon as we receive them the default visibility timeout of the queue will start ticking. Lets say we didnt change the default, this means we have a default of 30s for each message. Really? I think no. In the scenario above we will have 30s for all 10 messages because if we spend 30s to process the first 5 messages, we wont be able to delete the sixth messages (see error above).

That basically means we have processed the message, we were not able to delete the message and it will be visible to the other consumers. Actually we will process it twice.

I can only think of some ways to work around this right now (from a userland perspective):

settings the maxMessageCount to 1 (which will affect throughput and increase cost)
always use Visibility when using fifo queues and call extend at the very beginning of your handler to force an exception if the receiptHandle has already timed out

Option 2 is basically our current approach. We have a little helper class that runs "side effects" of SQS handlers which basically will call extend as the very first action and then starts a timer task that extends the timeout every X while executing the real action. At the end we introduced an eager check if the messages is still valid in the sense of calling an SQS API 🤔

Of course this will only work after https://github.com/awspring/spring-cloud-aws/pull/192 has been merged to master/released.

Would be happy if someone with deeper knowledge could enlighten me here? Am I totally wrong with assumptions about SQS, Fifo or the implementation here? 😇

maciejwalkowiak commented 2 years ago

@otbe your concerns are valid, I am not sure at this point what is the right approach. It sounds like a general problem with SQS FIFO rather than something Spring Cloud AWS specific, but lets keep this issue open for some time, perhaps we will get some insights from the community.

maciejwalkowiak commented 1 year ago

@tomazfernandes do you have perhaps some thoughts on this?

tomazfernandes commented 1 year ago

Sure, in 3.0 simply set "messageVisibilitySeconds" and, for FIFO queues, the visibility will be extended for each message in the group before processing, not unlike the @otbe's solution if I understand correctly.

Just make sure one message doesn't take longer than the time set.

https://github.com/awspring/spring-cloud-aws/blob/0c1cec285bdceb11d1e5cfa2c9d8b292cee7c859/spring-cloud-aws-sqs/src/main/java/io/awspring/cloud/sqs/annotation/SqsListener.java#L133-L138

maciejwalkowiak commented 1 year ago

Ok so this is a non issue in 3.0, right?

tomazfernandes commented 1 year ago

Ok so this is a non issue in 3.0, right?

Yup, using this feature in 3.0, the visibility extension is applied to all unprocessed messages as each message gets processed, so it won't be a problem.

A feature we could look into in the future would be having a separate process to extend message visibility even as the message is being processed, but that seems not too idiomatic TBH. But I've seen it in an SQS Lib somewhere.

awspring / spring-cloud-aws

[SQS] Exceeding/Changing visibility timeout may lead to messages being processed twice in FIFO queues #196