JaidenAshmore / java-dynamic-sqs-listener

Java SQS Listener library built to be customisable and dynamic during runtime
MIT License
50 stars 12 forks source link

@FifoQueueListener retry does not work when FIFO queue has messages of same message group id #369

Open johnnycashash opened 2 years ago

johnnycashash commented 2 years ago

Using @FifoQueueListener with concurrency of 10 in spring boot application.

SQS FIFO queue has configured visibility timeout of 10 seconds and configured DLQ with max receives count 3.

First: Putting 2 messages in FIFO queue with same message group id and different deduplication id. Second: Starting Spring boot consumer @FifoQueueListener application.

3 attempts are done on 1st message dropped in queue as exception is thrown from the process. But not noticing any attempt done on 2nd message and sometimes 3 attempts(APPROXIMATE_RECEIVE_COUNT -1,2,3) done on 1st message and only one attempt done on 2nd message(APPROXIMATE_RECEIVE_COUNT - 3).

And finally both messages end up in DLQ.

Expectation should be 3 attempts done on both the messages before putting messages in DLQ.

Noticed that on application start, both messages move to inflight(means both are read in 1st poll). Maybe the poller for @FifoQueueListener keeps on increasing the receive count of 2nd message while first message attempts are going on.

Also tried increasing the visibility on sqs to 1 hour and overriding visibility to 5 seconds in @FifoQueueListener.

johnnycashash commented 2 years ago

Attaching debug logs:

Started MessageRetriever c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 0 c.j.s.b.c.ConcurrentMessageBroker : Changing concurrency from 0 to 10 c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages c.j.s.r.b.BatchingMessageRetriever : Downloaded 2 messages c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 10 c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages

print logs for first message attempt 1

c.j.s.b.grouping.GroupingMessageBroker : Error processing message com.jashmore.sqs.processor.MessageProcessingException

c.j.s.r.b.BatchingMessageRetriever : Downloaded 2 messages c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 11 c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages

print logs for first message attempt 2

c.j.s.b.grouping.GroupingMessageBroker : Error processing message com.jashmore.sqs.processor.MessageProcessingException

c.j.s.r.b.BatchingMessageRetriever : Downloaded 2 messages c.j.s.r.b.BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 12 c.j.s.r.b.BatchingMessageRetriever : Requesting 10 messages

print logs for first message attempt 3

c.j.s.b.grouping.GroupingMessageBroker : Error processing message com.jashmore.sqs.processor.MessageProcessingException

At every attempt the receive count of 2nd message keeps incrementing without processing and both messages end up in DLQ

Maybe it an expected scenario but if both messages are consumed in single poll, then 2nd message will go to DLQ without processing and if 2nd message is read in 2nd retry poll, 2nd message is attempted twice and if 2nd message is read in some poll when all retry are finished for 1st message then 2nd message is retried 3 times.

Can there be consistency in such scenarios or this is expected/recommended from SQS FIFO queue?

JaidenAshmore commented 2 years ago

Hey, yeah so from memory if the first message fails to be processed it will not attempt the second message because that breaks the FIFO ordering. E.g in this scenario:

You can see in this scenario that you have broken the FIFO ordering if you allow the second message to be processed after the first has failed.

Are you seeing cases where the first message is being placed into the DLQ but then it does attempt the second message afterwards?

johnnycashash commented 2 years ago

Hi, Below are the scenarios: (Let me know if the consumer understanding is correct)

firstmessage - m1, secondmessage - m2 . Both are part of same message group Sent back - Becomes visible after timeout

Scenario 1: (when m1 is successfully sent to DLQ and then m2 is placed in queue)

Scenario 2: (when m1 is retried 1st time and then m2 is placed in queue)

Attaching logs for scenario 2 for clarity: Scenario 2 logs:

BatchingMessageRetriever : Started MessageRetriever BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 10 BatchingMessageRetriever : Requesting 10 messages BatchingMessageRetriever : Downloaded 1 messages (m1) BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 10 BatchingMessageRetriever : Requesting 10 messages Received Message with attempt::1 (m1)

Exception for m1

BatchingMessageRetriever : Downloaded 2 messages (m1,m2) BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 11 BatchingMessageRetriever : Requesting 10 messages Received Message with attempt::2 (m1)

Exception for m1

BatchingMessageRetriever : Downloaded 2 messages (m1,m2) BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 12 BatchingMessageRetriever : Requesting 10 messages Received Message with attempt::3 (m1)

Exception for m1 POINT - details mentioned below BatchingMessageRetriever : Downloaded 1 messages (m2) BatchingMessageRetriever : Waiting for 10 requests for messages within 5000ms. Total currently waiting: 13 BatchingMessageRetriever : Requesting 10 messages Received Message with attempt::3 (m2)

Exception for m2

So if both m1 and m2 are received first time in same poll, then m2 is never retried if m1 keeps on failing and both goes to DLQ. But if m1 is received alone in 1st poll and m2 is received in next poll then m2 is retried at least once even if m1 keeps on failing and is still not put in DLQ.

Adding some thought: Maybe at POINT in log, m1 got placed in DLQ, so consumer is free to read and process next message m2

Thanks

JaidenAshmore commented 2 years ago

Hmm yeah I have been thinking about this and unsure what the best approach for this is as this is what the SQS FIFO queue allows us to do.

If you want to make sure that the second scenario cannot happen and it follows similar logic as scenario one, then I would recommend to set maximumMessagesInMessageGroup to be 1. In this case we would never pickup m1 and m2 at the same time and therefore would be practically the same as that first case.