Open wmvsilva opened 3 months ago
Hey @wmvsilva, thanks for the detailed investigation.
The scenario you described makes sense, and indeed it would prevent further polling until the message is done processing.
We could look into enabling partial polls for LTM, e.g. if one permit is being used, we'd try to acquire all permits, and if that fails, we could try to acquire 9 permits to do a partial poll.
The main issue I see is that it's not so simple to orchestrate this with the release of the permits afterwards.
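A minimal sketch of the partial-poll idea, assuming a plain java.util.concurrent.Semaphore stands in for the handler's permit pool. The class and method names (PartialPollSketch, requestForLowThroughputPoll, releaseUnused) are hypothetical and not part of Spring Cloud AWS. Instead of an explicit second attempt at 9 permits, it uses drainPermits() to take whatever is free:

```java
import java.util.concurrent.Semaphore;

/** Hypothetical illustration only; not the library's implementation. */
final class PartialPollSketch {

    private final Semaphore semaphore;
    private final int totalPermits;

    PartialPollSketch(int totalPermits) {
        this.totalPermits = totalPermits;
        this.semaphore = new Semaphore(totalPermits);
    }

    /** Simulates a message that is still being processed and holds one permit. */
    void startProcessingOneMessage() {
        semaphore.acquireUninterruptibly();
    }

    /** Returns the permit held by a message that has finished processing. */
    void finishProcessingOneMessage() {
        semaphore.release();
    }

    /**
     * Tries to take every permit first. If some are in use, takes whatever is
     * currently free instead. The return value is the number of permits now
     * held for this poll, i.e. the largest batch that may be requested (may be 0).
     */
    int requestForLowThroughputPoll() {
        if (semaphore.tryAcquire(totalPermits)) {
            return totalPermits; // full poll
        }
        return semaphore.drainPermits(); // partial poll
    }

    /** After a poll, hands back the permits that no received message is using. */
    void releaseUnused(int acquired, int received) {
        if (received < 0 || received > acquired) {
            throw new IllegalArgumentException("received must be between 0 and acquired");
        }
        semaphore.release(acquired - received);
    }

    int availablePermits() {
        return semaphore.availablePermits();
    }

    public static void main(String[] args) {
        PartialPollSketch sketch = new PartialPollSketch(10);
        sketch.startProcessingOneMessage();                  // one long-running task
        int acquired = sketch.requestForLowThroughputPoll(); // 9: partial poll
        System.out.println("acquired = " + acquired);
        sketch.releaseUnused(acquired, 4);                   // poll returned 4 messages
        System.out.println("available = " + sketch.availablePermits()); // 5
    }
}
```

The release side is where the bookkeeping gets harder: the caller has to remember how many permits a given poll actually acquired, since that number is no longer always the total.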
I'll need to give this some more thought - do you have any ideas?
Thanks.
Type: Bug
Component: SQS
Version: 3.0.3
Config:
Describe the bug
Bug Description
Hi team,
My application uses @SqsListener to process messages from an SQS queue. Occasionally, it experiences an issue where no SQS messages are received or processed for several minutes, despite thousands of messages being present in the queue. I enabled DEBUG logs for io.awspring.cloud.sqs to investigate this problem.
The last time the issue occurred, I observed the following log every 10 seconds (but no polling logs):
Additionally, I could see logs in my application indicating that a single SQS message was actively being processed. As soon as that task finished processing, polling started up again and there were plenty of logs like:
Bug Scenario
After looking through the awspring source code, I believe the following sequence of events occurs:
CurrentThroughputMode.HIGH mode
SemaphoreBackPressureHandler to enter CurrentThroughputMode.LOW mode.
AbstractPollingMessageSource.pollAndEmitMessages() loop, when SemaphoreBackPressureHandler.requestInLowThroughputMode() is run, it tries to acquire all 121 permits. However, one permit remains unavailable due to the long-running SQS task that started earlier. Because requestInLowThroughputMode cannot acquire all permits, it acquires none, resulting in no SQS polling.
For step 5, I see the following logs:
I believe this happens because SemaphoreBackPressureHandler.requestInLowThroughputMode() uses the following code. With this approach, LOW mode can only acquire permits if the full totalPermits are available. While this prevents parallel SQS polling in LOW mode, it also prevents any polling if there are ongoing SQS tasks consuming permits. With fast processing of SQS messages, this would not be noticeable, but I occasionally have SQS tasks that last several minutes.
Expectation
I would expect that if my application is handling an SQS message that takes several minutes to process, and SemaphoreBackPressureHandler is in LOW mode, then SqsListener would continue to poll the queue for up to 10 messages every 10 seconds.
Workaround
As a workaround, I am using backPressureMode=FIXED_HIGH_THROUGHPUT to prevent my application from entering LOW mode. However, this results in additional polling.
Please let me know if there is any additional information I can provide or if any of my assumptions are incorrect.
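For reference, the all-or-nothing behaviour described in the bug scenario can be reproduced outside the library with a plain java.util.concurrent.Semaphore. This is only a sketch of my understanding: AllOrNothingDemo and canPoll are hypothetical names, and the total of 10 permits is an assumption chosen to keep the example small.

```java
import java.util.concurrent.Semaphore;

/** Hypothetical reproduction; not code from Spring Cloud AWS. */
final class AllOrNothingDemo {

    /**
     * Returns true only if every permit can be taken at once. The permits are
     * handed straight back, because this demo only checks whether a poll would
     * have been allowed.
     */
    static boolean canPoll(Semaphore permits, int totalPermits) {
        boolean acquired = permits.tryAcquire(totalPermits);
        if (acquired) {
            permits.release(totalPermits);
        }
        return acquired;
    }

    public static void main(String[] args) {
        int totalPermits = 10; // assumed value for illustration
        Semaphore permits = new Semaphore(totalPermits);

        permits.acquireUninterruptibly(); // a long-running task holds one permit
        System.out.println("poll allowed while task runs: " + canPoll(permits, totalPermits)); // false

        permits.release(); // the long-running task finishes
        System.out.println("poll allowed after task ends: " + canPoll(permits, totalPermits)); // true
    }
}
```

With a single permit held, the request for all permits fails even though 9 are free, which matches the absence of polling logs while the long task was running.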