Closed jshah closed 1 year ago
I would expect that workers do not pick up any new messages while the current one is being worked on (or in other words, polled). If they are polled, they start work immediately and do not hold on to other messages. Can I confirm that this is the expected behavior?
This is generally true, but depends on your code and how you are managing visibility timeouts/message deletion. If messages are not being deleted after processing, then they will remain non-visible until the visibility timeout is reached. The visibility timeout (unless set in your code) will be whatever is configured on the queue - so even if a worker has finished processing a message, but hasn't deleted it, it will remain non-visible and the worker will move onto a new message.
Can you post some of your code?
Is this new behavior you are seeing (ie, a change in behavior with existing code)?
Can you post some of your code?
What do you mean by this? We are using the SQS adapter in rails (:amazon_sqs
) so we just enqueue jobs and the underlying workers process them.
The command we use to start the worker is:
bundle exec aws_sqs_active_job --queue privacy_request_data_operations --threads 1 --max_messages 1 --visibility_timeout 10800
We've seen this behavior consistently just the first time we're logging this bug here.
Gemfile versions:
aws-sdk-rails (3.6.1)
aws-sdk-sqs (1.48.0)
I'm sorry, I misunderstood the original issue and didn't get that this was using the SQS ActiveJob adapter -I had assumed you were using the SQSPoller.
My answer to your question was incorrect - For the SQSActiveJob::Poller the behavior is a bit different. The poller runs in the main thread and relies on a ThreadPoolExecutor to run jobs. The ThreadPoolExecutor works by queueing up work (when there are no available threads) and will queue until the backpressure is reached (default is 10). After that it will fallback to having the main poller thread run the job. What this means is that up to 10 messages will be picked up and queued while worker threads are not available.
If thats not desirable - you can adjust the backpressure downwards (but setting it to zero means unlimited).
thanks - that makes sense! I will update our backpressure
to 1 along with max_messages
to 1 to ensure only 1 message gets pulled/worked on at a time.
How does max_messages
and backpressure
interact together? max_messages
is just the number of messages we pull down and backpressure
is the number of messages we queue up?
Can max_messages
be greater than backpressure
? Would that mean the leftover messages get returned back to the queue with their visibility timeout?
max_messages
is the (max) number of messages that will get polled in a single batch. backpressure
is the maximum number of messages that will get added to the worker queue before the fallback_strategy
is applied - our default is :caller_runs which is documented as: "The work will be executed in the thread of the caller, instead of being given to another thread in the pool". For us what that means is that the polling thread will run the job and won't pull for more messages until that job is complete. If max_messages
is larger than 1, that could potentially result in messages sitting in the poller, but not being queued up/run and set to visible. So I think this is fairly reasonable behavior as long as max_messages
is 1 here w/ limited backpressure.
But long term, we may want to refactor this a bit and use abort
for the fallback_strategy
, catch the exception in the poller and make the unprocessed message(s) in the batch visibile again, and then wait until the executor has capacity before polling SQS again.
Did the changes work for you? Closing this, please re-open if there's questions or issues!
I don't have any logs to provide but I am seeing messages that are enqueued after existing messages become "not visible" get polled by SQS workers and not commence the work. It seems like they are polled and sit around waiting for something (existing messages maybe?).
I would expect that workers do not pick up any new messages while the current one is being worked on (or in other words, polled). If they are polled, they start work immediately and do not hold on to other messages. Can I confirm that this is the expected behavior?
You can see that in the picture below, we had lots of messages that were "not visible" and we queued more messages when the "message visible" was 0. That message gets polled by a worker immediately but never actually starts work. I would have expected one of the available workers to start work on it while we had hundreds of invisible messages.