SQS Poller pulls in messages even if `max_number_of_messages` = 1 and previous messages are invisible

aws / aws-sdk-rails

Official repository for the aws-sdk-rails gem, which integrates the AWS SDK for Ruby with Ruby on Rails.

Other

596 stars 62 forks source link

SQS Poller pulls in messages even if `max_number_of_messages` = 1 and previous messages are invisible #95

Closed jshah closed 1 year ago

jshah commented 1 year ago

I don't have any logs to provide but I am seeing messages that are enqueued after existing messages become "not visible" get polled by SQS workers and not commence the work. It seems like they are polled and sit around waiting for something (existing messages maybe?).

I would expect that workers do not pick up any new messages while the current one is being worked on (or in other words, polled). If they are polled, they start work immediately and do not hold on to other messages. Can I confirm that this is the expected behavior?

You can see that in the picture below, we had lots of messages that were "not visible" and we queued more messages when the "message visible" was 0. That message gets polled by a worker immediately but never actually starts work. I would have expected one of the available workers to start work on it while we had hundreds of invisible messages.

alextwoods commented 1 year ago

I would expect that workers do not pick up any new messages while the current one is being worked on (or in other words, polled). If they are polled, they start work immediately and do not hold on to other messages. Can I confirm that this is the expected behavior?

This is generally true, but depends on your code and how you are managing visibility timeouts/message deletion. If messages are not being deleted after processing, then they will remain non-visible until the visibility timeout is reached. The visibility timeout (unless set in your code) will be whatever is configured on the queue - so even if a worker has finished processing a message, but hasn't deleted it, it will remain non-visible and the worker will move onto a new message.

Can you post some of your code?

Is this new behavior you are seeing (ie, a change in behavior with existing code)?

jshah commented 1 year ago

Can you post some of your code?

What do you mean by this? We are using the SQS adapter in rails (:amazon_sqs) so we just enqueue jobs and the underlying workers process them.

The command we use to start the worker is: bundle exec aws_sqs_active_job --queue privacy_request_data_operations --threads 1 --max_messages 1 --visibility_timeout 10800

We've seen this behavior consistently just the first time we're logging this bug here.

Gemfile versions:

aws-sdk-rails (3.6.1)
aws-sdk-sqs (1.48.0)

alextwoods commented 1 year ago

I'm sorry, I misunderstood the original issue and didn't get that this was using the SQS ActiveJob adapter -I had assumed you were using the SQSPoller.

My answer to your question was incorrect - For the SQSActiveJob::Poller the behavior is a bit different. The poller runs in the main thread and relies on a ThreadPoolExecutor to run jobs. The ThreadPoolExecutor works by queueing up work (when there are no available threads) and will queue until the backpressure is reached (default is 10). After that it will fallback to having the main poller thread run the job. What this means is that up to 10 messages will be picked up and queued while worker threads are not available.

If thats not desirable - you can adjust the backpressure downwards (but setting it to zero means unlimited).

jshah commented 1 year ago

thanks - that makes sense! I will update our backpressure to 1 along with max_messages to 1 to ensure only 1 message gets pulled/worked on at a time.

How does max_messages and backpressure interact together? max_messages is just the number of messages we pull down and backpressure is the number of messages we queue up?

Can max_messages be greater than backpressure? Would that mean the leftover messages get returned back to the queue with their visibility timeout?

alextwoods commented 1 year ago

max_messages is the (max) number of messages that will get polled in a single batch. backpressure is the maximum number of messages that will get added to the worker queue before the fallback_strategy is applied - our default is :caller_runs which is documented as: "The work will be executed in the thread of the caller, instead of being given to another thread in the pool". For us what that means is that the polling thread will run the job and won't pull for more messages until that job is complete. If max_messages is larger than 1, that could potentially result in messages sitting in the poller, but not being queued up/run and set to visible. So I think this is fairly reasonable behavior as long as max_messages is 1 here w/ limited backpressure.

But long term, we may want to refactor this a bit and use abort for the fallback_strategy, catch the exception in the poller and make the unprocessed message(s) in the batch visibile again, and then wait until the executor has capacity before polling SQS again.

mullermp commented 1 year ago

Did the changes work for you? Closing this, please re-open if there's questions or issues!