Closed spencewenski closed 2 months ago
TBD: Does this repro when there are sidekiq jobs available in redis, or only when no jobs are available?
I think I understand what's happening. There are two cases of interest:

1. There are jobs in the queue: the `Processor` `await`s on a worker to process the job, allowing tokio's task scheduler to wake up the next task waiting for a connection from the pool.
2. There are no jobs in the queue: there are no `await`s between when `brpop` returns and when the next connection is acquired from the pool. This means tokio's task scheduler doesn't have a chance to switch to another task that's waiting for a connection.

The main distinction between the two cases is the absence of an `await` when there are no jobs in the queue. We can resolve the connection hogging by adding a `tokio::task::yield_now().await` in the case where there is no actual job to handle. This allows tokio's task scheduler to wake up a different task that's waiting for a connection.
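The fix described above can be sketched roughly as follows. This is a hedged sketch, not the lib's actual code: the loop shape and the `fetch_job`/`process` helpers are stand-ins for the lib's internals, while `tokio::task::yield_now` and the bb8 pool are real APIs.

```rust
use bb8_redis::{bb8, RedisConnectionManager};

// Hypothetical shape of a worker's fetch loop. `fetch_job` (the BRPOP
// wrapper) and `process` are placeholders for the lib's internals.
async fn worker_loop(pool: bb8::Pool<RedisConnectionManager>) {
    loop {
        let mut conn = pool.get().await.expect("failed to get connection");
        // BRPOP with a 2 second timeout; None means no job arrived.
        let maybe_job = fetch_job(&mut conn).await;
        drop(conn); // return the connection to the pool either way

        match maybe_job {
            // Awaiting the handler is itself a yield point, so other
            // tasks waiting on the pool get a chance to run.
            Some(job) => process(job).await,
            // No job: without an await here, this task loops straight
            // back to `pool.get()` and hogs the connection. Yielding
            // lets the scheduler wake another waiter first.
            None => tokio::task::yield_now().await,
        }
    }
}
```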
> TBD: Does this repro when there are sidekiq jobs available in redis, or only when no jobs are available?
The issue does not repro when there are jobs available in the queue.
Nice find! I typed a comment about `brpop` stealing connections (but forgot to submit); sounds like you figured that part out.
What are your thoughts on switching to a dedicated server connection pool?
> What are your thoughts on switching to a dedicated server connection pool?
Do you mean having a separate pool for pushing to the queue vs pulling, or something else?
I actually created an issue yesterday to do the former in my project.
Exactly. It probably wouldn't be a code change to this lib, but maybe a best practices section in the README.
Yeah, sounds good. I think I agree that it wouldn’t be a change in this lib — the app would still need to manage the push pool. I could see maybe adding a builder utility that builds the processor and two separate pools to make it obvious to the consumer that separate pools are recommended, but starting with a best practice recommendation in the readme sounds like a good enough first step for now.
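A sketch of what that two-pool setup could look like on the consuming app's side, assuming bb8 via `bb8_redis` (the `build_pools` helper and the pool sizes are illustrative, not part of this lib):

```rust
use bb8_redis::{bb8, redis, RedisConnectionManager};

type RedisPool = bb8::Pool<RedisConnectionManager>;

// Hypothetical helper: one pool for the Processor's workers (fetching),
// one for the app's producers (pushing), so pushes never compete with
// the workers for a connection.
async fn build_pools(
    url: &str,
    num_workers: u32,
) -> Result<(RedisPool, RedisPool), redis::RedisError> {
    // Fetch pool: one connection per worker, plus one spare.
    let fetch_pool = bb8::Pool::builder()
        .max_size(num_workers + 1)
        .build(RedisConnectionManager::new(url)?)
        .await?;

    // Push pool: small and independent; e.g. Axum handlers enqueue here.
    let push_pool = bb8::Pool::builder()
        .max_size(5)
        .build(RedisConnectionManager::new(url)?)
        .await?;

    Ok((fetch_pool, push_pool))
}
```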
When the `Processor` has more worker tasks than there are Redis connections available, the worker tasks hog the Redis connections and don't allow other tasks to acquire a connection. Specifically, this happens when `ProcessorConfig#num_workers` >= the pool's `max_size` (which defaults to `10`). This can easily happen if running on a CPU with many cores, but can also be reproduced by setting `num_workers = 1` and `max_size = 1`.

I'm not sure exactly why this is happening; a cursory review of the code doesn't raise any obvious issues to me. I see that the fetch with `brpop` has a 2 second timeout, and I see it happening multiple times while another task is waiting for a connection, so maybe connections aren't acquired FIFO in bb8? Or maybe the connection isn't getting released properly after the `brpop`?

Sample logs (I added the fetch/brpop logs):
For reference, the http request handler is an Axum handler defined here.
The workaround for apps is pretty simple -- just ensure that `max_size` is at least 1 bigger than `num_workers`. However, if an app's `num_workers` is large (e.g. 100+), having that many Redis connections open would be wasteful and could potentially cause operational issues.

I plan to dig into this this week, but if anyone has ideas of what's going on that would be great too!
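For reference, the workaround above as a sketch on the pool-construction side, assuming bb8 via `bb8_redis` (the `num_workers` value is illustrative, and the `?` operator assumes a surrounding function returning a compatible `Result`):

```rust
use bb8_redis::{bb8, RedisConnectionManager};

// Keep max_size strictly greater than num_workers so at least one
// connection stays free for non-worker tasks (e.g. an HTTP handler
// pushing jobs onto the queue).
let num_workers: u32 = 10; // must match ProcessorConfig#num_workers
let pool = bb8::Pool::builder()
    .max_size(num_workers + 1)
    .build(RedisConnectionManager::new("redis://127.0.0.1")?)
    .await?;
```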