film42 / sidekiq-rs

A port of sidekiq to rust using tokio
MIT License

Processor hogs the Redis connections #43

Closed spencewenski closed 2 months ago

spencewenski commented 2 months ago

When the Processor has more worker tasks than there are Redis connections available, the worker tasks hog the Redis connections and don't allow other tasks to acquire a connection. Specifically, this happens when ProcessorConfig#num_workers >= the pool's max_size (which defaults to 10). This can easily happen when running on a CPU with many cores, but it can also be reproduced by setting num_workers = 1 and max_size = 1.
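
A rough sketch of a setup that hits this (the Redis URL and queue name are placeholders, and the ProcessorConfig wiring is an assumption about the builder API rather than copied from the lib):

```rust
use sidekiq::{Processor, ProcessorConfig, RedisConnectionManager};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A pool with a single Redis connection; the rest of the app (e.g. the
    // Axum health check handler) would hold a clone of this same pool.
    let manager = RedisConnectionManager::new("redis://127.0.0.1/")?;
    let redis = bb8::Pool::builder().max_size(1).build(manager).await?;

    // A single worker task, so num_workers >= max_size.
    // (Assumption: ProcessorConfig exposes a num_workers builder method and
    // Processor accepts it via with_config; the real wiring may differ.)
    let config = ProcessorConfig::default().num_workers(1);
    let mut processor = Processor::new(redis, vec!["default".to_string()])
        .with_config(config);

    // While this runs, any other task waiting on the pool (the health check,
    // the stats publisher, the scheduled poller) times out, as in the logs below.
    processor.run().await;
    Ok(())
}
```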

I'm not sure exactly why this is happening; a cursory review of the code doesn't raise any obvious issues for me. I see that the fetch with brpop has a 2 second timeout, and I see it happening multiple times while another task is waiting for a connection, so maybe connections aren't acquired FIFO in bb8? Or maybe the connection isn't getting released properly after the brpop?

Sample logs (I added the fetch/brpop logs):

2024-04-24T16:32:26.984383Z  INFO http_request{http.request.method=GET http.route=/api/_health request_id=1ef25a35-e6ea-4acc-a522-001f97a9dc58}: roadster::controller::middleware::tracing: started processing request version=HTTP/1.1 url.path=/api/_health request_headers={"host": "localhost:3000", "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:125.0) Gecko/20100101 Firefox/125.0", "accept": "application/json, text/plain, */*", "accept-language": "en-US,en;q=0.5", "accept-encoding": "gzip, deflate, br", "connection": "keep-alive", "referer": "http://localhost:3000/api/_docs", "cookie": Sensitive, "sec-fetch-dest": "empty", "sec-fetch-mode": "cors", "sec-fetch-site": "same-origin", "dnt": "1", "sec-gpc": "1", "request-id": "1ef25a35-e6ea-4acc-a522-001f97a9dc58"}
2024-04-24T16:32:28.266262Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:28.268461Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:30.333330Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:30.335758Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:32.395541Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:32.397579Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:34.459856Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:34.462165Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:36.509729Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:36.511745Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:38.580146Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:38.582449Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:40.632981Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:40.635133Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:42.697636Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:42.699662Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:44.757017Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:44.758287Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:46.810364Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:46.812743Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:48.872670Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:48.874530Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:48.880517Z ERROR sidekiq::processor: Error publishing processor stats: TimedOut
2024-04-24T16:32:48.880524Z ERROR sidekiq::processor: Error in scheduled poller routine: TimedOut
2024-04-24T16:32:50.933948Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:50.934389Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:53.000805Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:53.003117Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:55.073877Z DEBUG process_one:process_one_tick_once:fetch: sidekiq::processor: fetch
2024-04-24T16:32:55.075800Z DEBUG process_one:process_one_tick_once:fetch:brpop: sidekiq::redis: brpop
2024-04-24T16:32:56.991295Z  INFO http_request{http.request.method=GET http.route=/api/_health request_id=1ef25a35-e6ea-4acc-a522-001f97a9dc58 http.response.status_code=200}: tower_http::trace::on_response: finished processing request latency=30006 ms status=200 response_headers={"content-type": "application/json", "content-length": "103", "request-id": "1ef25a35-e6ea-4acc-a522-001f97a9dc58"}

For reference, the http request handler is an Axum handler defined here.

The workaround for apps is pretty simple -- just ensure that max_size is at least one greater than num_workers. However, if an app's num_workers is large (e.g. 100+), having that many Redis connections open would be wasteful and could potentially cause operational issues.
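
In code terms, the workaround is just a matter of how the bb8 pool is sized relative to the worker count -- a minimal sketch, assuming the pool is built with bb8's builder as in the lib's README (URL is a placeholder):

```rust
// Workaround sketch: size the pool one larger than the worker count so
// non-worker tasks can still check a connection out.
async fn build_pool(
    num_workers: u32,
) -> Result<bb8::Pool<sidekiq::RedisConnectionManager>, Box<dyn std::error::Error>> {
    let manager = sidekiq::RedisConnectionManager::new("redis://127.0.0.1/")?;
    let pool = bb8::Pool::builder()
        .max_size(num_workers + 1)
        .build(manager)
        .await?;
    Ok(pool)
}
```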

I plan to dig into this this week, but if anyone has ideas about what's going on, that would be great too!

spencewenski commented 2 months ago

TBD: Does this repro when there are sidekiq jobs available in redis, or only when no jobs are available?

spencewenski commented 2 months ago

I think I understand what's happening. There are two cases of interest:

  1. When there is a job in the queue, the Processor awaits on a worker to process the job, allowing tokio's task scheduler to wake up the next task waiting for a connection from the pool.
  2. When there are no jobs available, there are no awaits between when brpop returns and when the next connection is acquired from the pool. This means tokio's task scheduler doesn't have a chance to switch to another task that's waiting for a connection.

The main distinction between the two cases is the absence of an await when there are no jobs in the queue. We can resolve the connection hogging by adding a tokio::task::yield_now().await in the case where there is no actual job to handle. This allows tokio's task scheduler to wake up a different task that's waiting for a connection.
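
As a sketch of the shape of the fix (the fetch_job/run_job helpers here are hypothetical stand-ins for the Processor internals, not the lib's actual code):

```rust
use bb8::{ManageConnection, Pool};

// Hypothetical stand-ins for the Processor internals: fetch_job checks a
// connection out of the pool, runs BRPOP with its 2s timeout, and returns the
// connection to the pool before resolving; run_job executes a fetched job.
struct Job;
async fn fetch_job<M: ManageConnection>(_pool: &Pool<M>) -> Option<Job> { None }
async fn run_job(_job: Job) {}

// Shape of the fix: yield back to tokio whenever brpop comes back empty.
async fn worker_loop<M: ManageConnection>(pool: Pool<M>) {
    loop {
        match fetch_job(&pool).await {
            // A job was fetched: running it awaits, which already gives the
            // scheduler a chance to wake other tasks waiting for a connection.
            Some(job) => run_job(job).await,
            // No job: without an await here the loop re-acquires a connection
            // immediately and starves every other task waiting on the pool.
            None => tokio::task::yield_now().await,
        }
    }
}
```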

TBD: Does this repro when there are sidekiq jobs available in redis, or only when no jobs are available?

The issue does not repro when there are jobs available in the queue.

film42 commented 2 months ago

Nice find! I typed a comment about brpop stealing things (but forgot to submit it); sounds like you figured that part out.

What are your thoughts on switching to a dedicated server connection pool?

spencewenski commented 2 months ago

What are your thoughts on switching to a dedicated server connection pool?

Do you mean having a separate pool for pushing to the queue vs pulling, or something else?

I actually created an issue yesterday to do the former in my project.

film42 commented 2 months ago

Exactly. It probably wouldn't be a code change to this lib -- maybe just a best practices section in the README.

spencewenski commented 2 months ago

Yeah, sounds good. I think I agree that it wouldn't be a change in this lib -- the app would still need to manage the push pool. I could see adding a builder utility that constructs the Processor along with two separate pools, to make it obvious to consumers that separate pools are recommended, but starting with a best-practice recommendation in the README sounds like a good enough first step for now.
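
For illustration, the app-side version of the two-pool setup might look roughly like this (URL, queue name, and pool sizes are placeholders, and the wiring is from memory, so treat it as approximate):

```rust
use sidekiq::{Processor, RedisConnectionManager};

// Sketch of the "separate pools" practice: one pool dedicated to the
// Processor's fetch loop, another for enqueuing jobs from request handlers.
async fn build_processor_and_push_pool(
) -> Result<(Processor, bb8::Pool<RedisConnectionManager>), Box<dyn std::error::Error>> {
    // Pool used only by the Processor's worker tasks.
    let fetch_manager = RedisConnectionManager::new("redis://127.0.0.1/")?;
    let fetch_pool = bb8::Pool::builder()
        .max_size(11) // e.g. num_workers + 1
        .build(fetch_manager)
        .await?;

    // Separate pool for pushing jobs, sized for enqueue traffic and
    // independent of the worker count.
    let push_manager = RedisConnectionManager::new("redis://127.0.0.1/")?;
    let push_pool = bb8::Pool::builder()
        .max_size(5)
        .build(push_manager)
        .await?;

    // The Processor only ever sees fetch_pool; anything that pushes jobs
    // (e.g. Axum handlers) holds on to push_pool instead.
    let processor = Processor::new(fetch_pool, vec!["default".to_string()]);
    Ok((processor, push_pool))
}
```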