Open mlissner opened 4 months ago
One other thing: Some of the retries were throttled for like 800,000 seconds!
OK, so after running for a few more hours, the script was stuck enqueuing more and more items from hib, so I killed it. The memory is climbing. Hopefully killing the script will bring the memory back down:
I reviewed the issue and tried to reproduce it locally to find the cause of the Redis memory exhaustion.
Considering the following notes:
> I looked in redis and found two really huge keys, `unacked` and `unacked_index`.

> One other thing: Some of the retries were throttled for like 800,000 seconds!
I was able to reproduce a problem that looks pretty similar to the one described above by:

- Reducing the number of scraping courts to 1 and using 2 Celery workers.
- Tweaking `make_docket_by_iquery` to run with a delay of 1 second, simulating the task duration.
- Running the command.
Immediately, I started receiving the message:
```
Throttling task cl.corpus_importer.tasks.make_docket_by_iquery (%s) via decorator for %ss
```
The number of seconds the tasks were throttled for grew very quickly.
The issue is the following:
The command uses the `throttle.maybe_wait()` mechanism to throttle the number of tasks sent to the Celery queue. However, if workers process tasks from the queue too quickly, new tasks are sent to refill the queue just as fast. This happens because workers were re-scheduling tasks via the `throttle_task` decorator instead of processing them.

Re-scheduling is a fast process: the worker picks up the task from the queue, computes the countdown, and re-schedules the task. The task is then moved to the `unacked` and `unacked_index` keys, reducing the size of the main queue. When the queue is short enough, it is refilled, causing the `unacked` key to grow rapidly due to tasks being re-scheduled with large countdowns.
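The re-scheduling path can be sketched as a minimal decorator. This is my own illustration of the general pattern, not the actual cl decorator: `FAKE_REDIS`, the tuple return values, and the booking of ever-later slots are all invented for demonstration. A real implementation keeps the key in Redis with a TTL and calls `task.retry(countdown=...)`:

```python
import time
from functools import wraps

# Hypothetical stand-in for Redis: throttle key -> expiry timestamp.
FAKE_REDIS = {}

def throttle_task(period_seconds):
    """Allow roughly one execution per period_seconds per court_id."""
    def decorator(func):
        @wraps(func)
        def wrapper(court_id, *args, **kwargs):
            key = f"throttle:{court_id}"
            now = time.monotonic()
            expiry = FAKE_REDIS.get(key, 0.0)
            if now < expiry:
                # Key still alive: don't run the task, re-schedule it.
                # Each throttled task books the next free slot, so the
                # countdown grows when tasks pile up for one court.
                countdown = expiry - now
                FAKE_REDIS[key] = expiry + period_seconds
                return ("rescheduled", countdown)
            FAKE_REDIS[key] = now + period_seconds
            return ("executed", func(court_id, *args, **kwargs))
        return wrapper
    return decorator

@throttle_task(1.0)
def fake_iquery_task(court_id):
    return court_id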
Processing tasks for 1 court with 2 workers, where each task takes 1 second to complete, helped me understand what's happening:

- The `throttle_task` rate is defined as 1/s. When the other worker picks up a task for the same court, the `throttle_key` has not expired yet, so the task is throttled by 1 second.
- Once the `throttle_key` has expired, one task will be executed, and the other one throttled by 2 seconds. Since the worker is immediately free, it picks up another task from the queue, which gets refilled. This process continues, causing the throttled countdown to grow rapidly to massive values.

This issue can only occur if more than one task for a given court is executed simultaneously, which happens often if many workers are executing tasks and the tasks take a short time to complete. Despite this, I couldn't reproduce the issue with ~230 courts, the number I believe is processed in production.
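The escalation can be reproduced with a tiny back-of-the-envelope simulation. This is my own model, not the project's code; it assumes re-scheduling takes ~10 ms of wall time while each throttled task books the court's next free 1-second slot:

```python
def max_throttle_countdown(num_tasks, period=1.0, resched_cost=0.01):
    """Worst case: every pulled task targets the same court and the
    workers are always free, so pulls happen ~resched_cost seconds
    apart while the court's next free slot moves period seconds
    further out per throttled task. Returns the largest countdown."""
    now = 0.0        # simulated wall clock
    next_slot = 0.0  # next time a task for this court may execute
    worst = 0.0
    for _ in range(num_tasks):
        worst = max(worst, next_slot - now)
        next_slot = max(next_slot, now) + period  # book the next slot
        now += resched_cost  # re-scheduling is nearly instant
    return worst

print(max_throttle_countdown(1_000))      # ~989s after only 1,000 pulls
print(max_throttle_countdown(1_000_000))  # ~990,000s: the same order of
                                          # magnitude as the ~800,000s seen
```

If tasks actually take the full period to run (`resched_cost=1.0`), the countdown never grows at all, which matches the observation that the problem only appears when workers churn through tasks faster than the rate limit.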
Considering the `min` value in the `CeleryThrottle` instance is 2 times the number of courts and `max` is 2 times `min`, approximately 920 tasks are sent to the queue in a single cycle. Thus, a court_id can be in the queue 4 times, making it possible for a court_id to be processed more than once by two different workers simultaneously. However, this shouldn't be common. For tasks for a court_id to keep being throttled with a growing countdown, this must happen continuously for the same court_id, which is statistically unlikely. Therefore, other factors might be influencing this behavior:
- `PacerLoginException` or `RedisConnectionError` for some court_ids coinciding with regular tasks being executed, causing them to start throttling.
- `RequestException`, leading to task retries and allowing workers to pick up more tasks from the queue, increasing the likelihood of tasks for the same court being executed simultaneously.
- Something preventing the `throttle_key` from expiring as expected, causing all tasks for a court to get throttled.

Maybe the following data can help us better understand what happened:
```python
from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")
court_ids = r.hkeys("iquery_status")
print(len(court_ids))
```
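As a sanity check on the queue-sizing arithmetic above, assuming the ~230 production courts figure from the comment (an assumption, not a measurement):

```python
# CeleryThrottle sizing as described above, for ~230 courts.
courts = 230
throttle_min = 2 * courts          # min = 2x the number of courts
throttle_max = 2 * throttle_min    # max = 2x min
print(throttle_max)                # 920 tasks per refill cycle
print(throttle_max // courts)      # a court_id can appear up to 4 times
```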
I think there's a simpler answer for prod: As the loop processes and completes courts, it removes them from the loop. So you start with ~200 courts, but when you get to the end, there's only 1.
> I think there's a simpler answer for prod: As the loop processes and completes courts, it removes them from the loop. So you start with ~200 courts, but when you get to the end, there's only 1.
Oh yeah, this makes complete sense. As more courts are completed, it becomes increasingly likely that tasks for the remaining courts get throttled.
Last night starting around 5:30 PT, our memory started climbing:
I looked in the bulk celery worker logs and found lots of tasks getting retried. I think that happens when we enqueue too many or when they fail? I forget, and I wasn't able to figure out what went wrong, but it's definitely from the ready mix project.
To fix it, I looked in redis and found two really huge keys, `unacked` and `unacked_index`. For better or worse, I stopped the script and deleted the keys, and that fixed the memory, as above.

I restarted the ready-mix script, and it seems to be going along nicely again now.
Thoughts:
The script is running now. Hopefully it'll not crash things, but I'll be keeping my laptop handy in case it does.