freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

Ready Mix scraper ate our memory #4077

Open mlissner opened 4 months ago

mlissner commented 4 months ago

Last night starting around 5:30PT, our memory started climbing:

[image: memory usage graph climbing]

I looked in the bulk celery worker logs and found lots of tasks getting retried. I think that happens when we enqueue too many tasks or when they fail? I forget, and I wasn't able to figure out what went wrong, but it's definitely from the ready mix project.

To fix it, I looked in Redis and found two really huge keys, unacked and unacked_index. For better or worse, I stopped the script and deleted the keys, and that fixed the memory, as shown above.
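
For reference, a rough way to check how big those keys have gotten (a sketch with redis-py; unacked and unacked_index are kombu's default key names, and the connection helper and alias below are borrowed from later in this thread, so the broker may live under a different alias):

from cl.lib.redis_utils import get_redis_interface

# Connect to Redis; "CACHE" is the alias used elsewhere in this thread and may
# not be the alias the Celery broker uses in production.
r = get_redis_interface("CACHE")

# kombu keeps fetched-but-unacknowledged messages in a hash and a sorted set.
print("unacked entries:", r.hlen("unacked"))
print("unacked_index entries:", r.zcard("unacked_index"))

# Approximate memory used by each key, in bytes (Redis >= 4.0).
print("unacked bytes:", r.memory_usage("unacked"))
print("unacked_index bytes:", r.memory_usage("unacked_index"))

# The emergency fix described above boils down to deleting both keys:
# r.delete("unacked", "unacked_index")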

I restarted the ready-mix script, and it seems to be going along nicely again now.


Thoughts:

The script is running now. Hopefully it won't crash things, but I'll keep my laptop handy in case it does.

mlissner commented 4 months ago

One other thing: Some of the retries were throttled for like 800,000 seconds!

mlissner commented 4 months ago

OK, so after running for a few more hours, the script got stuck enqueuing more and more items from hib, so I killed it. Memory is climbing again. Hopefully killing the script will bring it back down:

[image: memory usage graph]

albertisfu commented 4 months ago

I reviewed the issue and tried to reproduce it locally to find the cause of the Redis memory exhaustion.

Considering the following notes:

I looked in redis and found two really huge keys, unacked and unacked_index

One other thing: Some of the retries were throttled for like 800,000 seconds!

I was able to reproduce a problem that looks very similar to the one described above by:

Reducing the number of scraping courts to 1 and using 2 Celery workers.

Tweaking make_docket_by_iquery to run with a delay of 1 second, simulating the task duration (see the sketch after this list).

Running the command.
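
For concreteness, the delay tweak amounted to something like this (a hypothetical stand-in, not the real body or signature of make_docket_by_iquery):

import time

def simulated_make_docket_by_iquery(court_id, pacer_case_id):
    # Sleep for one second so each task takes roughly as long as a real
    # iquery fetch would, per the reproduction setup above.
    time.sleep(1)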

Immediately, I started receiving the message:

Throttling task cl.corpus_importer.tasks.make_docket_by_iquery (%s) via decorator for %ss

The number of seconds the tasks were throttled for grew very quickly.

The issue is the following:

The command uses the throttle.maybe_wait() mechanism to throttle the number of tasks sent to the Celery queue. However, if workers process tasks from the queue too quickly, new tasks are sent to refill the queue just as fast. This happens because the workers are re-scheduling tasks via the throttle_task decorator instead of actually processing them.

Re-scheduling is a fast process where the worker picks up the task from the queue, computes the countdown, and re-schedules the task. The task is then moved to the unacked and unacked_index keys, reducing the size of the main queue. When the queue is short enough, it is refilled, causing the unacked key to grow rapidly due to tasks being re-scheduled with large countdowns.
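
To make the re-scheduling path concrete, here is a minimal sketch of the pattern (not CourtListener's actual throttle_task code; the busy check, task name, and backoff shape are placeholders):

import random
from celery import shared_task

def court_is_busy(court_id):
    """Placeholder for the real Redis-based throttle check."""
    return random.random() < 0.5

def do_real_work(court_id):
    """Placeholder for the real (slow) scraping work."""
    pass

@shared_task(bind=True, max_retries=None)
def throttled_court_task(self, court_id):
    # If another task for this court is already running, don't do the work:
    # compute a countdown and re-schedule. This is nearly instant for the worker,
    # but with a Redis broker the retried message sits in the unacked and
    # unacked_index keys until its ETA expires, so long countdowns pile up there.
    if court_is_busy(court_id):
        countdown = 2 ** min(self.request.retries, 10)  # assumed backoff shape
        raise self.retry(countdown=countdown)
    do_real_work(court_id)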

Processing tasks for 1 court with 2 workers, where each task takes 1 second to complete, helped me understand what's happening.

Hypothesis for Production Issue:

This issue can only occur if more than one task for a given court is executed simultaneously, which happens often if many workers are executing tasks and the tasks take a short time to complete. Despite this, I couldn't reproduce the issue with ~230 courts, the number I believe is processed in production.

Considering that the min value in the CeleryThrottle instance is 2 times the number of courts and max is 2 times min, approximately 920 tasks are sent to the queue in a single cycle. Thus, a court_id can be in the queue 4 times, making it possible for the same court_id to be processed by two different workers simultaneously. However, this shouldn't be common. For tasks for a court_id to keep being throttled with a growing countdown, this would have to happen continuously for the same court_id, which is statistically unlikely. Therefore, other factors might be influencing this behavior.
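
Spelling out that sizing arithmetic (illustrative variable names, using the ~230-court figure and the 2x multipliers described above):

courts = 230
throttle_min = 2 * courts        # 460: the queue is refilled once it drops this low
throttle_max = 2 * throttle_min  # 920: the queue is filled up to roughly this many tasks
print(throttle_min, throttle_max, throttle_max // courts)  # 460 920 4
# So a single court_id can sit in the queue up to 4 times at once.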

Maybe the following data can help us better understand what happened:

from cl.lib.redis_utils import get_redis_interface

# Count how many courts have an entry in the iquery_status hash.
r = get_redis_interface("CACHE")
court_ids = r.hkeys("iquery_status")
print(len(court_ids))

mlissner commented 4 months ago

I think there's a simpler answer for prod: As the loop processes and completes courts, it removes them from the loop. So you start with ~200 courts, but when you get to the end, there's only 1.
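
A rough sanity check of that explanation, assuming two workers each hold a task for a uniformly random remaining court (a simplification of the real queue behavior):

# Chance that two concurrently running tasks target the same court.
for remaining in (230, 50, 10, 2, 1):
    print(f"{remaining:>3} courts left -> P(same court) = {1 / remaining:.0%}")
# Under 1% with 230 courts left, 50% with 2 left, 100% with 1 left: at the tail
# of the loop nearly every concurrent task collides, gets re-scheduled, and the
# countdowns pile up.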

albertisfu commented 4 months ago

I think there's a simpler answer for prod: As the loop processes and completes courts, it removes them from the loop. So you start with ~200 courts, but when you get to the end, there's only 1.

Oh yeah, this makes complete sense. As more courts are completed, it becomes much more likely that the remaining tasks get throttled.