Open mlissner opened 4 months ago
One other thing: Some of the retries were throttled for like 800,000 seconds!
OK, so after running for a few more hours, the script was stuck enqueuing more and more items from hib, so I killed it. The memory is climbing. Hopefully killing the script will bring the memory back down:
I reviewed the issue and tried to reproduce it locally to find the cause of the Redis memory exhaustion.
Considering the following notes:
> I looked in redis and found two really huge keys, `unacked` and `unacked_index`.

> One other thing: Some of the retries were throttled for like 800,000 seconds!
I was able to reproduce a problem that looks pretty similar to the one described above by:

- Reducing the number of scraping courts to 1 and using 2 Celery workers.
- Tweaking `make_docket_by_iquery` to run with a delay of 1 second, simulating the task duration.
- Running the command.
Immediately, I started receiving the message:
```
Throttling task cl.corpus_importer.tasks.make_docket_by_iquery (%s) via decorator for %ss
```
The number of seconds the tasks were throttled for grew very quickly.
The issue is the following:
The command uses the `throttle.maybe_wait()` mechanism to throttle the number of tasks sent to the Celery queue. However, if workers process tasks from the queue too quickly, new tasks are sent to refill the queue just as fast. This happens because workers were re-scheduling tasks via the `throttle_task` decorator instead of processing them.

Re-scheduling is a fast process: the worker picks up the task from the queue, computes the countdown, and re-schedules the task. The task is then moved to the `unacked` and `unacked_index` keys, reducing the size of the main queue. When the queue is short enough, it is refilled, causing the `unacked` key to grow rapidly due to tasks being re-scheduled with large countdowns.
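The re-scheduling path can be sketched as a minimal decorator. This is my own illustration of the general pattern, not the actual cl decorator: `FAKE_REDIS`, the tuple return values, and the booking of ever-later slots are all invented for demonstration. A real implementation keeps the key in Redis with a TTL and calls `task.retry(countdown=...)`:

```python
import time
from functools import wraps

# Hypothetical stand-in for Redis: throttle key -> expiry timestamp.
FAKE_REDIS = {}

def throttle_task(period_seconds):
    """Allow roughly one execution per period_seconds per court_id."""
    def decorator(func):
        @wraps(func)
        def wrapper(court_id, *args, **kwargs):
            key = f"throttle:{court_id}"
            now = time.monotonic()
            expiry = FAKE_REDIS.get(key, 0.0)
            if now < expiry:
                # Key still alive: don't run the task, re-schedule it.
                # Each throttled task books the next free slot, so the
                # countdown grows when tasks pile up for one court.
                countdown = expiry - now
                FAKE_REDIS[key] = expiry + period_seconds
                return ("rescheduled", countdown)
            FAKE_REDIS[key] = now + period_seconds
            return ("executed", func(court_id, *args, **kwargs))
        return wrapper
    return decorator

@throttle_task(1.0)
def fake_iquery_task(court_id):
    return court_id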
Processing tasks for 1 court with 2 workers, where each task takes 1 second to complete, helped me understand what's happening:

- The `throttle_task` rate is defined as 1/s. When the other worker picks up a task for the same court, the `throttle_key` has not expired yet, so the task is throttled by 1 second.
- Once the `throttle_key` has expired, one task will be executed, and the other one throttled by 2 seconds. Since the worker is immediately free, it picks up another task from the queue, which gets refilled. This process continues, causing the throttled countdown to grow rapidly to massive values.

This issue can only occur if more than one task for a given court is executed simultaneously, which happens often if many workers are executing tasks and the tasks take a short time to complete. Despite this, I couldn't reproduce the issue with ~230 courts, the number I believe is processed in production.
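The escalation can be reproduced with a tiny back-of-the-envelope simulation. This is my own model, not the project's code; it assumes re-scheduling takes ~10 ms of wall time while each throttled task books the court's next free 1-second slot:

```python
def max_throttle_countdown(num_tasks, period=1.0, resched_cost=0.01):
    """Worst case: every pulled task targets the same court and the
    workers are always free, so pulls happen ~resched_cost seconds
    apart while the court's next free slot moves period seconds
    further out per throttled task. Returns the largest countdown."""
    now = 0.0        # simulated wall clock
    next_slot = 0.0  # next time a task for this court may execute
    worst = 0.0
    for _ in range(num_tasks):
        worst = max(worst, next_slot - now)
        next_slot = max(next_slot, now) + period  # book the next slot
        now += resched_cost  # re-scheduling is nearly instant
    return worst

print(max_throttle_countdown(1_000))      # ~989s after only 1,000 pulls
print(max_throttle_countdown(1_000_000))  # ~990,000s: the same order of
                                          # magnitude as the ~800,000s seen
```

If tasks actually take the full period to run (`resched_cost=1.0`), the countdown never grows at all, which matches the observation that the problem only appears when workers churn through tasks faster than the rate limit.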
Considering the `min` value in the `CeleryThrottle` instance is 2 times the number of courts and `max` is 2 times `min`, approximately 920 tasks are sent to the queue in a single cycle. Thus, a court_id can be in the queue 4 times, making it possible for a court_id to be processed more than once by two different workers simultaneously. However, this shouldn't be common. For tasks for a court_id to keep being throttled with a growing countdown, this must happen continuously for the same court_id, which is statistically unlikely. Therefore, other factors might be influencing this behavior:
- `PacerLoginException` or `RedisConnectionError` for some court_ids coinciding with regular tasks being executed, causing them to start throttling.
- `RequestException`, leading to task retries and allowing workers to pick up more tasks from the queue, increasing the likelihood of tasks for the same court being executed simultaneously.
- Something preventing the `throttle_key` from expiring as expected, causing all tasks for a court to get throttled.

Maybe the following data can help us better understand what happened:
```python
from cl.lib.redis_utils import get_redis_interface

r = get_redis_interface("CACHE")
court_ids = r.hkeys("iquery_status")
print(len(court_ids))
```
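As a sanity check on the queue-sizing arithmetic above, assuming the ~230 production courts figure from the comment (an assumption, not a measurement):

```python
# CeleryThrottle sizing as described above, for ~230 courts.
courts = 230
throttle_min = 2 * courts          # min = 2x the number of courts
throttle_max = 2 * throttle_min    # max = 2x min
print(throttle_max)                # 920 tasks per refill cycle
print(throttle_max // courts)      # a court_id can appear up to 4 times
```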
I think there's a simpler answer for prod: As the loop processes and completes courts, it removes them from the loop. So you start with ~200 courts, but when you get to the end, there's only 1.
> I think there's a simpler answer for prod: As the loop processes and completes courts, it removes them from the loop. So you start with ~200 courts, but when you get to the end, there's only 1.
Oh yeah, this makes complete sense. As more courts are completed, it becomes increasingly likely that tasks for the remaining courts get throttled.
Last night starting around 5:30 PT, our memory started climbing:
I looked in the bulk celery worker logs and found lots of tasks getting retried. I think that happens when we enqueue too many or when they fail? I forget, and I wasn't able to figure out what went wrong, but it's definitely from the ready mix project.
To fix it, I looked in redis and found two really huge keys, `unacked` and `unacked_index`. For better or worse, I stopped the script and deleted the keys, and that fixed the memory, as above.

I restarted the ready-mix script, and it seems to be going along nicely again now.
Thoughts:
The script is running now. Hopefully it'll not crash things, but I'll be keeping my laptop handy in case it does.