Investigate Celery hanging due to redis.exceptions.BusyLoadingError

WesleyAC commented 1 year ago

Celery recently hung — the worker was alive, but no tasks were active, and it was looping the following error:

2023-03-08T09:02:25.131490646Z [2023-03-08 09:02:25,130: ERROR/MainProcess] Task handler raised error: BusyLoadingError('Redis is loading the dataset in memory')
2023-03-08T09:02:25.131531398Z Traceback (most recent call last):
2023-03-08T09:02:25.131536841Z   File "/usr/local/lib/python3.9/site-packages/billiard/pool.py", line 362, in workloop
2023-03-08T09:02:25.131541733Z     result = (True, prepare_result(fun(*args, **kwargs)))
2023-03-08T09:02:25.131546199Z   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 649, in fast_trace_task
2023-03-08T09:02:25.131550950Z     R, I, T, Rstr = tasks[task].__trace__(
2023-03-08T09:02:25.131555363Z   File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/celery.py", line 181, in _inner
2023-03-08T09:02:25.131559910Z     return f(*args, **kwargs)
2023-03-08T09:02:25.131563835Z   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 572, in trace_task
2023-03-08T09:02:25.131568265Z     I, _, _, _ = on_error(task_request, exc, uuid)
2023-03-08T09:02:25.131572271Z   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 379, in on_error
2023-03-08T09:02:25.131576702Z     R = I.handle_error_state(
2023-03-08T09:02:25.131581122Z   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 178, in handle_error_state
2023-03-08T09:02:25.131585313Z     return {
2023-03-08T09:02:25.131589054Z   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 225, in handle_failure
2023-03-08T09:02:25.131593492Z     task.backend.mark_as_failure(
2023-03-08T09:02:25.131597611Z   File "/usr/local/lib/python3.9/site-packages/celery/backends/base.py", line 172, in mark_as_failure
2023-03-08T09:02:25.131602130Z     self.store_result(task_id, exc, state,
2023-03-08T09:02:25.131606713Z   File "/usr/local/lib/python3.9/site-packages/celery/backends/base.py", line 528, in store_result
2023-03-08T09:02:25.131611105Z     self._store_result(task_id, result, state, traceback,
2023-03-08T09:02:25.131615479Z   File "/usr/local/lib/python3.9/site-packages/celery/backends/base.py", line 956, in _store_result
2023-03-08T09:02:25.131619537Z     current_meta = self._get_task_meta_for(task_id)
2023-03-08T09:02:25.131623810Z   File "/usr/local/lib/python3.9/site-packages/celery/backends/base.py", line 978, in _get_task_meta_for
2023-03-08T09:02:25.131627908Z     meta = self.get(self.get_key_for_task(task_id))
2023-03-08T09:02:25.131631996Z   File "/usr/local/lib/python3.9/site-packages/celery/backends/redis.py", line 368, in get
2023-03-08T09:02:25.131635893Z     return self.client.get(key)
2023-03-08T09:02:25.131640212Z   File "/usr/local/lib/python3.9/site-packages/redis/client.py", line 1579, in get
2023-03-08T09:02:25.131644588Z     return self.execute_command('GET', name)
2023-03-08T09:02:25.131649951Z   File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/redis.py", line 167, in sentry_patched_execute_command
2023-03-08T09:02:25.131672827Z     return old_execute_command(self, name, *args, **kwargs)
2023-03-08T09:02:25.131677677Z   File "/usr/local/lib/python3.9/site-packages/redis/client.py", line 878, in execute_command
2023-03-08T09:02:25.131681890Z     return self.parse_response(conn, command_name, **options)
2023-03-08T09:02:25.131685994Z   File "/usr/local/lib/python3.9/site-packages/redis/client.py", line 892, in parse_response
2023-03-08T09:02:25.131690597Z     response = connection.read_response()
2023-03-08T09:02:25.131694482Z   File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 734, in read_response
2023-03-08T09:02:25.131698442Z     response = self._parser.read_response()
2023-03-08T09:02:25.131702239Z   File "/usr/local/lib/python3.9/site-packages/redis/connection.py", line 333, in read_response
2023-03-08T09:02:25.131706636Z     raise error
2023-03-08T09:02:25.131710426Z redis.exceptions.BusyLoadingError: Redis is loading the dataset in memory

I have no idea what caused this. ./bw-dev restart_celery was sufficient to fix it, but I'd like to debug the actual problem here. Potentially a problem with Redis or the connection pool running out of connection slots?

WesleyAC commented 1 year ago

My current suspicion is that this happens when the broker redis server restarts (unclear why this happens, docker logs are unhelpful) and takes long enough to come back up that celery stops retrying. I've set broker_connection_max_retries to 0 (infinite) on bookwyrm.social to see if this fixes the issue, although I still want to dig into why redis is restarting (it's the docker container that's being restarted, according to docker ps)

WesleyAC commented 1 year ago

broker_connection_max_retries doesn't seem to fix this, but i don't understand why.

oh, and redis was getting killed by the OOM killer, it turns out, which makes sense.

WesleyAC commented 1 year ago

Experimenting now with setting broker_connection_max_retries much lower, under the theory that will cause celery to die and restart when this happens, as opposed to trying and failing to connect forever.

Interesting theory in https://github.com/celery/celery/issues/4556 that it may be due to firewall configuration.

Also, another reason redis dies sometimes is running out of disk space, mostly due to the bug with CSV exports (#2157).

bookwyrm-social / bookwyrm

Investigate Celery hanging due to redis.exceptions.BusyLoadingError #2719