The latest manual run managed to finish successfully in ~3 hours. Could just be that so much data got backed up that it was taking forever.
Will let the auto-scheduled version run once to see if this is still timing out.
Might need to make this smarter; we could probably spawn a thread with a 1 hour lifetime, so that if any requests hang or anything like that, it won't crash the whole thing.
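Roughly the shape of that idea, as a sketch only; `run_with_lifetime` and the 1-hour default are placeholders, not the actual task code:

```python
# Sketch: run one source's ingestion in a daemon thread and stop waiting on it
# after an hour. Python can't force-kill a thread, but the daemon flag means a
# hung request won't keep the process alive or block the rest of the run.
import threading

def run_with_lifetime(target, lifetime_seconds=3600):
    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout=lifetime_seconds)
    if worker.is_alive():
        # Presumed hung: report it and move on to the next source.
        return False
    return True
```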
It's a CPU leak!
Looks like we've got a whole bunch of hanging chromium processes:
$ ps aux | grep chromium | wc -l
257
Cleaned up all the orphaned chromium instances. This is very obviously the issue, but now the question becomes: why?
I'm not sure how expensive the actual create driver call is, but we'll start by creating and closing a driver per request to manage our instances.
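Something along these lines; a minimal sketch assuming Selenium with headless Chrome, with `fresh_driver` / `fetch_page` as illustrative names rather than the real code:

```python
from contextlib import contextmanager

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

@contextmanager
def fresh_driver():
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        yield driver
    finally:
        # quit() tears down both chromedriver and the browser it spawned,
        # which is what should stop orphaned chromium processes from piling up.
        driver.quit()

def fetch_page(url: str) -> str:
    # One driver per request: more startup cost, no long-lived instances to leak.
    with fresh_driver() as driver:
        driver.get(url)
        return driver.page_source
```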
With commit dbfa471d404db06c8c696d14f4eee05cfdd162dd deployed to prod now, we'll see tomorrow whether our CPU still hits 99%!
Okay! It's working, BUT we are launching too many jobs:
Same with carpenter / janitor; it seems like we are running them all every couple of hours instead of ON hour = X.
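For reference, pinning each task to a specific hour means a crontab-based beat schedule rather than an interval. A sketch only; the entry names and hours are made up, the task paths are the ones from the worker log:

```python
from celery import Celery
from celery.schedules import crontab

app = Celery("api")

app.conf.beat_schedule = {
    "run-ingester-nightly": {
        "task": "api.tasks.run_ingester",
        "schedule": crontab(hour=2, minute=0),  # once a day at 02:00, not every few hours
    },
    "run-carpenter-nightly": {
        "task": "api.tasks.run_carpenter",
        "schedule": crontab(hour=4, minute=0),
    },
    "run-janitor-nightly": {
        "task": "api.tasks.run_janitor",
        "schedule": crontab(hour=5, minute=0),
    },
}
```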
Looking a little better, but still leaking some.
Attempting to remove the multi-start business from the celeryd config.
From the celerybeat log (/var/run/celery/beat.log), we are only sending out each message once.
But the worker log seems to have lots of received messages:
[2024-08-27 02:39:40,380: INFO/MainProcess] Task api.tasks.run_carpenter[688d0fca-cfdf-4908-903b-86a515ec4fe5] received
[2024-08-27 02:39:45,561: INFO/MainProcess] Task api.tasks.run_janitor[992037ff-fc9c-470e-b797-4bcff573c16b] received
[2024-08-27 04:08:42,072: INFO/MainProcess] Task api.tasks.run_ingester[28f05b0a-5594-4059-a62e-198f2b122463] received
[2024-08-27 04:11:29,160: INFO/MainProcess] Task api.tasks.delete_old_ingestion_runs[ec5bf3ae-4e50-4a79-9833-7544b7b16dc4] received
[2024-08-27 04:11:29,182: ERROR/MainProcess] Received unregistered task of type 'api.tasks.import_and_clean'.
The message has been ignored and discarded.
Did you remember to import the module containing this task?
Or maybe you're using relative imports?
Please see
https://docs.celeryq.dev/en/latest/internals/protocol.html
for more information.
The full contents of the message body was:
'[[], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (77b)
The full contents of the message headers:
{'lang': 'py', 'task': 'api.tasks.import_and_clean', 'id': '3891f630-7aa8-4665-af7e-83db0e04c310', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '3891f630-7aa8-4665-af7e-83db0e04c310', 'parent_id': None, 'argsrepr': '()', 'kwargsrepr': '{}', 'origin': 'gen3660618@sms-prod', 'ignore_result': False, 'replaced_task_nesting': 0, 'stamped_headers': None, 'stamps': {}}
The delivery info for this task is:
{'consumer_tag': 'None4', 'delivery_tag': 30, 'redelivered': False, 'exchange': '', 'routing_key': 'celery'}
Traceback (most recent call last):
File "/home/admiralbolt/deployments/venv/lib/python3.9/site-packages/celery/worker/consumer/consumer.py", line 658, in on_task_received
strategy = strategies[type_]
KeyError: 'api.tasks.import_and_clean'
[2024-08-27 04:11:29,184: INFO/MainProcess] Task api.tasks.run_carpenter[dcb4aa68-ac65-436f-8bb3-96c4d33a9fac] received
[2024-08-27 04:11:32,843: INFO/MainProcess] Task api.tasks.run_janitor[56633d30-92a6-4547-9d1c-7310f6372d1a] received
[2024-08-27 05:39:16,296: INFO/MainProcess] Task api.tasks.run_ingester[3509f50f-cc4b-4a53-a1bc-943c0bd21fb8] received
[2024-08-27 06:26:09,567: INFO/MainProcess] Task api.tasks.delete_old_ingestion_runs[a8dca582-87f0-4777-bdfb-7a48d356dae8] received
[2024-08-27 06:26:09,571: ERROR/MainProcess] Received unregistered task of type 'api.tasks.import_and_clean'.
From what I can tell, this is because of the multi-start I was using previously in the daemon version of celeryd.
Though, I swear it was working before and only running once a day. :shrug:
Getting closer, down to only a single run, but it didn't complete successfully.
Looks like we just hit 10 hours of processing even though we didn't finish everything. There's probably a random network call that hangs and causes this timeout, just not every time. Will look at adding a timer to the SB stuff.
Aug 28 18:00:00 sms-prod sh[3860772]: raise TimeLimitExceeded(job._timeout)
Aug 28 18:00:00 sms-prod sh[3860772]: billiard.einfo.ExceptionWithTraceback:
Aug 28 18:00:00 sms-prod sh[3860772]: """
Aug 28 18:00:00 sms-prod sh[3860772]: Traceback (most recent call last):
Aug 28 18:00:00 sms-prod sh[3860772]: File "/home/admiralbolt/deployments/venv/lib/python3.9/site-packages/billiard/pool.py", line 683, in on_hard_timeout
Aug 28 18:00:00 sms-prod sh[3860772]: raise TimeLimitExceeded(job._timeout)
Aug 28 18:00:00 sms-prod sh[3860772]: billiard.exceptions.TimeLimitExceeded: TimeLimitExceeded(36000,)
Aug 28 18:00:00 sms-prod sh[3860772]: """
Aug 28 18:00:00 sms-prod sh[3860772]: [2024-08-28 11:00:00,050: ERROR/MainProcess] Hard time limit (36000s) exceeded for api.tasks.run_ingester[11cae47d-60bb-42f6-b48f-c2cb58cda782]
Aug 28 18:00:00 sms-prod sh[3860772]: [2024-08-28 11:00:00,250: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:3860780 exited with 'signal 9 (SIGKILL)'
OKAY, so we are NOT hitting the 45 second trawler timeout at all. Strange that we are hitting the 10 hour time limit when a manual run of the thing finishes in a few hours, though. Somehow more work gets done when it's a user process.
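For the record, the 36000s in that log is Celery's task hard time limit. Presumably it's set somewhere like the app config (sketch below), though it could just as easily be a per-task time_limit or a flag in the daemon options:

```python
from celery import Celery

app = Celery("api")

# 36000s = 10 hours, matching the "Hard time limit (36000s) exceeded" log line.
# (A task_soft_time_limit slightly below this would at least let the task log
# where it was before billiard SIGKILLs the pool worker.)
app.conf.task_time_limit = 36000
```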
Also, it seems like if we hit this wall, we create a bunch of chromium zombies.
Judging from the log entries, we are taking about 3 hours to process bandsintown, but that should be the slow one. I think if we add similar logging to the rest of the APIs, it'll be easier to track what's going on here.
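The logging I'm thinking of is basically a timing wrapper per API; a rough sketch, with the logger name and the `ingest_songkick` example being placeholders:

```python
import functools
import logging
import time

logger = logging.getLogger("ingester.timing")

def log_duration(label: str):
    """Log how long the wrapped ingestion function took, even if it raises."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                logger.info("%s finished in %.1f seconds", label, elapsed)
        return wrapper
    return decorator

@log_duration("songkick")
def ingest_songkick():
    ...  # placeholder for the actual per-API ingestion
```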
Metadata has been added and is working; going to do a local run of the songkick ingester to see if I can spot something funny happening.
Turns out songkick's page is just written like garbage. They have a bunch of calls that block the page load from finishing, and a million errors when you open their page.
Still, something funny is happening that's causing the whole thing to time out. Gonna deploy and look at tomorrow's logs.
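One way to cope with a page like that is to cap the page load instead of waiting for it to finish. A sketch only, assuming Selenium 4; the eager load strategy and the 45-second number (borrowed from the trawler timeout above) are illustrative values, not necessarily what actually shipped:

```python
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.page_load_strategy = "eager"  # don't wait on every blocking sub-resource

driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(45)  # bail out of the navigation after 45 seconds
try:
    driver.get("https://www.songkick.com/")
except TimeoutException:
    pass  # the DOM we care about is usually already there by now
html = driver.page_source
driver.quit()
```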
Wooohooooooo!!!!!
And no hanging chrome processes this time! I'll keep monitoring in case this happens again, but we should be in a much better state long-term because of these changes.
Ingester is taking a million years to import from bandsintown and songkick and is blocking everything.
Grah