admiralbolt / sms

Seattle Music Scene

Fix ingester latency #93

Closed admiralbolt closed 2 months ago

admiralbolt commented 2 months ago

Ingester is taking a million years to import from bandsintown and songkick and is blocking everything.

Grah

admiralbolt commented 2 months ago

The latest manual run managed to finish successfully in ~3 hours. It could just be that so much data gets backed up that it was taking forever:

Screenshot 2024-08-25 at 3 11 52 PM
admiralbolt commented 2 months ago

Will let the auto-scheduled version run once to see if this is still timing out.

Might need to make this smarter; we can probably spawn a thread with a 1-hour lifetime, so that if any requests hang, it won't crash the whole thing.
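
Roughly what I have in mind, sketch only (the helper name and the 1-hour value are just placeholders, not what's in the ingester yet):

from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_with_lifetime(fn, *args, lifetime_seconds=3600):
    # Run fn(*args) in a worker thread and give up after lifetime_seconds,
    # so one hung request can't block the whole ingestion run.
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=lifetime_seconds)
    except FutureTimeout:
        # The hung call keeps its thread, but the rest of the run moves on.
        return None
    finally:
        # Don't wait on a hung thread; just stop accepting new work.
        pool.shutdown(wait=False)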

admiralbolt commented 2 months ago

It's a CPU leak!

image

Looks like we've got a whole bunch of hanging chromium processes:

$ ps aux | grep chromium | wc -l
257
admiralbolt commented 2 months ago

Cleaned up all the orphaned chromium instances. This is very obviously the issue, but now the question becomes: why?

admiralbolt commented 2 months ago

I'm not sure how expensive the actual create-driver call is, but we'll start by creating and closing a driver per request to manage our instances.
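
Something like this is the shape of it, sketch only (headless Chromium assumed; fetch_page is just an example caller, not the ingester's real code):

from contextlib import contextmanager
from selenium import webdriver

@contextmanager
def chromium_driver():
    # Fresh headless Chromium per request, always torn down in finally,
    # so a failed request can't leave an orphaned browser process behind.
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        yield driver
    finally:
        driver.quit()

def fetch_page(url):
    with chromium_driver() as driver:
        driver.get(url)
        return driver.page_source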

admiralbolt commented 2 months ago

With the commit dbfa471d404db06c8c696d14f4eee05cfdd162dd built in prod now, we'll see tomorrow whether our CPU still hits 99%!

admiralbolt commented 2 months ago

Okay! It's working, BUT we are launching too many jobs:

Screenshot 2024-08-27 at 4 34 03 PM

Same with carpenter / janitor; it seems like we are running them all every couple of hours instead of only on hour = X.
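
For reference, pinning these to a specific hour in celery beat should look something like this (app name and the hour are placeholders):

from celery import Celery
from celery.schedules import crontab

app = Celery("api")  # app name assumed for illustration

app.conf.beat_schedule = {
    "run-ingester-daily": {
        "task": "api.tasks.run_ingester",
        # crontab(hour=2, minute=0) fires once a day at 02:00;
        # an interval schedule would instead re-queue the task every few hours.
        "schedule": crontab(hour=2, minute=0),
    },
}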

admiralbolt commented 2 months ago
image

Looking a little better, but still leaking some.

admiralbolt commented 2 months ago

Attempting to remove the multi-start business from the celeryd config.

From the celerybeat log (/var/run/celery/beat.log), we are only sending each message out once.

But the worker log seems to have lots of received messages:

[2024-08-27 02:39:40,380: INFO/MainProcess] Task api.tasks.run_carpenter[688d0fca-cfdf-4908-903b-86a515ec4fe5] received
[2024-08-27 02:39:45,561: INFO/MainProcess] Task api.tasks.run_janitor[992037ff-fc9c-470e-b797-4bcff573c16b] received
[2024-08-27 04:08:42,072: INFO/MainProcess] Task api.tasks.run_ingester[28f05b0a-5594-4059-a62e-198f2b122463] received
[2024-08-27 04:11:29,160: INFO/MainProcess] Task api.tasks.delete_old_ingestion_runs[ec5bf3ae-4e50-4a79-9833-7544b7b16dc4] received
[2024-08-27 04:11:29,182: ERROR/MainProcess] Received unregistered task of type 'api.tasks.import_and_clean'.
The message has been ignored and discarded.

Did you remember to import the module containing this task?
Or maybe you're using relative imports?

Please see
https://docs.celeryq.dev/en/latest/internals/protocol.html
for more information.

The full contents of the message body was:
'[[], {}, {"callbacks": null, "errbacks": null, "chain": null, "chord": null}]' (77b)

The full contents of the message headers:
{'lang': 'py', 'task': 'api.tasks.import_and_clean', 'id': '3891f630-7aa8-4665-af7e-83db0e04c310', 'shadow': None, 'eta': None, 'expires': None, 'group': None, 'group_index': None, 'retries': 0, 'timelimit': [None, None], 'root_id': '3891f630-7aa8-4665-af7e-83db0e04c310', 'parent_id': None, 'argsrepr': '()', 'kwargsrepr': '{}', 'origin': 'gen3660618@sms-prod', 'ignore_result': False, 'replaced_task_nesting': 0, 'stamped_headers': None, 'stamps': {}}

The delivery info for this task is:
{'consumer_tag': 'None4', 'delivery_tag': 30, 'redelivered': False, 'exchange': '', 'routing_key': 'celery'}
Traceback (most recent call last):
  File "/home/admiralbolt/deployments/venv/lib/python3.9/site-packages/celery/worker/consumer/consumer.py", line 658, in on_task_received
    strategy = strategies[type_]
KeyError: 'api.tasks.import_and_clean'
[2024-08-27 04:11:29,184: INFO/MainProcess] Task api.tasks.run_carpenter[dcb4aa68-ac65-436f-8bb3-96c4d33a9fac] received
[2024-08-27 04:11:32,843: INFO/MainProcess] Task api.tasks.run_janitor[56633d30-92a6-4547-9d1c-7310f6372d1a] received
[2024-08-27 05:39:16,296: INFO/MainProcess] Task api.tasks.run_ingester[3509f50f-cc4b-4a53-a1bc-943c0bd21fb8] received
[2024-08-27 06:26:09,567: INFO/MainProcess] Task api.tasks.delete_old_ingestion_runs[a8dca582-87f0-4777-bdfb-7a48d356dae8] received
[2024-08-27 06:26:09,571: ERROR/MainProcess] Received unregistered task of type 'api.tasks.import_and_clean'.
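
Side note: that unregistered-task error means the worker's registry has no entry for api.tasks.import_and_clean, either because the worker never imported the module defining it or because a stale worker/scheduler is still sending an old task name. Registration normally happens via something like this (sketch; the broker URL is a placeholder):

from celery import Celery

# Listing task modules in include= makes the worker import them at startup,
# so every @app.task in api.tasks lands in the worker's registry.
app = Celery("api", broker="redis://localhost:6379/0", include=["api.tasks"])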
admiralbolt commented 2 months ago

From what I can tell, this is because of the multi-start I was using previously in the daemon version of celeryd.

Though, I swear it was working before and only running once a day. :shrug:

admiralbolt commented 2 months ago

Getting closer, down to only a single run, but it didn't complete successfully.

admiralbolt commented 2 months ago

Looks like we just hit 10 hours of processing even though we didn't finish everything. There's probably a random network call that hangs and causes this timeout, just not every time. Will look at adding a timer to the SB stuff.

Aug 28 18:00:00 sms-prod sh[3860772]:     raise TimeLimitExceeded(job._timeout)
Aug 28 18:00:00 sms-prod sh[3860772]: billiard.einfo.ExceptionWithTraceback:
Aug 28 18:00:00 sms-prod sh[3860772]: """
Aug 28 18:00:00 sms-prod sh[3860772]: Traceback (most recent call last):
Aug 28 18:00:00 sms-prod sh[3860772]:   File "/home/admiralbolt/deployments/venv/lib/python3.9/site-packages/billiard/pool.py", line 683, in on_hard_timeout
Aug 28 18:00:00 sms-prod sh[3860772]:     raise TimeLimitExceeded(job._timeout)
Aug 28 18:00:00 sms-prod sh[3860772]: billiard.exceptions.TimeLimitExceeded: TimeLimitExceeded(36000,)
Aug 28 18:00:00 sms-prod sh[3860772]: """
Aug 28 18:00:00 sms-prod sh[3860772]: [2024-08-28 11:00:00,050: ERROR/MainProcess] Hard time limit (36000s) exceeded for api.tasks.run_ingester[11cae47d-60bb-42f6-b48f-c2cb58cda782]
Aug 28 18:00:00 sms-prod sh[3860772]: [2024-08-28 11:00:00,250: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:3860780 exited with 'signal 9 (SIGKILL)'
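
One guard worth adding here, sketch only (assuming a plain requests call; the function and timeout values are placeholders): give every outbound call an explicit timeout so a single dead endpoint fails fast instead of hanging until the 36000s hard limit kills the task.

import requests

def fetch_events(url):
    # connect timeout of 5s, read timeout of 45s: a dead endpoint errors out
    # quickly instead of hanging the task until Celery's hard time limit.
    try:
        response = requests.get(url, timeout=(5, 45))
        response.raise_for_status()
        return response.json()
    except requests.RequestException:
        return None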
admiralbolt commented 2 months ago

OKAY, so we are NOT hitting the 45-second trawler timeout at all. Strange that we are hitting the 10-hour time limit when a manual run of the thing finishes in a few hours, though. Somehow more work gets done when it runs as a user process.

Also, it seems like if we hit this wall, we create a bunch of chromium zombies.

admiralbolt commented 2 months ago

Judging from the log entries, we are taking about 3 hours to process bandsintown, but that's expected to be the slow one. I think if we add similar logging to the rest of the APIs, it'll be easier to track what's going on here.
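
The logging I'm picturing is just a timing wrapper around each API's import step, something like this (the logger name and decorator are placeholders, not what's in the repo):

import functools
import logging
import time

logger = logging.getLogger("ingester")

def log_duration(api_name):
    # Log how long each API's import takes so the slow one stands out in the logs.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.monotonic() - start
                logger.info("%s import finished in %.1f minutes", api_name, elapsed / 60)
        return wrapper
    return decorator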

admiralbolt commented 2 months ago

Metadata has been added and is working. Going to do a local run of the songkick ingester to see if I can spot something funny happening.

admiralbolt commented 2 months ago

Turns out songkick's page is just written like garbage. They have a bunch of calls that block the page load from finishing, and a million errors when you open their page.

Still, something funny is happening that's causing the whole thing to time out. Gonna deploy and look at tomorrow's logs.
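
One hedge against their blocking calls, sketch only (the timeout value and URL are placeholders, and this isn't necessarily how the ingester drives the page): stop waiting on the full page load.

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

options = webdriver.ChromeOptions()
options.add_argument("--headless")
# "eager" hands control back once the DOM is ready instead of waiting for
# every blocking sub-resource on the page to finish loading.
options.page_load_strategy = "eager"

driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(60)
try:
    try:
        driver.get("https://www.songkick.com/")  # placeholder URL
    except TimeoutException:
        pass  # keep whatever has rendered so far
    html = driver.page_source
finally:
    driver.quit()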

admiralbolt commented 2 months ago

Wooohooooooo!!!!!

Screenshot 2024-08-30 at 3 34 30 PM
admiralbolt commented 2 months ago

And no hanging chrome processes this time! I'll keep monitoring in case this happens again, but we should be in a much better state long-term because of these changes.