bihealth / sodar-server

SODAR: System for Omics Data Access and Retrieval
https://github.com/bihealth/sodar-server
MIT License

Crash in landing_zone_create flow with large project and create/restrict collections #1905

Closed mikkonie closed 5 months ago

mikkonie commented 5 months ago

Recently I optimized landing_zone_move to work better with very large projects. However, it seems that landing_zone_create can still fail.

I just witnessed the create flow crashing on a large project of 5000+ samples with create_collections enabled. At first glance, it seems to have crashed already in the PREPARING state. I also verified that the Celery job has since been terminated, so this wasn't just a case of "unoptimized code runs for days".

Celery reports the following (not 100% sure if related, but nothing else was failing at the time, so most likely):

[2024-02-16 10:10:33,828: ERROR/MainProcess] Process 'ForkPoolWorker-15' pid:40 exited with 'signal 9 (SIGKILL)'
[2024-02-16 10:10:34,487: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL) Job: 3189.')
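
For context on reading the log above: "signal 9 (SIGKILL)" means the worker process was killed from outside and could not handle or log the signal itself. A common cause is the kernel's out-of-memory killer, but that is an assumption here and not something the log confirms. The minimal Python sketch below (POSIX only, unrelated to SODAR or Celery internals, with hypothetical helper names `run_and_kill` and `describe`) shows how a parent process sees a SIGKILLed child:

```python
import signal
import subprocess
import sys


def run_and_kill():
    """Start a sleeping child process, SIGKILL it, and return its exit status."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    # SIGKILL cannot be caught or handled by the child process
    child.send_signal(signal.SIGKILL)
    return child.wait(timeout=10)


def describe(returncode):
    """Translate a Popen return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N"
        return f"signal {-returncode} ({signal.Signals(-returncode).name})"
    return f"exit code {returncode}"


if __name__ == "__main__":
    print(describe(run_and_kill()))
```

Since the child gets no chance to clean up or report anything, the only trace is the one recorded by the parent, which matches what Celery's main process reports here.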

I can't check the SODAR logs, since I can't access them in our Sentry at the moment.

I'll have to try to reproduce this locally with a similar sample sheet.

mikkonie commented 5 months ago

Confirmed to happen locally with a similarly sized sample sheet. It froze my dev laptop for 10 minutes before crashing with the same error as above, so the Celery error is confirmed as relevant. Next step: look closer into the flow and debug.

EDIT: This works locally if create_colls=True but restrict_colls=False. Either there is something crash-prone in SetAccessTask, or taskflow simply can't handle such a large number of tasks in a linear queue. In the latter case, moving the functionality of that task (and potentially CreateCollectionTask) into a batch-based task should help. I'll look into it further.
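
To illustrate the second hypothesis, here is a rough, self-contained Python sketch of per-item tasks versus a single batch task in a linear queue. All names (`SetAccessOne`, `SetAccessBatch`, `LinearFlow`, the paths and the "read" access value) are hypothetical stand-ins and not the actual SODAR taskflow classes. It does not reproduce the crash, it only shows how the number of queued task objects differs while the end result stays the same:

```python
from dataclasses import dataclass, field


@dataclass
class SetAccessOne:
    """Hypothetical per-collection task: one task object per path."""
    path: str

    def execute(self, acl):
        acl[self.path] = "read"


@dataclass
class SetAccessBatch:
    """Hypothetical batch task: one task object covering many paths."""
    paths: list

    def execute(self, acl):
        for path in self.paths:
            acl[path] = "read"


@dataclass
class LinearFlow:
    """Toy linear flow which runs its queued tasks in order."""
    tasks: list = field(default_factory=list)

    def add(self, task):
        self.tasks.append(task)

    def run(self):
        acl = {}  # Stand-in for access state on the storage backend
        for task in self.tasks:
            task.execute(acl)
        return acl


def build_per_item(paths):
    """Queue one task per collection path."""
    flow = LinearFlow()
    for path in paths:
        flow.add(SetAccessOne(path))
    return flow


def build_batched(paths):
    """Queue a single task which handles all collection paths."""
    flow = LinearFlow()
    flow.add(SetAccessBatch(list(paths)))
    return flow


paths = [f"/project/sample_{i:04d}" for i in range(5000)]
per_item = build_per_item(paths)
batched = build_batched(paths)
print(len(per_item.tasks), len(batched.tasks))  # 5000 1
assert per_item.run() == batched.run()
```

Whether the per-task overhead in the real flow is what triggers the SIGKILL still needs to be verified by debugging. The sketch only shows why a batch-based task would shrink the queue from thousands of entries to one.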

mikkonie commented 5 months ago

Switching from CreateCollectionTask to BatchCreateCollectionTask already made this work on my dev machine, albeit slowly. I'll set up a batch version of SetAccessTask and use it here. After that we should again be future-proof for a little while :)

mikkonie commented 5 months ago

Fixed.