DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

Local rodan fails on celery on M1/M2 machine #1159

Open homework36 opened 1 month ago

homework36 commented 1 month ago

While trying to reproduce the errors from issue #1154, I hit this error when running Background Removal (and, I suspect, other GPU jobs as well) on local machines:

[2024-05-31 11:51:05,852: INFO/MainProcess] Received task: Background Removal[fe06abc7-2c96-4de2-acc2-e96855333697]  
[2024-05-31 11:51:06,098: INFO/ForkPoolWorker-1] started running the task!
[2024-05-31 11:51:06,547: ERROR/MainProcess] Process 'ForkPoolWorker-1' pid:263 exited with 'signal 6 (SIGABRT)'
[2024-05-31 11:51:06,572: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 6 (SIGABRT) Job: 0.')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode), job._job),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 6 (SIGABRT) Job: 0.

I tried allocating resources to the celery container, but that did not help. The same error message shows up on both M1 and M2 Mac minis. I also tried the steps in Running Rodan on ARM [ARCHIVED] and got a "work not registered" error. It also looks like we no longer do anything different for M-chip machines. On an Intel-chip machine, however, everything works fine.

At one point we used make run-arm for M-chip machines, but those lines of code were removed at some later point.

Update: I got the same error log again, this time before running any job. I found the related error message in the GPU-celery container (see the screenshot below).

[Screenshot 2024-06-03 at 9 28 29 AM]

I also found the old arm_compose.yml, which looks different and was deleted at some point.

If we follow the archived ARM guide (see above) without using arm_compose.yml, we get error messages like these, and jobs such as Background Removal disappear.

[Screenshot 2024-05-30 at 12 19 45 PM] [Screenshot 2024-06-03 at 9 50 24 AM]

Docker Desktop on an ARM-based system such as the Apple M1 uses QEMU emulation to run containers built for x86 architectures. It now seems that QEMU itself is returning error messages.
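To make the emulation point concrete, here is a minimal sketch that checks whether a host would need emulation to run an image of a given architecture. The function names and the alias table are assumptions made for illustration; nothing here comes from the Rodan code base. Note that `platform.machine()` reports the architecture of the running Python process, so a Python running under Rosetta on an M-chip Mac may report `x86_64`.

```python
import platform

# Common spellings of the two architectures discussed in this issue.
# This alias table is an illustrative assumption, not an exhaustive list.
_ARCH_ALIASES = {
    "x86_64": "amd64",
    "amd64": "amd64",
    "arm64": "arm64",
    "aarch64": "arm64",
}


def normalize_arch(machine: str) -> str:
    """Map a machine string such as 'aarch64' to a Docker-style name."""
    key = machine.lower()
    return _ARCH_ALIASES.get(key, key)


def needs_emulation(host_machine: str, image_arch: str) -> bool:
    """Return True when the image architecture differs from the host's."""
    return normalize_arch(host_machine) != normalize_arch(image_arch)


if __name__ == "__main__":
    host = platform.machine()
    print(f"host={host} needs emulation for amd64 image: "
          f"{needs_emulation(host, 'amd64')}")
```

On an M1/M2 host this kind of check should flag any Intel-only image as one that will run under emulation.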

related post

I personally do not understand the archived ARM instructions. If we do not register the GPU-related jobs and do not launch the GPU container, these jobs cannot run:

- Background Removal
- Fast Pixelwise Analysis of Music Document, Classifying
- SAE Binarization
- Staff Distance Finding
- Text Alignment
- Training model for Patchwise Analysis of Music Document, Training

According to the Docker forum, the best practice is to provide an arm64 or multi-arch image, not just an Intel one. For local users, we should build arm64-based images at least for the gpu-celery container (ideally for all of them). With QEMU and the existing Dockerfile, this should not be too complicated.

Building multi-platform images under emulation with QEMU is the easiest way to get started if your builder already supports it. Docker Desktop supports it out of the box. It requires no changes to your Dockerfile, and BuildKit automatically detects the secondary architectures that are available.

Docker instructions here.
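As a rough illustration of what such a build could look like, the sketch below assembles a `docker buildx build` command line for one or more target platforms. The image tag and Dockerfile path are hypothetical placeholders, not the real Rodan ones, and the helper only builds the argument list; it does not run Docker.

```python
import shlex


def buildx_command(tag, platforms, dockerfile="Dockerfile", context=".", push=False):
    """Assemble a `docker buildx build` argument list.

    `tag` and `dockerfile` are placeholders to adapt to the real project.
    """
    cmd = [
        "docker", "buildx", "build",
        "--platform", ",".join(platforms),
        "--file", dockerfile,
        "--tag", tag,
    ]
    if push:
        # Push the result to a registry instead of keeping it in the build cache.
        cmd.append("--push")
    cmd.append(context)
    return cmd


if __name__ == "__main__":
    # Hypothetical tag and Dockerfile name, used only for illustration.
    cmd = buildx_command(
        "example/gpu-celery:local",
        ["linux/amd64", "linux/arm64"],
        dockerfile="gpu-celery.Dockerfile",
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Passing only `["linux/arm64"]` would give an arm64-only image for local M1/M2 use. Whether the existing gpu-celery Dockerfile builds on arm64 without changes is an open question that this sketch does not answer.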

Update: I installed QEMU and the "qemu core dumped" error is gone, but I still get Worker exited prematurely: signal 4 (SIGILL) Job: 0.
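For reference, the two signals seen so far mean different things: SIGABRT is what a process receives when it calls abort(), while SIGILL means it tried to execute an illegal instruction. Under x86 emulation, SIGILL may point to CPU instructions the emulator does not support, but that is a guess and not something confirmed in this thread. A small sketch for pulling the signal out of a WorkerLostError line follows; the helper name and the hint texts are illustrative assumptions, and the numbering assumes a Linux or macOS host.

```python
import re
import signal

# Matches fragments such as "signal 6 (SIGABRT)" in a Celery/billiard log line.
_SIGNAL_RE = re.compile(r"signal (\d+) \((SIG[A-Z0-9]+)\)")

# Short descriptions of the two signals reported in this issue.
_HINTS = {
    "SIGABRT": "process aborted itself, typically via abort()",
    "SIGILL": "process tried to execute an illegal instruction",
}


def decode_worker_signal(log_line):
    """Return (number, name, hint) for the first signal found, or None."""
    match = _SIGNAL_RE.search(log_line)
    if match is None:
        return None
    number = int(match.group(1))
    # Look the name up from the number rather than trusting the log text.
    name = signal.Signals(number).name
    return number, name, _HINTS.get(name, "no hint available")


if __name__ == "__main__":
    line = "Worker exited prematurely: signal 4 (SIGILL) Job: 0."
    print(decode_worker_signal(line))
```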

homework36 commented 1 month ago

Possible solutions (to be tested)

  1. Rewrite the Dockerfile for arm64 and have a separate gpu-celery container for local Rodan on M1/M2
  2. Move all GPU jobs to the py3 container (maybe not a good idea...)

Although we still have Intel machines in the lab, it seems important to be able to run GPU jobs properly on ARM machines for local testing, development, etc.