DDMAL / Rodan

:dragon_face: A web-based workflow engine.
https://rodan2.simssa.ca/

need to be able to cancel job run #1165

Open · homework36 opened 3 weeks ago

homework36 commented 3 weeks ago

This is from staging:

Thu Jun  6 14:53:29 2024
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:05.0 Off |                    0 |
| N/A   67C    P0   152W / 149W |  11099MiB / 11441MiB |     99%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     29782      C   /usr/bin/python3.7              11094MiB |
+-----------------------------------------------------------------------------+

Sometimes a GPU job takes a long time and we don't want to wait; it would be helpful if we could cancel the job and remove it from the queue.
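
For reference, Celery's control API can already revoke a queued task or terminate one that is running, so something along these lines could back a cancel action. This is only a minimal sketch, assuming the Celery app is importable as rodan.celery and that we have the Celery task id of the run job (neither assumption verified against the codebase):

# Minimal sketch of cancelling a job via Celery's control API.
# Assumptions (not verified): the Celery app lives at rodan.celery
# and `task_id` is the Celery task id recorded for the run job.
from rodan.celery import app

def cancel_job(task_id: str) -> None:
    # A task still waiting in the queue is discarded by the workers
    # once revoked; terminate=True additionally sends SIGTERM to the
    # worker child process if the task is already executing.
    app.control.revoke(task_id, terminate=True, signal="SIGTERM")

Terminating the child process should also release the GPU memory it holds once the process exits.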

The GPU-celery container crashed and restarted itself, and now we are back at 0%:

6xyfw04urdue1xo8187timvcv   rodan_gpu-celery.1       ddmal/rodan-gpu-celery:nightly@sha256:471e11427ca83726fde0544c4d9b5b50912edc81d68290ed526be43a8e20e22a   staging-rodan-gpu-swarm   Running         Starting 23 seconds ago
k1k1jncfq9hu9b1i9r5ru47ib    \_ rodan_gpu-celery.1   ddmal/rodan-gpu-celery:nightly@sha256:471e11427ca83726fde0544c4d9b5b50912edc81d68290ed526be43a8e20e22a   staging-rodan-gpu-swarm   Shutdown        Failed 32 seconds ago     "task: non-zero exit (143): dockerexec: unhealthy container"
homework36 commented 3 weeks ago

Also, it appears that a single GPU job takes almost all of the GPU RAM. That is not necessarily a bug, but I'm not sure whether GPU-celery will resume the unfinished job when it restarts.

homework36 commented 3 weeks ago

@kyrieb-ekat had a training job running and we saw a strange interruption. The GPU-celery container stopped in the middle of the training job, and when it restarted it picked up the job waiting in the queue. The unfinished training job still shows as "processing" on the website, when it should really be something like "aborted". Message from the docker service:

"task: non-zero exit (143): dockerexec: unhealthy container"

Here is the health check command for GPU-celery: ["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]. I'm not sure exactly what fails and leads to the container exit, but Error 137 usually indicates a container being killed due to an out-of-memory (OOM) condition. Again, it would be ideal if users could monitor the job status instead of seeing "processing" indefinitely when the job has actually been killed. Logs from the container:

2ca3b1]: Epoch 38/50
[2024-06-10 10:38:36,261: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, 'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more', (0, 0), '')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/worker.py", line 208, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.7/dist-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.7/dist-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.7/dist-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/consumer/consumer.py", line 599, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/usr/local/lib/python3.7/dist-packages/kombu/asynchronous/hub.py", line 364, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python3.7/dist-packages/kombu/transport/base.py", line 238, in on_readable
    reader(loop)
  File "/usr/local/lib/python3.7/dist-packages/kombu/transport/base.py", line 220, in _read
    drain_events(timeout=0)
  File "/usr/local/lib/python3.7/dist-packages/amqp/connection.py", line 508, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.7/dist-packages/amqp/connection.py", line 514, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.7/dist-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.7/dist-packages/amqp/connection.py", line 521, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.7/dist-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.7/dist-packages/amqp/channel.py", line 280, in _on_close
    reply_code, reply_text, (class_id, method_id), ChannelError,
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
[2024-06-10 10:39:45,490: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: 125/125 - 93s - loss: 0.3092 - accuracy: 0.9830 - val_loss: 0.3052 - val_accuracy: 0.9829
[2024-06-10 10:39:45,490: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 00038: val_accuracy did not improve from 0.98441
[2024-06-10 10:39:45,492: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 39/50
[2024-06-10 10:41:18,370: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: 125/125 - 93s - loss: 0.3031 - accuracy: 0.9832 - val_loss: 0.3010 - val_accuracy: 0.9821
[2024-06-10 10:41:18,371: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 00039: val_accuracy did not improve from 0.98441
[2024-06-10 10:41:18,374: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 40/50
[2024-06-10 10:42:51,260: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: 125/125 - 93s - loss: 0.3017 - accuracy: 0.9832 - val_loss: 0.3019 - val_accuracy: 0.9781
[2024-06-10 10:42:51,261: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 00040: val_accuracy did not improve from 0.98441
[2024-06-10 10:42:51,263: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 41/50
make: *** [Makefile:243: gpu-celery_log] Error 137
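
The PreconditionFailed in the traceback is RabbitMQ closing the channel because a delivery sat unacknowledged for longer than its consumer timeout (the 1800000 ms, i.e. 30 minutes, quoted in the message). If the worker only acknowledges a task after it finishes (acks_late), any training job longer than 30 minutes will trip this. Below is a hedged sketch of the settings involved; I have not checked which of these Rodan actually sets, and the app name and broker URL are purely illustrative:

from celery import Celery

# Illustrative app/broker only, not Rodan's actual configuration.
app = Celery("sketch", broker="amqp://guest:guest@rabbitmq//")

# With acks_late the message is acknowledged only when the task
# finishes, so a task running longer than RabbitMQ's consumer timeout
# (30 minutes by default on recent RabbitMQ) produces exactly the
# "delivery acknowledgement ... timed out" error above.
app.conf.task_acks_late = True

# The timeout itself is broker-side: it is raised or disabled via
# `consumer_timeout` in rabbitmq.conf (in milliseconds), not via Celery.

So besides the OOM guess, the worker can also die this way simply because the training job outlives the broker's acknowledgement window.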
homework36 commented 3 weeks ago

We also need to be able to send an email to users when a job fails. This is implemented at https://github.com/DDMAL/Rodan/blob/e5f620dcfac55721a858ddbec81d85f73bc22dbe/rodan-main/code/rodan/jobs/base.py#L996C1-L996C5 but it clearly does not work.
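
For what it's worth, the usual pattern in a Django + Celery project is to send the mail from the task's on_failure hook. The following is only a minimal sketch of that pattern, not Rodan's actual code in base.py; the owner_email lookup is hypothetical, and Rodan would presumably resolve the address from the run job's owner instead:

from celery import Task
from django.core.mail import send_mail

class NotifyOnFailureTask(Task):
    """Base task class that emails the job owner when the task fails."""

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        # `owner_email` is a hypothetical kwarg used only for this sketch.
        recipient = kwargs.get("owner_email")
        if recipient:
            send_mail(
                subject=f"Rodan job {task_id} failed",
                message=f"The job failed with: {exc}\n\n{einfo}",
                from_email=None,  # falls back to DEFAULT_FROM_EMAIL
                recipient_list=[recipient],
                fail_silently=True,  # don't let mail errors mask the task error
            )
        super().on_failure(exc, task_id, args, kwargs, einfo)

If a hook like this exists but nothing arrives, the first thing to rule out is the mail configuration (EMAIL_BACKEND / SMTP settings) inside the worker container rather than the hook itself.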

kyrieb-ekat commented 3 weeks ago

#1164; would it be possible to combine the issues and close that one in favor of this one?

homework36 commented 3 weeks ago

> #1164; would it be possible to combine the issues and close that one in favor of this one?

Sorry, I should've commented in that issue.

homework36 commented 1 week ago

@kyrieb-ekat and I discovered that if a training job is running and we want to cancel it, we can manually kill the GPU container on the server. When relaunched, it automatically proceeds to the queued jobs. (In Docker Swarm mode the container is recreated automatically, too, so it comes down to a single command: docker rm -f <gpu container id>.)