Open · homework36 opened this issue 3 weeks ago
It also appears that a single GPU job takes almost all of the GPU RAM. That is not necessarily a bug, but I'm not sure whether GPU-celery will resume the unfinished job when it restarts.
@kyrieb-ekat had a training job running and we saw a strange interruption. The GPU-celery container stopped in the middle of the training job, and when it restarted it picked up the next job in the queue. The unfinished training job still shows as "processing" on the website, when it should really be marked something like "aborted". Message from the Docker service:
"task: non-zero exit (143): dockerexec: unhealthy container"
Here is the health check command for GPU-celery:

```
["CMD", "celery", "inspect", "ping", "-A", "rodan", "--workdir", "/code/Rodan", "-d", "celery@GPU", "-t", "30"]
```
I'm not sure which check fails and leads to the container exit. Error 137 usually indicates a container killed due to an out-of-memory (OOM) condition (137 = 128 + SIGKILL), while the exit code 143 above corresponds to SIGTERM, i.e. Docker stopping the unhealthy container.
Again, it would be ideal if users could monitor the job status instead of seeing "processing" indefinitely when the job has actually been killed.
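As a sketch of what better status reporting could look like (this is not Rodan's actual code; the state names "processing"/"aborted" and the reconciliation rule are assumptions), the status shown to users could be reconciled against the real Celery task state after a worker restart:

```python
# Hypothetical sketch, not Rodan's implementation: decide what status to
# show a user for a job, given what the database says and what Celery
# reports for the underlying task.

# Celery states that mean the task will never finish.
TERMINAL_CELERY_STATES = {"FAILURE", "REVOKED"}


def display_status(db_status, celery_state):
    """Return the status a user should see for a job.

    A job the database still marks "processing", but whose Celery task
    has failed, been revoked, or is unknown to the broker (Celery
    reports unknown task ids as PENDING), is surfaced as "aborted"
    instead of hanging at "processing" forever.
    """
    if db_status != "processing":
        return db_status
    if celery_state in TERMINAL_CELERY_STATES or celery_state == "PENDING":
        return "aborted"
    return db_status
```

A periodic task, or a handler on the worker-startup signal, could run this check over all jobs still marked "processing" and flip the stale ones to "aborted".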
Logs from the container:

```
2ca3b1]: Epoch 38/50
[2024-06-10 10:38:36,261: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, 'PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more', (0, 0), '')
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/worker.py", line 208, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.7/dist-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.7/dist-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.7/dist-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/consumer/consumer.py", line 599, in start
    c.loop(*c.loop_args())
  File "/usr/local/lib/python3.7/dist-packages/celery/worker/loops.py", line 83, in asynloop
    next(loop)
  File "/usr/local/lib/python3.7/dist-packages/kombu/asynchronous/hub.py", line 364, in create_loop
    cb(*cbargs)
  File "/usr/local/lib/python3.7/dist-packages/kombu/transport/base.py", line 238, in on_readable
    reader(loop)
  File "/usr/local/lib/python3.7/dist-packages/kombu/transport/base.py", line 220, in _read
    drain_events(timeout=0)
  File "/usr/local/lib/python3.7/dist-packages/amqp/connection.py", line 508, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.7/dist-packages/amqp/connection.py", line 514, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.7/dist-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.7/dist-packages/amqp/connection.py", line 521, in on_inbound_method
    method_sig, payload, content,
  File "/usr/local/lib/python3.7/dist-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.7/dist-packages/amqp/channel.py", line 280, in _on_close
    reply_code, reply_text, (class_id, method_id), ChannelError,
amqp.exceptions.PreconditionFailed: (0, 0): (406) PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out. Timeout value used: 1800000 ms. This timeout value can be configured, see consumers doc guide to learn more
[2024-06-10 10:39:45,490: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: 125/125 - 93s - loss: 0.3092 - accuracy: 0.9830 - val_loss: 0.3052 - val_accuracy: 0.9829
[2024-06-10 10:39:45,490: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 00038: val_accuracy did not improve from 0.98441
[2024-06-10 10:39:45,492: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 39/50
[2024-06-10 10:41:18,370: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: 125/125 - 93s - loss: 0.3031 - accuracy: 0.9832 - val_loss: 0.3010 - val_accuracy: 0.9821
[2024-06-10 10:41:18,371: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 00039: val_accuracy did not improve from 0.98441
[2024-06-10 10:41:18,374: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 40/50
[2024-06-10 10:42:51,260: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: 125/125 - 93s - loss: 0.3017 - accuracy: 0.9832 - val_loss: 0.3019 - val_accuracy: 0.9781
[2024-06-10 10:42:51,261: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 00040: val_accuracy did not improve from 0.98441
[2024-06-10 10:42:51,263: WARNING/ForkPoolWorker-2] Training model for Patchwise Analysis of Music Document, Training[b4eae36a-7370-43c0-b73c-71330f2ca3b1]: Epoch 41/50
make: *** [Makefile:243: gpu-celery_log] Error 137
```
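For what it's worth, the 1800000 ms in the PreconditionFailed error matches RabbitMQ's `consumer_timeout`, which in recent RabbitMQ versions defaults to 30 minutes: if a delivered message (here, the long-running training task) is not acknowledged within that window, the broker closes the channel and the worker hits this unrecoverable error. If long training jobs are expected, one option is raising the timeout in `rabbitmq.conf` (the 2-hour value below is only an illustrative choice, not a recommendation):

```
# rabbitmq.conf -- raise the delivery-acknowledgement timeout
# value is in milliseconds; 7200000 ms = 2 hours (illustrative)
consumer_timeout = 7200000
```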
We also need to be able to send an email to users when a job fails. It is implemented here https://github.com/DDMAL/Rodan/blob/e5f620dcfac55721a858ddbec81d85f73bc22dbe/rodan-main/code/rodan/jobs/base.py#L996C1-L996C5 but it clearly does not work.
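As a starting point for debugging that, the notification content can be separated from the sending. A minimal sketch (a hypothetical helper, not the code in base.py) that composes the failure email, which an `on_failure` hook could then pass to e.g. Django's `send_mail`:

```python
# Hypothetical helper, not Rodan's actual base.py code: build the
# failure-notification email for a job. Sending it (e.g. via
# django.core.mail.send_mail from the task's on_failure hook) is the
# part that apparently fails today and is worth testing in isolation.
from typing import Tuple


def failure_email(job_name: str, job_uuid: str, error: str) -> Tuple[str, str]:
    """Return (subject, body) for a job-failure notification."""
    subject = f"[Rodan] Job failed: {job_name}"
    body = (
        f"Your job {job_name} ({job_uuid}) stopped unexpectedly.\n"
        f"Error: {error}\n"
        "The job is no longer running; please resubmit it if needed."
    )
    return subject, body
```

Keeping the message-building pure makes it easy to unit-test, so the only thing left to verify in staging is the actual SMTP/send path.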
#1164; possible to combine the issues / close that one in favor of this one?
Sorry, I should've commented in that issue.
@kyrieb-ekat and I discovered that if a training job is running and we want to cancel it, we can manually kill the GPU container on the server. When relaunched, it automatically proceeds to the queued jobs. (In Docker Swarm mode the container is also recreated automatically, so cancelling comes down to one command: `docker rm -f <gpu container id>`.)
This is from staging:
Sometimes a GPU job takes a long time and we don't want to wait; it would be helpful if we could cancel the job and remove it from the queue.
The GPU-celery container crashed and restarted itself, and now we are back at 0%.