epam / cloud-pipeline

Cloud agnostic genomics analysis, scientific computation and storage platform
https://cloud-pipeline.com
Apache License 2.0

GPU jobs may fail to resume with ContainerCannotRun(128) #828

Closed sidoruka closed 4 years ago

sidoruka commented 4 years ago

Describe the bug
From time to time, GPU jobs may fail to resume with ContainerCannotRun(128). Kubelet shows the following errors:

E1209 18:02:22.345244    7212 remote_runtime.go:208] StartContainer "329e489c65a2eaa71e97cd59d9840502b0a5509798ca6b5df2aad4989789bf12" from runtime service failed: rpc error: code = 2 desc = failed to start container "329e489c65a2eaa71e97cd59d9840502b0a5509798ca6b5df2aad4989789bf12": Error response from daemon: {"message":"OCI runtime create failed: container_linux.go:348: starting container process caused \"process_linux.go:402: container init caused \\\"process_linux.go:385: running prestart hook 0 caused \\\\\\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: timed out\\\\\\\\n\\\\\\\"\\\"\": unknown"}

A similar issue is described at https://github.com/NVIDIA/nvidia-docker/issues/628

It seems that the issue is caused by NVIDIA persistence mode not being enabled on the node of a resumed job.

Without persistence mode, listing the available GPU devices takes considerably longer, which may cause the timeout if we are unlucky (given its "timeout" nature, nvidia-docker may or may not fit into the timing).
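
To confirm whether persistence mode is the culprit on an affected node, one can query the current mode and time a device enumeration. This is a minimal diagnostic sketch (not taken from the actual investigation):

```bash
# Check whether persistence mode is currently enabled on each GPU
nvidia-smi --query-gpu=index,persistence_mode --format=csv,noheader

# Time a full device enumeration; with persistence mode disabled this can
# take long enough to hit the nvidia-container-cli initialization timeout
time nvidia-smi -L
```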

For new compute nodes, persistence mode is configured during initialization, so we don't run into such issues there: https://github.com/epam/cloud-pipeline/blob/6cfc772adc6849c6986cc112b72d3a3cec076a8f/scripts/autoscaling/init_multicloud.sh#L127

We shall add the same behavior for resumed jobs.
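
A minimal sketch of what the fix could look like, mirroring the behavior of init_multicloud.sh for new nodes: enable persistence mode as soon as a resumed node comes back, before any GPU container is started. The guard and ordering here are assumptions for illustration, not the actual patch:

```bash
# Enable NVIDIA persistence mode on resumed nodes with GPUs, so that device
# initialization is fast and nvidia-container-cli does not time out.
# Assumed guard: only run on instances that actually expose NVIDIA devices.
if command -v nvidia-smi > /dev/null 2>&1 && nvidia-smi -L > /dev/null 2>&1; then
    nvidia-smi -pm 1
fi
```

Since persistence mode does not survive a reboot, it has to be re-applied every time a stopped node is started again, not only at initial provisioning.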

Environment:

sidoruka commented 4 years ago

Both develop and release/0.15 are patched