Describe the bug
From time to time, GPU jobs may fail to resume with
ContainerCannotRun(128)
Kubelet shows the following errors:

A similar issue is described at https://github.com/NVIDIA/nvidia-docker/issues/628
It seems the issue is caused by NVIDIA persistence mode not being set for the resumed job. Without persistence mode, listing the available GPU devices takes much longer, which may cause a timeout if we are unlucky (given the "timeout" nature, nvidia-docker may or may not fit into the timing).

For the new compute nodes, persistence mode is configured and we don't run into such issues: https://github.com/epam/cloud-pipeline/blob/6cfc772adc6849c6986cc112b72d3a3cec076a8f/scripts/autoscaling/init_multicloud.sh#L127
We shall add the same behavior for the resumed jobs, along the lines of the sketch below.
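A minimal sketch of what this could look like, assuming we can hook a shell step into the node-resume flow (the hook itself is an assumption for illustration; only the nvidia-smi calls are real CLI commands, mirroring what init_multicloud.sh does for new nodes):

```bash
#!/bin/bash
# Enable NVIDIA persistence mode on a resumed GPU node before the job's
# container is started, so that GPU device enumeration stays fast and does
# not hit the docker/kubelet start timeout.

# Skip non-GPU nodes: nvidia-smi is only present when the NVIDIA driver is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    # Keep the NVIDIA driver loaded between jobs (persistence mode on).
    nvidia-smi -pm 1 || echo "WARN: failed to enable NVIDIA persistence mode"
fi
```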
Environment:
0.15.0.3559.356e168d1b2a6284c1c115d589a431cac1d67916