Closed: bviktor closed this issue 3 years ago
In case you're wondering why we want to change the default runtime:
https://github.com/NVIDIA/k8s-device-plugin
You will need to enable the nvidia runtime as your default runtime on your node
Thanks for reporting; it looks like it's panicking here: https://github.com/moby/moby/blob/v20.10.1/plugin/executor/containerd/containerd.go#L71
Looks related to https://github.com/moby/moby/commit/f63f73a4a8f531813d6b46a2347cab4bfd210df7 (part of https://github.com/moby/moby/pull/41182)
/cc @cpuguy83 PTAL
https://github.com/moby/moby/pull/41854 Should take care of this.
I ran into this exact same issue, but I had not customized the default runtime. I was running Docker 20.10.1 on Ubuntu 20.04 LTS. Everything had been running fine on the machine for about 24 hours, and suddenly dockerd just died. I couldn't restart it, but running dockerd directly gave the same kernel panic error as in this bug report. I tried uninstalling and reinstalling all Docker packages on the machine, to no avail. In the end, since it was an Ansible-managed VM, I just destroyed the VM and rebuilt it - then it worked fine.
moby/moby#41854 Should take care of this.
Sweet, thanks a lot! Is there a way I can test this?
Our instances upgraded to 20.10.2 and the issue indeed seems to have been resolved, thanks!
Apparently 20.10.3 broke this again. Please reopen the issue :)
Docker 20.10.3 was a security release. Security releases usually contain only security fixes and are not combined with other fixes; the fix is being backported for Docker 20.10.4 (https://github.com/moby/moby/pull/41974)
Thanks for the info!
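Pulling the version reports in this thread together (including the issue description), the situation can be sketched as a small lookup. This is only a summary of what people reported here, not an authoritative compatibility table: the helper name docker_default_runtime_status is hypothetical, and any version nobody reported on, including 20.10.4 where the backport was only planned at the time, is deliberately returned as "unknown".

```shell
#!/bin/sh
# Sketch: classify a Docker Engine version using only the reports in this thread.
# ASSUMPTION: docker_default_runtime_status is a hypothetical helper name.
docker_default_runtime_status() {
    case "$1" in
        19.03.14)        echo "works" ;;    # issue body: starts with the default runtime set
        20.10.0|20.10.1) echo "broken" ;;   # issue body and the linked v20.10.1 panic site
        20.10.2)         echo "works" ;;    # reported as resolved after upgrading
        20.10.3)         echo "broken" ;;   # security release that did not carry the fix
        *)               echo "unknown" ;;  # not reported in this thread (includes 20.10.4)
    esac
}
```

It could be fed from a running daemon with docker version --format '{{.Server.Version}}', for example docker_default_runtime_status "$(docker version --format '{{.Server.Version}}')".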
Expected behavior
The expected behavior is that Docker starts successfully if you set the default runtime.
Actual behavior
Docker fails to start.
Steps to reproduce the behavior
Install Docker CE on Ubuntu 18.04 as per the official docs.
Observe that the docker service starts successfully with systemctl status docker.service.
Set up /etc/docker/daemon.json as per the nvidia-container-runtime docs.
Restart the service with systemctl restart docker.service and observe its failure.
Check the details with journalctl -u docker.
Now modify daemon.json to exclude the default-runtime setting.
Restart Docker with systemctl restart docker.service and observe that it starts successfully. If you put back the default runtime setting again, it fails again.
At this point you'll possibly also realize that Docker tries to restart itself too often, so even if you remove the default runtime, you might not be able to restart Docker right away, because systemd blocks it for a while.
(Which highlights another problem: RestartSec=2 should definitely be increased to 15 seconds or so in /lib/systemd/system/docker.service. This ticket is not about that; I just want to point it out so that you don't run into this, as I did. Always check the failure cause with journalctl -fu docker.service before concluding that your config is wrong.)
Now that you have successfully started Docker with the runtime defined but not set as the default, confirm that the runtime is actually operable when set explicitly during use.
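A minimal sketch of such a check follows. The exact command used in the original report is not preserved here, so the helper name check_runtime, the DOCKER override variable, and whatever image name is passed in are illustrative assumptions; --runtime is the docker run flag that selects a runtime for a single container.

```shell
#!/bin/sh
# ASSUMPTION: DOCKER is a hypothetical override so the helper can be
# dry-run (e.g. DOCKER=echo) on a machine without a Docker daemon.
DOCKER="${DOCKER:-docker}"

# check_runtime RUNTIME IMAGE
# Runs nvidia-smi in a throwaway container with the runtime chosen
# explicitly, so the daemon-wide default runtime is not involved.
check_runtime() {
    "$DOCKER" run --rm --runtime="$1" "$2" nvidia-smi
}
```

For example, check_runtime nvidia followed by a CUDA-capable image of your choice should print the nvidia-smi table if the runtime works. Setting DOCKER=echo first prints the command that would be run instead of executing it.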
Check for the available docker-ce versions.
Try downgrading to 20.10.0, and observe that the Docker service still fails to start with the default runtime set.
Downgrade to 19.03.14, and observe that the Docker service starts successfully even with the default runtime set.
Now your container output will behave as it should, instead of producing the previous "bash: nvidia-smi: command not found" error.
We've been using this configuration for over a year now. To me it seems like a regression in Docker CE 20. Please advise.
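For reference, the two daemon.json variants contrasted in the steps above can be sketched roughly as follows. The original file contents are not preserved in this report, so the runtime name "nvidia" and the binary path /usr/bin/nvidia-container-runtime are assumptions based on a typical nvidia-container-runtime install; adjust both to your system. The script writes to a scratch directory (overridable through the hypothetical OUT_DIR variable) rather than to /etc/docker, so nothing on the host changes until a file is copied over deliberately.

```shell
#!/bin/sh
# Sketch: generate both daemon.json variants in a scratch directory.
# ASSUMPTIONS: runtime name "nvidia", binary path
# /usr/bin/nvidia-container-runtime, and the OUT_DIR override variable.
out_dir="${OUT_DIR:-$(mktemp -d)}"

# Variant 1: runtime registered AND set as the default runtime.
# This is the configuration the report says fails to start on 20.10.x.
cat > "$out_dir/daemon.with-default.json" <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

# Variant 2: runtime registered only; it must then be selected
# per container (for example with docker run --runtime=nvidia).
cat > "$out_dir/daemon.without-default.json" <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF

echo "wrote daemon.json variants to $out_dir"
```

To try one, copy it to /etc/docker/daemon.json and restart with systemctl restart docker.service, checking journalctl -u docker if the service does not come up.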
Output of docker version:
Output of docker info:
Additional environment details (AWS, VirtualBox, physical, etc.)
Ubuntu 18.04.5 with all updates installed, on several physical computers.