Upgrade from 1.13.5 to 1.14.0/1.14.1 causing container crash & GL errors

ewaldmire commented 1 year ago

I'm using a container from: https://hub.docker.com/r/netbrain/zwift

Which is from: https://github.com/netbrain/zwift/

It works fine with 1.13.5. Once I upgrade to 1.14.0 or 1.14.1 (I've tried both), I get this error in my logs and the container crashes:

Sep 13 09:25:16 computername naughty_gauss[3661]: libGL error: No matching fbConfigs or visuals found
Sep 13 09:25:16 computername naughty_gauss[3661]: libGL error: failed to load driver: swrast
Sep 13 09:25:16 computername naughty_gauss[3661]: X Error of failed request: GLXBadContext
Sep 13 09:25:16 computername naughty_gauss[3661]: Major opcode of failed request: 152 (GLX)
Sep 13 09:25:16 computername naughty_gauss[3661]: Minor opcode of failed request: 6 (X_GLXIsDirect)

If I roll back to 1.13.5 the container works again. I'm running it with podman and sudo, so at first I thought it was related to this bug:

https://github.com/NVIDIA/nvidia-container-toolkit/issues/106

...but the 1.14.1 patch also has this issue. Thank you for your work on this project and please let me know if there's any additional detail I can provide to help troubleshoot.

elezar commented 1 year ago

@ewaldmire how were you triggering the NVIDIA tooling in podman? Was the NVIDIA Container Runtime being set as a podman runtime, or were you using the OCI hook approach?

The OCI hook was removed as part of the 1.14.0 release because it conflicts with the use of CDI.

You could try running the same podman command with --runtime=/usr/bin/nvidia-container-runtime to confirm that this is the cause.

ewaldmire commented 1 year ago

@elezar Thank you so much. I really appreciate your help.

I added your suggested line:

--runtime=/usr/bin/nvidia-container-runtime

and that gave me this error:

ERRO[0000] failed to create NVIDIA Container Runtime: error constructing low-level runtime: error locating runtime: no runtime binary found from candidate list: [docker-runc runc] 
ERRO[0000] Removing container 0bd7fcb5370b37fd19a085de16c3faac9646fca151a6a2dff55837739723d5ef from runtime after creation failed 
Error: OCI runtime error: /usr/bin/nvidia-container-runtime: time="2023-09-13T18:20:24-05:00" level=error msg="failed to create NVIDIA Container Runtime: error constructing low-level runtime: error locating runtime: no runtime binary found from candidate list: [docker-runc runc]"

so I ran:

dnf install -y runc

...and now it works again.

I admittedly don't understand all the pieces, so if you or anyone else can contribute a brief explanation for container newbies - that would be much appreciated. I especially don't understand what could have been doing the function of "runc" before it was installed.

Thank you again so much for helping me work towards a solution!

elezar commented 1 year ago

@ewaldmire on some (possibly even most) systems, podman uses crun as the low-level runtime. You can remove the runc dependency by editing the /etc/nvidia-container-runtime/config.toml file and adding crun to the list that currently contains docker-runc and runc.
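
As a rough sketch, the edited setting might look like the fragment below. The section name and key shown here are an assumption about how the default file is laid out, so compare against the file on your system; the only intended change is appending crun to the existing list.

```toml
# /etc/nvidia-container-runtime/config.toml (fragment)
# Assumed layout: the candidate list lives under this section.
[nvidia-container-runtime]
# Low-level runtimes are tried in order; "crun" is the added entry.
runtimes = ["docker-runc", "runc", "crun"]
```

With crun in the list, the separately installed runc package should no longer be needed on systems where podman already ships crun.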

What the NVIDIA Container Runtime does is modify the incoming OCI runtime spec (for example, by adding the NVIDIA Container Runtime Hook as a prestart hook) before invoking a low-level runtime (e.g. runc or crun) with the same arguments that were passed to it. In your case, runc was not installed, so the lookup for a low-level runtime failed.
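
To make the two steps concrete, here is a minimal Python sketch of the idea, not the actual implementation. The function names, the hook path, and the dict-based spec are all illustrative assumptions; the real runtime works on the OCI config file and handles many more cases.

```python
import copy
import shutil

# Assumed hook location for illustration; the real path may differ per system.
HOOK_PATH = "/usr/bin/nvidia-container-runtime-hook"


def add_prestart_hook(spec, hook_path=HOOK_PATH):
    """Return a copy of an OCI runtime spec (a dict) with a prestart hook added."""
    modified = copy.deepcopy(spec)  # leave the caller's spec untouched
    hooks = modified.setdefault("hooks", {})
    prestart = hooks.setdefault("prestart", [])
    prestart.append({"path": hook_path, "args": [hook_path, "prestart"]})
    return modified


def find_low_level_runtime(candidates, which=shutil.which):
    """Return the path of the first candidate binary found on PATH."""
    for name in candidates:
        path = which(name)
        if path is not None:
            return path
    # No candidate is installed: this is the situation behind the error above.
    raise FileNotFoundError(
        f"no runtime binary found from candidate list: {candidates}"
    )
```

On a machine that only has crun installed, calling find_low_level_runtime(["docker-runc", "runc"]) raises, which mirrors the failure in this issue, while find_low_level_runtime(["docker-runc", "runc", "crun"]) returns the crun path.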

With this in mind, I do think that we can change our defaults to include crun so that we have better integration on systems where this is required.

NVIDIA / nvidia-container-toolkit

Upgrade from 1.13.5 to 1.14.0/1.14.1 causing container crash & GL errors #109