NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

cri-o fails to start after nvidia-ctk runtime configure: conmon executable file not found in $PATH #681

Open kznrluk opened 2 months ago

kznrluk commented 2 months ago

I set up the NVIDIA Container Toolkit for CRI-O by following the documentation below, but afterward CRI-O failed to start.

https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o

sudo nvidia-ctk runtime configure --runtime=crio

When I checked journalctl, CRI-O reported that it could not find the conmon executable in $PATH while validating the nvidia runtime.

Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.283245578Z" level=info msg="AppArmor is disabled by the system or at CRI-O build-time"
Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.283252429Z" level=info msg="No blockio config file specified, blockio not configured"
Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.283259598Z" level=info msg="RDT not available in the host system"
Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.283270449Z" level=info msg="Using conmon executable: /usr/libexec/crio/conmon"
Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.283969678Z" level=info msg="Conmon does support the --sync option"
Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.283985508Z" level=info msg="Conmon does support the --log-global-size-max option"
Sep 07 10:11:39 aki1 crio[7354]: time="2024-09-07 10:11:39.284016218Z" level=fatal msg="validating runtime config: monitor fields translation: failed to translate monitor fields for runtime nvidia: exec: \"conmon\": executable file not found in $PATH"
Sep 07 10:11:39 aki1 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
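
For anyone trying to confirm they are hitting the same failure, the fatal entry can be pulled out of the journal directly. This is only a minimal sketch: the helper name `crio_fatal` is made up for illustration, and it assumes CRI-O runs as the `crio` systemd unit, as in the log above.

```shell
#!/bin/sh
# Hypothetical helper: reads CRI-O log lines on stdin and prints only
# the entries logged at level=fatal.
crio_fatal() {
    grep 'level=fatal'
}

# Example invocation (commented out, needs a host running crio.service):
#   journalctl -u crio -b --no-pager | crio_fatal
```

If the output contains `exec: \"conmon\": executable file not found in $PATH` for the nvidia runtime, it is the same problem described in this issue.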

In my environment, I worked around the issue by running ln -s /usr/libexec/crio/conmon /usr/local/bin/conmon, but a proper fix in nvidia-ctk or the documentation may be necessary.
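
The symlink workaround above can be made slightly more defensive. The following is a sketch, not an official fix: it creates the link only when `conmon` is not already resolvable on `$PATH` and the source binary exists. The function name `link_conmon` and its two optional arguments are hypothetical; the default paths are the ones from the log and the command above.

```shell
#!/bin/sh
# Hypothetical helper: link_conmon [SOURCE] [TARGET]
# Symlinks TARGET to SOURCE only if conmon is not already on PATH.
link_conmon() {
    src="${1:-/usr/libexec/crio/conmon}"   # path CRI-O reports in its log
    dst="${2:-/usr/local/bin/conmon}"      # location used in the workaround

    # Nothing to do if a conmon binary is already resolvable.
    if command -v conmon >/dev/null 2>&1; then
        echo "conmon already on PATH: $(command -v conmon)"
        return 0
    fi

    # Refuse to create a dangling link.
    if [ ! -x "$src" ]; then
        echo "conmon not found at $src" >&2
        return 1
    fi

    ln -s "$src" "$dst" && echo "linked $dst -> $src"
}
```

Run `link_conmon` as root with no arguments to apply the defaults, then restart CRI-O with `systemctl restart crio`.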

> uname -a
Linux aki1 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

> crictl -v
crictl version v1.31.1

Serret commented 2 months ago

Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.865508935+01:00" level=info msg="Installing default AppArmor profile: crio-default"
Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.890008458+01:00" level=info msg="No blockio config file specified, blockio not configured"
Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.890036521+01:00" level=info msg="RDT not available in the host system"
Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.890057593+01:00" level=info msg="Using conmon executable: /usr/libexec/crio/conmon"
Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.890994267+01:00" level=info msg="Conmon does support the --sync option"
Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.891010121+01:00" level=info msg="Conmon does support the --log-global-size-max option"
Sep 10 01:02:44 odin crio[8386]: time="2024-09-10 01:02:44.891055289+01:00" level=fatal msg="validating runtime config: monitor fields translation: failed to translate monitor fields for runtime nvidia: exec: \"conmon\": executable file not found in $>
Sep 10 01:02:44 odin systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE

Linux odin 6.8.0-41-generic #41-Ubuntu SMP PREEMPT_DYNAMIC Fri Aug  2 20:41:06 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

crictl version v1.29.0

Slightly different version of crictl but same situation here. I can confirm the suggestion above worked!

plaurin84 commented 3 weeks ago

I can also confirm that the symlink trick from @kznrluk works perfectly.

Tested with:
CRI-O 1.31.1
Kubeadm v1.31.2
NVIDIA Container Runtime Hook version 1.17.0