CentaurusInfra / alnair

Intelligent platform for AI workloads
Apache License 2.0

vgpu-server container failed to start, "run/nvidia-persistenced/socket" no such device or address #119

Open Fizzbb opened 2 years ago

Fizzbb commented 2 years ago

Complete error message:

Error: failed to start container "alnair-vgpu-server": Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/4012b48f38e9057eb80787735e1bb47e7d86c9402d4a4976fd4b07020ae4c63b/merged/run/nvidia-persistenced/socket: no such device or address: unknown

Cause: nvidia-container-runtime initially mounts some files under /run/nvidia-persistenced.

However, alnair-vgpu-server mounts the host's /run at /run in the container, because it uses /run/alnair.sock for communication. As a result, the contents of the container's /run directory are replaced by the host mount.

Solution: Change the alnair socket path to /run/alnair/alnair.sock, which is used in the vgpu-server's server.go and in the intercept lib's client-register. Then mount /run/alnair, instead of /run, in the alnair-vgpu-server container.