NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

node draining fails & transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused #426

Open wjentner opened 2 years ago

wjentner commented 2 years ago


1. Quick Debug Checklist

1. Issue or feature description

The driver pod fails to drain the node:

nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.mig-manager='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nodeType='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 labeled
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager pod/nvidia-operator-validator-9rrp6 condition met
nvidia-driver-daemonset-c78b4 k8s-driver-manager nvidia driver module is already loaded with refcount 116
nvidia-driver-daemonset-c78b4 k8s-driver-manager Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 labeled
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for the container-toolkit to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager pod/nvidia-container-toolkit-daemonset-x9xrt condition met
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for the device-plugin to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for gpu-feature-discovery to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for dcgm-exporter to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for dcgm to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-c78b4 k8s-driver-manager nvidia_modeset       1142784  0
nvidia-driver-daemonset-c78b4 k8s-driver-manager nvidia_uvm           1163264  2
nvidia-driver-daemonset-c78b4 k8s-driver-manager nvidia              40796160  116 nvidia_uvm,nvidia_modeset
nvidia-driver-daemonset-c78b4 k8s-driver-manager drm                   495616  6 drm_kms_helper,drm_vram_helper,bochs_drm,nvidia,ttm
nvidia-driver-daemonset-c78b4 k8s-driver-manager Could not unload NVIDIA driver kernel modules, driver is in use
nvidia-driver-daemonset-c78b4 k8s-driver-manager Unable to cleanup driver modules, attempting again with node drain...
nvidia-driver-daemonset-c78b4 k8s-driver-manager Draining node kube-gpu1...
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 cordoned
nvidia-driver-daemonset-c78b4 k8s-driver-manager error: unable to drain node "kube-gpu1" due to error:cannot delete Pods with local storage (use --delete-emptydir-data to override): monitoring/alertmanager-dbvis-prometheus-kube-prom-alertmanager-0, continuing command...
nvidia-driver-daemonset-c78b4 k8s-driver-manager There are pending nodes to be drained:
nvidia-driver-daemonset-c78b4 k8s-driver-manager  kube-gpu1
nvidia-driver-daemonset-c78b4 k8s-driver-manager cannot delete Pods with local storage (use --delete-emptydir-data to override): monitoring/alertmanager-dbvis-prometheus-kube-prom-alertmanager-0
nvidia-driver-daemonset-c78b4 k8s-driver-manager Uncordoning node kube-gpu1...
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 uncordoned
nvidia-driver-daemonset-c78b4 k8s-driver-manager Rescheduling all GPU clients on the current node by enabling their component-specific nodeSelector labels
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 labeled
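
For reference, a manual drain that clears the emptyDir-backed pod blocking the automatic drain might look like the following sketch. The flags are taken from the error message above; --ignore-daemonsets is added because the operator and driver pods on the node are DaemonSet-managed.

# Cordon and drain the GPU node, allowing pods with emptyDir volumes
# (such as the alertmanager pod named in the error above) to be evicted.
kubectl drain kube-gpu1 --ignore-daemonsets --delete-emptydir-data

# Once the driver pod has finished reinstalling the driver, make the node schedulable again.
kubectl uncordon kube-gpu1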

After manually draining the node, the following error appears:

nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.operator-validator' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.operator-validator=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.container-toolkit' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.container-toolkit=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.device-plugin' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.device-plugin=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.gpu-feature-discovery' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.gpu-feature-discovery=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm-exporter' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm-exporter=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.dcgm' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.dcgm=true'
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.mig-manager' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.mig-manager='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.nvsm' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.nvsm='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-validator' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-validator='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.sandbox-device-plugin' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.sandbox-device-plugin='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nvidia.com/gpu.deploy.vgpu-device-manager' node label
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nvidia.com/gpu.deploy.vgpu-device-manager='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Getting current value of the 'nodeType' node label(used by NVIDIA Fleet Command)
nvidia-driver-daemonset-c78b4 k8s-driver-manager Current value of 'nodeType='
nvidia-driver-daemonset-c78b4 k8s-driver-manager Shutting GPU Operator components that must be restarted on driver restarts by disabling their component-specific nodeSelector labels
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 labeled
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager pod/nvidia-operator-validator-mmm89 condition met
nvidia-driver-daemonset-c78b4 k8s-driver-manager nvidia driver module is already loaded with refcount 116
nvidia-driver-daemonset-c78b4 k8s-driver-manager Shutting down all GPU clients on the current node by disabling their component-specific nodeSelector labels
nvidia-driver-daemonset-c78b4 k8s-driver-manager node/kube-gpu1 labeled
nvidia-driver-daemonset-c78b4 k8s-driver-manager Waiting for the operator-validator to shutdown
nvidia-driver-daemonset-c78b4 k8s-driver-manager rpc error: code = Unavailable desc = connection closed
- nvidia-driver-daemonset-c78b4 › k8s-driver-manager
+ nvidia-driver-daemonset-c78b4 › nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr DRIVER_ARCH is x86_64
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Creating directory NVIDIA-Linux-x86_64-515.65.01
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Verifying archive integrity... OK
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 515.65.01 [...]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ========== NVIDIA Software Installer ==========
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Starting installation of NVIDIA driver version 515.65.01 for Linux kernel version 5.4.0-128-generic
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Stopping NVIDIA persistence daemon...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Unloading NVIDIA driver kernel modules...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Unmounting NVIDIA driver rootfs...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Checking NVIDIA driver packages...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Updating the package cache...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Resolving Linux kernel version...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Proceeding with Linux kernel version 5.4.0-128-generic
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Installing Linux kernel headers...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Installing Linux kernel module files...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Generating Linux kernel version string...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Compiling NVIDIA driver kernel modules...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia/nv-dma.c: In function 'nv_dma_use_map_resource':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia/nv-dma.c:783:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   783 |     const struct dma_map_ops *ops = get_dma_ops(dma_dev->dev);
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |     ^~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr In file included from ./include/linux/list.h:9,
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  from ./include/linux/preempt.h:11,
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  from ./include/linux/spinlock.h:51,
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  from /usr/src/nvidia-515.65.01/kernel/common/inc/nv-lock.h:29,
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  from /usr/src/nvidia-515.65.01/kernel/common/inc/nv-linux.h:32,
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  from /usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.c:26:
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.c: In function 'nv_get_max_sysmem_address':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ./include/linux/kernel.h:842:29: warning: comparison of distinct pointer types lacks a cast
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   842 |   (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                             ^~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ./include/linux/kernel.h:856:4: note: in expansion of macro '__typecheck'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   856 |   (__typecheck(x, y) && __no_side_effects(x, y))
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |    ^~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ./include/linux/kernel.h:866:24: note: in expansion of macro '__safe_cmp'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   866 |  __builtin_choose_expr(__safe_cmp(x, y), \
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                        ^~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ./include/linux/kernel.h:882:19: note: in expansion of macro '__careful_cmp'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   882 | #define max(x, y) __careful_cmp(x, y, >)
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                   ^~~~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia/nv-vm.c:225:26: note: in expansion of macro 'max'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   225 |         global_max_pfn = max(global_max_pfn, node_end_pfn(node_id));
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                          ^~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia/nv-procfs.o: warning: objtool: .text.unlikely: unexpected end of section
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.c: In function 'uvm_pmm_gpu_alloc_kernel':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-uvm/uvm_pmm_gpu.c:645:16: warning: unused variable 'gpu' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   645 |     uvm_gpu_t *gpu = uvm_pmm_to_gpu(pmm);
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                ^~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'cursor_plane_req_config_update':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:88:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr    88 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                                ^~~~~~~~~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:87:27: warning: unused variable 'nv_dev' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr    87 |     struct nv_drm_device *nv_dev = to_nv_device(plane->dev);
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                           ^~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'plane_req_config_update':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:189:9: warning: unused variable 'ret' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   189 |     int ret = 0;
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |         ^~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_plane_atomic_set_property':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:504:32: warning: unused variable 'nv_drm_plane_state' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   504 |     struct nv_drm_plane_state *nv_drm_plane_state =
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                                ^~~~~~~~~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c: In function 'nv_drm_enumerate_crtcs_and_planes':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-crtc.c:1148:13: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr  1148 |             struct drm_plane *overlay_plane =
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |             ^~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c: In function '__will_generate_flip_event':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:98:10: warning: unused variable 'overlay_event' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr    98 |     bool overlay_event = false;
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |          ^~~~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:97:10: warning: unused variable 'primary_event' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr    97 |     bool primary_event = false;
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |          ^~~~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-drm/nvidia-drm-modeset.c:96:23: warning: unused variable 'primary_plane' [-Wunused-variable]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr    96 |     struct drm_plane *primary_plane = crtc->primary;
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |                       ^~~~~~~~~~~~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-peermem/nvidia-peermem.c: In function 'nv_mem_client_init':
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr /usr/src/nvidia-515.65.01/kernel/nvidia-peermem/nvidia-peermem.c:445:5: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr   445 |     int status = 0;
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr       |     ^~~
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Relinking NVIDIA driver kernel modules...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Building NVIDIA driver package nvidia-modules-5.4.0-128...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Installing NVIDIA driver kernel modules...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr WARNING: The nvidia-drm module will not be installed. As a result, DRM-KMS will not function with this installation of the NVIDIA driver.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ERROR: Unable to open 'kernel/dkms.conf' for copying (No such file or directory)
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Welcome to the NVIDIA Software Installer for Unix/Linux
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Detected 16 CPUs online; setting concurrency level to 16.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Installing NVIDIA driver version 515.65.01.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing CC sanity check with CC="/usr/bin/cc".
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing CC check.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Kernel source path: '/lib/modules/5.4.0-128-generic/build'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Kernel output path: '/lib/modules/5.4.0-128-generic/build'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing Compiler check.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing Dom0 check.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing Xen check.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing PREEMPT_RT check.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Performing vgpu_kvm check.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Cleaning kernel module build directory.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Building kernel modules
  : [##############################] 100%
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Kernel module compilation complete.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Unable to determine if Secure Boot is enabled: No such file or directory
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Kernel messages:
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761171.103160] audit: type=1107 audit(1666621940.206:25435): pid=963 uid=103 auid=4294967295 ses=4294967295 msg='apparmor="DENIED" operation="dbus_method_call"  bus="system" path="/org/freedesktop/DBus" interface="org.freedesktop.DBus" member="Hello" mask="send" name="org.freedesktop.DBus" pid=1741 label="cri-containerd.apparmor.d" peer_label="unconfined"
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761171.754529] nvidia-modeset: Unloading
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761172.117175] nvidia-uvm: Unloaded the UVM driver.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761172.160661] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761173.975343] IPv6: ADDRCONF(NETDEV_CHANGE): calia9990a4afdf: link becomes ready
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761201.101161] audit: type=1107 audit(1666621970.206:25436): pid=963 uid=103 auid=4294967295 ses=4294967295 msg='apparmor="DENIED" operation="dbus_method_call"  bus="system" path="/org/freedesktop/DBus" interface="org.freedesktop.DBus" member="Hello" mask="send" name="org.freedesktop.DBus" pid=1741 label="cri-containerd.apparmor.d" peer_label="unconfined"
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761231.118306] audit: type=1107 audit(1666622000.221:25437): pid=963 uid=103 auid=4294967295 ses=4294967295 msg='apparmor="DENIED" operation="dbus_method_call"  bus="system" path="/org/freedesktop/DBus" interface="org.freedesktop.DBus" member="Hello" mask="send" name="org.freedesktop.DBus" pid=1741 label="cri-containerd.apparmor.d" peer_label="unconfined"
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr                  exe="/usr/bin/dbus-daemon" sauid=103 hostname=? addr=? terminal=?'
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761233.227409] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761233.227437] IPv6: ADDRCONF(NETDEV_CHANGE): cali390c8542097: link becomes ready
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761252.658964] IPv6: ADDRCONF(NETDEV_CHANGE): calia50f51f2b1c: link becomes ready
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.178526] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.179911] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.224728] nvidia 0000:02:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.272339] nvidia 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.313658] nvidia 0000:04:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=none:owns=none
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.360701] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  515.65.01  Wed Jul 20 14:00:58 UTC 2022
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.370085] nvidia-uvm: Loaded the UVM driver, major device number 235.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.372202] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  515.65.01  Wed Jul 20 13:43:59 UTC 2022
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.377666] nvidia-modeset: Unloading
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.497966] nvidia-uvm: Unloaded the UVM driver.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr [761256.522141] nvidia-nvlink: Unregistered Nvlink Core, major device number 237
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Installing 'NVIDIA Accelerated Graphics Driver for Linux-x86_64' (515.65.01):
  Installing: [##############################] 100%
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Driver file installation is complete.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Running post-install sanity check:
  Checking: [##############################] 100%
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Post-install sanity check passed.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Installation of the kernel module for the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 515.65.01) is now complete.
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Parsing kernel module parameters...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Loading ipmi and i2c_core kernel modules...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr + modprobe nvidia
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Loading NVIDIA driver kernel modules...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr + modprobe nvidia-uvm
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr + modprobe nvidia-modeset
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr + set +o xtrace -o nounset
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Starting NVIDIA persistence daemon...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr ls: cannot access '/proc/driver/nvidia-nvswitch/devices/*': No such file or directory
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Mounting NVIDIA driver rootfs...
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr Done, now waiting for signal
nvidia-driver-daemonset-c78b4 nvidia-driver-ctr rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused"

Note that while the driver pod reports this error, containerd functions normally and new pods (not using any GPU) can be deployed successfully on the node.

2. Steps to reproduce the issue

Unknown. After the node has been running for a longer time, the CUDA drivers can no longer be injected into pods. So far the only workaround for this problem is to restart the node; afterward, the driver installs successfully.

3. Information to attach (optional if deemed irrelevant)

shivamerla commented 2 years ago

@wjentner Please find details about how driver pod restarts/upgrades are handled here. For drain failures, you can tweak these parameters if that works for you; otherwise a node reboot is required.

driver:
  manager:
    env:
    - name: ENABLE_AUTO_DRAIN
      value: "true"
    - name: DRAIN_USE_FORCE
      value: "false"
    - name: DRAIN_POD_SELECTOR_LABEL
      value: ""
    - name: DRAIN_TIMEOUT_SECONDS
      value: "0s"
    - name: DRAIN_DELETE_EMPTYDIR_DATA
      value: "false"

Also, the error nvidia-driver-daemonset-c78b4 nvidia-driver-ctr rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/containerd/containerd.sock: connect: connection refused" shortly after the driver install is expected, as the "container-toolkit" pod restarts containerd once the driver is ready. Do you see issues starting GPU pods after this error appears?
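
One quick way to answer that is to launch a throwaway GPU pod and check that nvidia-smi works inside it. This is a minimal sketch, not a prescribed procedure; the pod name, CUDA image tag, and single-GPU request are assumptions and can be swapped for whatever the cluster already uses.

# Throwaway pod requesting one GPU; it runs nvidia-smi once and exits.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.7.1-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Inspect the output; a driver/toolkit injection failure typically surfaces here
# as "nvidia-smi: command not found" or the container failing to start.
kubectl logs gpu-smoke-test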

wjentner commented 2 years ago

Thank you for the explanation, @shivamerla. I have adjusted the parameters.

Regarding the second problem: we are unsure what the cause is. After some time (e.g., 8 days of node uptime), new pods no longer get the drivers injected: CUDA does not work and nvidia-smi returns "command not found". I created a test DaemonSet that schedules a pod on our GPU nodes and continuously executes nvidia-smi. During the latest outage I observed something strange:

  1. The Pod was working normally, and I could exec into it and call nvidia-smi.
  2. New Pods no longer worked; the injection mechanism was not functioning anymore.
  3. When I manually deleted the Pod from the DaemonSet, the recreated Pod showed the same behaviour as the other Pods and started crash looping.
  4. All Pods of the gpu-operator were running normally, and there were no errors in any of the logs.

Note that we did not upgrade the drivers, nor did we change anything else on the node. We do not know what caused this, but it has happened before.

To better catch new failures, I created a CronJob, in addition to the DaemonSet, that starts a Pod on GPU nodes and executes nvidia-smi. As of now, the node's uptime is only 2.5 days and everything functions normally. From experience, the problems start occurring after 10 to 15 days of uptime.
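
For reference, a probe DaemonSet along these lines is sketched below. It is an illustration only: the image tag, labels, and the nvidia.com/gpu.present node selector are assumptions based on how the operator commonly labels GPU nodes, and the probe permanently holds one GPU per node.

# DaemonSet that keeps one pod on every GPU node and runs nvidia-smi in a loop,
# so driver-injection failures surface as CrashLoopBackOff on the affected node.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-smi-probe
spec:
  selector:
    matchLabels:
      app: nvidia-smi-probe
  template:
    metadata:
      labels:
        app: nvidia-smi-probe
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: probe
        image: nvidia/cuda:11.7.1-base-ubuntu20.04
        command: ["sh", "-c", "while true; do nvidia-smi || exit 1; sleep 60; done"]
        resources:
          limits:
            nvidia.com/gpu: 1
EOF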

shivamerla commented 2 years ago

Thanks @wjentner. When this happens, can you check whether the mount and all files under /run/nvidia/driver are intact on the node? We should also debug from container-toolkit to understand why device/file injection is failing. Can you add the following debug entries in /usr/local/nvidia/toolkit/nvidia-container-runtime/.config.toml as mentioned here?
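
A few checks along those lines, as a sketch run on the affected GPU node (paths are the defaults the operator uses; adjust if the install differs):

# Confirm the driver rootfs is still mounted and populated.
findmnt /run/nvidia/driver
ls /run/nvidia/driver/usr/bin/nvidia-smi

# Run nvidia-smi from inside the driver rootfs to verify the installed driver itself still responds.
chroot /run/nvidia/driver nvidia-smi

# Review the toolkit runtime config mentioned above before adding the debug entries.
cat /usr/local/nvidia/toolkit/nvidia-container-runtime/.config.toml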