NVIDIA / gpu-operator


nvidia-container-toolkit-daemonset: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount #315

Open joedborg opened 2 years ago

joedborg commented 2 years ago

1. Issue or feature description

This is mostly a question: why is https://github.com/NVIDIA/gpu-operator/blob/7441195aba0145dbbe8f8e4d43716e6c8e6186c2/assets/state-driver/0500_daemonset.yaml#L51-L53 set to mountPropagation: Bidirectional? If I remove this from the manifest, the pod sits in init forever, so I assume something useful is happening. Could this be explained, please? We are having issues trying to get this to work inside mounted filesystems.
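For context, the linked lines are a volumeMount roughly like the following (paraphrased here for illustration; the volume name is taken from the pod description later in this thread and may differ between operator versions):

# Illustrative paraphrase of the referenced stanza, not a verbatim copy of
# assets/state-driver/0500_daemonset.yaml.
volumeMounts:
  - name: run-nvidia-validations
    mountPath: /run/nvidia/validations
    mountPropagation: Bidirectional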

shivamerla commented 2 years ago

@joedborg This was done because the mofed-validation pod writes a status file, /run/nvidia/validations/mofed-ready, once the MOFED driver is running. If Bidirectional propagation is not used, other pods waiting on this file would sit in the init phase. But currently no other pods have that wait condition. Which pod was stuck in init when this was removed?
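A minimal sketch of the wait condition described above, assuming a dependent pod blocks in an init container until the status file appears (the real check lives in the gpu-operator-validator binary, not in a shell loop; the container name here is made up for illustration):

initContainers:
  - name: wait-for-mofed            # hypothetical name, illustration only
    image: busybox
    command: ["sh", "-c"]
    args:
      - "until [ -f /run/nvidia/validations/mofed-ready ]; do sleep 5; done"
    volumeMounts:
      - name: run-nvidia-validations
        mountPath: /run/nvidia/validations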

joedborg commented 2 years ago

Thanks for the reply @shivamerla. When mountPropagation: Bidirectional is removed from the nvidia-container-toolkit-daemonset, the following happens:

NAMESPACE                NAME                                                              READY   STATUS     RESTARTS   AGE
kube-system              pod/coredns-64c6478b6c-sckd6                                      1/1     Running    0          21m
kube-system              pod/calico-node-msfbm                                             1/1     Running    0          23m
kube-system              pod/calico-kube-controllers-784d4d4594-cbd8q                      1/1     Running    0          23m
default                  pod/gpu-operator-node-feature-discovery-worker-vlkf2              1/1     Running    0          20m
default                  pod/gpu-operator-node-feature-discovery-master-5f6fb954cf-zs244   1/1     Running    0          20m
gpu-operator-resources   pod/nvidia-operator-validator-h5jwb                               0/1     Init:0/4   0          19m
gpu-operator-resources   pod/nvidia-device-plugin-daemonset-jvm5g                          0/1     Init:0/1   0          19m
gpu-operator-resources   pod/nvidia-dcgm-wqld2                                             0/1     Init:0/1   0          19m
gpu-operator-resources   pod/nvidia-dcgm-exporter-rm85b                                    0/1     Init:0/1   0          19m
gpu-operator-resources   pod/gpu-feature-discovery-gcfjl                                   0/1     Init:0/1   0          19m
default                  pod/gpu-operator-7d9854fc59-dnh4q                                 1/1     Running    0          20m
gpu-operator-resources   pod/nvidia-container-toolkit-daemonset-s7mv4                      0/1     Init:0/1   0          7m7s

NAMESPACE                NAME                                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
default                  service/kubernetes                                   ClusterIP   10.152.183.1     <none>        443/TCP                  23m
kube-system              service/kube-dns                                     ClusterIP   10.152.183.10    <none>        53/UDP,53/TCP,9153/TCP   21m
default                  service/gpu-operator-node-feature-discovery-master   ClusterIP   10.152.183.166   <none>        8080/TCP                 20m
default                  service/gpu-operator                                 ClusterIP   10.152.183.246   <none>        8080/TCP                 19m
gpu-operator-resources   service/nvidia-dcgm-exporter                         ClusterIP   10.152.183.84    <none>        9400/TCP                 19m

NAMESPACE                NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
kube-system              daemonset.apps/calico-node                                  1         1         1       1            1           kubernetes.io/os=linux                             23m
default                  daemonset.apps/gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                             20m
gpu-operator-resources   daemonset.apps/nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      19m
gpu-operator-resources   daemonset.apps/nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           19m
gpu-operator-resources   daemonset.apps/nvidia-dcgm                                  1         1         0       1            0           nvidia.com/gpu.deploy.dcgm=true                    19m
gpu-operator-resources   daemonset.apps/nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             19m
gpu-operator-resources   daemonset.apps/nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           19m
gpu-operator-resources   daemonset.apps/gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   19m
gpu-operator-resources   daemonset.apps/nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       19m

NAMESPACE     NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/calico-kube-controllers                      1/1     1            1           23m
kube-system   deployment.apps/coredns                                      1/1     1            1           21m
default       deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           20m
default       deployment.apps/gpu-operator                                 1/1     1            1           20m

NAMESPACE     NAME                                                                    DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/calico-kube-controllers-784d4d4594                      1         1         1       23m
kube-system   replicaset.apps/coredns-64c6478b6c                                      1         1         1       21m
default       replicaset.apps/gpu-operator-node-feature-discovery-master-5f6fb954cf   1         1         1       20m
default       replicaset.apps/gpu-operator-7d9854fc59                                 1         1         1       20m

And the pod:

Name:                 nvidia-container-toolkit-daemonset-s7mv4
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-172-31-32-166/172.31.32.166
Start Time:           Thu, 24 Feb 2022 15:29:18 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=7b7945d586
                      pod-template-generation=2
Annotations:          cni.projectcalico.org/podIP: 10.1.217.7/32
                      cni.projectcalico.org/podIPs: 10.1.217.7/32
Status:               Pending
IP:                   10.1.217.7
IPs:
  IP:           10.1.217.7
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:  containerd://e90a5290af9849d31299cdf0925ab13beeb97d882ad95f5ceb5df294ca8c2299
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
    Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Running
      Started:      Thu, 24 Feb 2022 15:29:18 +0000
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rt4sr (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:
    Image:         nvcr.io/nvidia/k8s/container-toolkit:1.5.0-ubuntu18.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      /usr/local/nvidia
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      RUNTIME_ARGS:              --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/containerd-template.toml
      CONTAINERD_CONFIG:         /var/snap/microk8s/3027/args/containerd-template.toml
      CONTAINERD_SOCKET:         /var/snap/microk8s/common/run/containerd.sock
      NVIDIA_DRIVER_ROOT:        /
      RUNTIME:                   containerd
      CONTAINERD_RUNTIME_CLASS:  nvidia
    Mounts:
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from nvidia-local (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rt4sr (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /
    HostPathType:
  nvidia-local:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/3027/args
    HostPathType:
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/run
    HostPathType:
  kube-api-access-rt4sr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:                      <none>

shivamerla commented 2 years ago

@joedborg sorry for the delay, this is not expected. It looks like you have the driver pre-installed and driver-validation is not succeeding. What does kubectl logs nvidia-container-toolkit-daemonset-s7mv4 -n gpu-operator-resources -c driver-validation show? Once this and the toolkit setup validation pass (and the /run/nvidia/validations/toolkit-ready file is created), the rest of the components will run.
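You can also check directly on the node which status files have been written so far (illustrative commands, run on the GPU node itself, paths as discussed above):

# list the status files the validation init containers write
ls -l /run/nvidia/validations/
# toolkit-ready should appear once the toolkit validation has passed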

joedborg commented 2 years ago

I was running this with --set driver.enabled=false. If I run on a fresh system with no drivers pre-installed and --set driver.enabled=true, I see

NAMESPACE                NAME                                                              READY   STATUS                      RESTARTS   AGE
kube-system              pod/coredns-64c6478b6c-bpktm                                      1/1     Running                     0          4m50s
kube-system              pod/calico-kube-controllers-57b68d66b7-qh2fd                      1/1     Running                     0          5m54s
kube-system              pod/calico-node-96xt6                                             1/1     Running                     0          5m53s
kube-system              pod/hostpath-provisioner-7764447d7c-46b7l                         1/1     Running                     0          3m35s
default                  pod/gpu-operator-node-feature-discovery-master-5f6fb954cf-nz2tn   1/1     Running                     0          102s
default                  pod/gpu-operator-node-feature-discovery-worker-kkxc4              1/1     Running                     0          102s
gpu-operator-resources   pod/nvidia-operator-validator-cp9gc                               0/1     Init:0/4                    0          82s
gpu-operator-resources   pod/nvidia-device-plugin-daemonset-7tv52                          0/1     Init:0/1                    0          82s
gpu-operator-resources   pod/nvidia-dcgm-nv8nf                                             0/1     Init:0/1                    0          82s
gpu-operator-resources   pod/nvidia-dcgm-exporter-bxpqz                                    0/1     Init:0/1                    0          82s
gpu-operator-resources   pod/gpu-feature-discovery-jvttv                                   0/1     Init:0/1                    0          82s
default                  pod/gpu-operator-7d9854fc59-lhrhn                                 1/1     Running                     0          102s
gpu-operator-resources   pod/nvidia-container-toolkit-daemonset-98v9b                      0/1     Init:CreateContainerError   0          83s
gpu-operator-resources   pod/nvidia-driver-daemonset-2lzzl                                 0/1     Init:CreateContainerError   0          83s

NAMESPACE                NAME                                                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
default                  service/kubernetes                                   ClusterIP   10.152.183.1     <none>        443/TCP                  6m
kube-system              service/kube-dns                                     ClusterIP   10.152.183.10    <none>        53/UDP,53/TCP,9153/TCP   4m50s
default                  service/gpu-operator-node-feature-discovery-master   ClusterIP   10.152.183.36    <none>        8080/TCP                 102s
default                  service/gpu-operator                                 ClusterIP   10.152.183.230   <none>        8080/TCP                 83s
gpu-operator-resources   service/nvidia-dcgm-exporter                         ClusterIP   10.152.183.77    <none>        9400/TCP                 82s

NAMESPACE                NAME                                                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                                      AGE
kube-system              daemonset.apps/calico-node                                  1         1         1       1            1           kubernetes.io/os=linux                             5m58s
default                  daemonset.apps/gpu-operator-node-feature-discovery-worker   1         1         1       1            1           <none>                                             102s
gpu-operator-resources   daemonset.apps/nvidia-container-toolkit-daemonset           1         1         0       1            0           nvidia.com/gpu.deploy.container-toolkit=true       83s
gpu-operator-resources   daemonset.apps/nvidia-driver-daemonset                      1         1         0       1            0           nvidia.com/gpu.deploy.driver=true                  83s
gpu-operator-resources   daemonset.apps/nvidia-operator-validator                    1         1         0       1            0           nvidia.com/gpu.deploy.operator-validator=true      83s
gpu-operator-resources   daemonset.apps/nvidia-device-plugin-daemonset               1         1         0       1            0           nvidia.com/gpu.deploy.device-plugin=true           82s
gpu-operator-resources   daemonset.apps/nvidia-dcgm                                  1         1         0       1            0           nvidia.com/gpu.deploy.dcgm=true                    82s
gpu-operator-resources   daemonset.apps/nvidia-mig-manager                           0         0         0       0            0           nvidia.com/gpu.deploy.mig-manager=true             82s
gpu-operator-resources   daemonset.apps/nvidia-dcgm-exporter                         1         1         0       1            0           nvidia.com/gpu.deploy.dcgm-exporter=true           82s
gpu-operator-resources   daemonset.apps/gpu-feature-discovery                        1         1         0       1            0           nvidia.com/gpu.deploy.gpu-feature-discovery=true   82s

NAMESPACE     NAME                                                         READY   UP-TO-DATE   AVAILABLE   AGE
kube-system   deployment.apps/calico-kube-controllers                      1/1     1            1           5m58s
kube-system   deployment.apps/coredns                                      1/1     1            1           4m50s
kube-system   deployment.apps/hostpath-provisioner                         1/1     1            1           4m20s
default       deployment.apps/gpu-operator-node-feature-discovery-master   1/1     1            1           102s
default       deployment.apps/gpu-operator                                 1/1     1            1           102s

NAMESPACE     NAME                                                                    DESIRED   CURRENT   READY   AGE
kube-system   replicaset.apps/calico-kube-controllers-57b68d66b7                      1         1         1       5m54s
kube-system   replicaset.apps/coredns-64c6478b6c                                      1         1         1       4m50s
kube-system   replicaset.apps/hostpath-provisioner-7764447d7c                         1         1         1       3m35s
default       replicaset.apps/gpu-operator-node-feature-discovery-master-5f6fb954cf   1         1         1       102s
default       replicaset.apps/gpu-operator-7d9854fc59                                 1         1         1       102s

But I cannot see the logs because the container is stuck in CreateContainerError:

Error from server (BadRequest): container "driver-validation" in pod "nvidia-container-toolkit-daemonset-98v9b" is waiting to start: CreateContainerError

And the description:

Name:                 nvidia-container-toolkit-daemonset-98v9b
Namespace:            gpu-operator-resources
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 ip-172-31-93-86/172.31.93.86
Start Time:           Thu, 10 Mar 2022 15:35:21 +0000
Labels:               app=nvidia-container-toolkit-daemonset
                      controller-revision-hash=c9b97d994
                      pod-template-generation=1
Annotations:          cni.projectcalico.org/podIP: 10.1.42.8/32
                      cni.projectcalico.org/podIPs: 10.1.42.8/32
Status:               Pending
IP:                   10.1.42.8
IPs:
  IP:           10.1.42.8
Controlled By:  DaemonSet/nvidia-container-toolkit-daemonset
Init Containers:
  driver-validation:
    Container ID:
    Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      nvidia-validator
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Environment:
      WITH_WAIT:  true
      COMPONENT:  driver
    Mounts:
      /run/nvidia/driver from driver-install-path (rw)
      /run/nvidia/validations from run-nvidia-validations (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c756f (ro)
Containers:
  nvidia-container-toolkit-ctr:
    Container ID:
    Image:         nvcr.io/nvidia/k8s/container-toolkit:1.5.0-ubuntu18.04
    Image ID:
    Port:          <none>
    Host Port:     <none>
    Args:
      /usr/local/nvidia
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      RUNTIME_ARGS:              --socket /runtime/sock-dir/containerd.sock --config /runtime/config-dir/containerd-template.toml
      CONTAINERD_CONFIG:         /var/snap/microk8s/3027/args/containerd-template.toml
      CONTAINERD_SOCKET:         /var/snap/microk8s/common/run/containerd.sock
      NVIDIA_DRIVER_ROOT:        /run/nvidia/driver
      RUNTIME:                   containerd
      CONTAINERD_RUNTIME_CLASS:  nvidia
    Mounts:
      /run/nvidia from nvidia-run-path (rw)
      /runtime/config-dir/ from containerd-config (rw)
      /runtime/sock-dir/ from containerd-socket (rw)
      /usr/local/nvidia from nvidia-local (rw)
      /usr/share/containers/oci/hooks.d from crio-hooks (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-c756f (ro)
Conditions:
  Type              Status
  Initialized       False
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  nvidia-run-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia
    HostPathType:  DirectoryOrCreate
  run-nvidia-validations:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/validations
    HostPathType:  DirectoryOrCreate
  driver-install-path:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/driver
    HostPathType:
  nvidia-local:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/local/nvidia
    HostPathType:
  crio-hooks:
    Type:          HostPath (bare host directory volume)
    Path:          /run/containers/oci/hooks.d
    HostPathType:
  containerd-config:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/3027/args
    HostPathType:
  containerd-socket:
    Type:          HostPath (bare host directory volume)
    Path:          /var/snap/microk8s/common/run
    HostPathType:
  kube-api-access-c756f:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              nvidia.com/gpu.deploy.container-toolkit=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  2m49s                 default-scheduler  Successfully assigned gpu-operator-resources/nvidia-container-toolkit-daemonset-98v9b to ip-172-31-93-86
  Normal   Pulling    2m48s                 kubelet            Pulling image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2"
  Normal   Pulled     2m34s                 kubelet            Successfully pulled image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" in 14.04703106s
  Warning  Failed     2m34s                 kubelet            Error: failed to generate container "4f9f599983966384f285f14cada1b56b7697cb05d5219fcba3f815d0fc3249b2" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     2m34s                 kubelet            Error: failed to generate container "9ae463e9eb291597c64191ffa175fb1dd8ab72439c7d7335b11782cbe3efc24c" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     2m19s                 kubelet            Error: failed to generate container "66d33c3b7143ed00aae5ccb9e3bdf9214c289501a1d9983de20aa06e18976644" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     2m6s                  kubelet            Error: failed to generate container "76f3088ba6376929b437f708efd4bb88e8e41a4e1f9808a3dd0fe4f60649c816" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     111s                  kubelet            Error: failed to generate container "cfe8cf6a07f43d3eccfd801b97b5a42cb9e7562fe77a8853d94687d092abfd8b" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     98s                   kubelet            Error: failed to generate container "9bbc6e7d6e62c584ca31e07b027ba13396d738c41e518a8d944efc2f6f89a107" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     84s                   kubelet            Error: failed to generate container "bd60deaa5d50ac280454018460344a3d0414bd32568f7c489794e7f29c949ca0" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     70s                   kubelet            Error: failed to generate container "26d1b9c36e832bd277d18181da753434c39d928d188296c1a0f12fbac02eb401" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Warning  Failed     55s                   kubelet            Error: failed to generate container "31cd1f32e0afa96a2f79ed30b4e32f5305d9ef384202141908a59a80b46578f9" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount
  Normal   Pulled     17s (x11 over 2m34s)  kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
  Warning  Failed     17s (x3 over 44s)     kubelet            (combined from similar events): Error: failed to generate container "6fbbe3224db75958b67a2cb588abe907494a959b5f8f9cb2e8da58a89c357cab" spec: failed to generate spec: path "/run/nvidia/validations" is mounted on "/run" but it is not a shared mount

shivamerla commented 2 years ago

@joedborg the reason we had to add /run/nvidia/validations with Bidirectional mountPropagation is that the toolkit and operator-validator init containers write a bunch of status files there, which need to be accessible to other containers such as device-plugin, gfd, dcgm, etc. Without Bidirectional propagation, the status files were not getting updated in other containers until a restart. Is this causing an issue with MicroK8s?
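For reference, the "not a shared mount" error in the events above is raised when the host mount backing the hostPath volume is not a shared mount, which Bidirectional propagation requires. A quick check on the node (illustrative; the MicroK8s snap environment may behave differently):

# show the propagation flag of the mount backing /run
findmnt -o TARGET,PROPAGATION /run
# if it reports "private", it can usually be switched to shared with:
sudo mount --make-rshared /run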