NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

DaemonSet creation fails on charmed-kubernetes #364

Open gschwim opened 2 years ago

gschwim commented 2 years ago


1. Issue or feature description

Following the documented install procedure for gpu-operator on a fresh charmed-kubernetes install, the gpu-operator pod running on the node reports the following error:

Couldn't create DaemonSet: ... Forbidden: disallowed by cluster policy

As a result, no GPU resources become available to the cluster.
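The install itself followed the standard Helm procedure from the NVIDIA docs; roughly the following (the exact flags are assumed here, and the release name is auto-generated, matching the gpu-operator-1656460182 pods below):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator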

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

Pod status (kubectl get pods --all-namespaces):

NAMESPACE                         NAME                                                              READY   STATUS    RESTARTS        AGE
abn                               nginx-37a47647-899f6ff4c-v6mg6                                    1/1     Running   1 (3h15m ago)   6h7m
default                           cuda-vectoradd                                                    0/1     Pending   0               3h43m
gpu-operator                      gpu-operator-1656460182-node-feature-discovery-master-7d6cpjfwd   1/1     Running   0               11m
gpu-operator                      gpu-operator-1656460182-node-feature-discovery-worker-k2hd7       1/1     Running   0               11m
gpu-operator                      gpu-operator-77787587cf-57mgn                                     1/1     Running   0               11m
ingress-nginx-kubernetes-worker   default-http-backend-kubernetes-worker-6cd58d8886-h5xjl           1/1     Running   2 (2m38s ago)   6h7m
ingress-nginx-kubernetes-worker   nginx-ingress-controller-kubernetes-worker-kpfjm                  1/1     Running   1 (3h15m ago)   5h12m
kube-system                       coredns-5564855696-79vr9                                          1/1     Running   1 (3h15m ago)   6h7m
kube-system                       kube-state-metrics-5ccbcf64d5-2tqr7                               1/1     Running   1 (3h15m ago)   6h7m
kube-system                       metrics-server-v0.5.1-79b4746b65-sbbbl                            2/2     Running   2 (3h15m ago)   6h7m
kube-system                       tiller-deploy-74bcf4c66c-2vnlc                                    1/1     Running   0               141m
kubernetes-dashboard              dashboard-metrics-scraper-5cd54464bf-zf8b9                        1/1     Running   1 (3h15m ago)   6h7m
kubernetes-dashboard              kubernetes-dashboard-55796c99c-vnhlm                              1/1     Running   1 (3h15m ago)   6h7m

DaemonSet status (kubectl get ds --all-namespaces):

NAMESPACE                         NAME                                                    DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                        AGE
gpu-operator                      gpu-operator-1656460182-node-feature-discovery-worker   1         1         1       1            1           <none>                               14m
ingress-nginx-kubernetes-worker   nginx-ingress-controller-kubernetes-worker              1         1         1       1            1           juju-application=kubernetes-worker   11d

The pod cannot get a GPU resource; the same pod schedules fine when the NVIDIA drivers are installed directly on the host instead. Output of kubectl describe pod cuda-vectoradd:

Name:         cuda-vectoradd
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  kubernetes.io/psp: privileged
Status:       Pending
IP:
IPs:          <none>
Containers:
  cuda-vectoradd:
    Image:      nvidia/samples:vectoradd-cuda11.2.1
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ggcpf (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ggcpf:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                     From               Message
  ----     ------            ----                    ----               -------
  Warning  FailedScheduling  117s (x216 over 3h47m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
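For reference, the test pod is the standard CUDA vectoradd sample; a minimal manifest reconstructed from the describe output above (field values taken from that output, everything else left at defaults) is:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
  namespace: default
spec:
  containers:
  - name: cuda-vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF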
Logs from the gpu-operator controller pod:

1.6564608582434602e+09  INFO    controllers.ClusterPolicy   GPU workload configuration  {"NodeName": "number1", "GpuWorkloadConfig": "container"}
1.6564608582435489e+09  INFO    controllers.ClusterPolicy   Checking GPU state labels on the node   {"NodeName": "number1"}
1.6564608582435687e+09  INFO    controllers.ClusterPolicy   Number of nodes with GPU label  {"NodeCount": 1}
1.6564608582436178e+09  INFO    controllers.ClusterPolicy   Using container runtime: containerd
1.6564608582436502e+09  INFO    controllers.ClusterPolicy   Found Resource, updating... {"RuntimeClass": "nvidia"}
1.6564608582491097e+09  INFO    controllers.ClusterPolicy   INFO: ClusterPolicy step completed  {"state:": "pre-requisites", "status": "ready"}
1.6564608582492526e+09  INFO    controllers.ClusterPolicy   Found Resource, updating... {"Service": "gpu-operator", "Namespace": "gpu-operator"}
1.6564608582618704e+09  INFO    controllers.ClusterPolicy   INFO: ClusterPolicy step completed  {"state:": "state-operator-metrics", "status": "ready"}
1.6564608582673767e+09  INFO    controllers.ClusterPolicy   Found Resource, skipping update {"ServiceAccount": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608582728472e+09  INFO    controllers.ClusterPolicy   Found Resource, updating... {"Role": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608582828317e+09  INFO    controllers.ClusterPolicy   Found Resource, updating... {"ClusterRole": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608582917275e+09  INFO    controllers.ClusterPolicy   Found Resource, updating... {"RoleBinding": "nvidia-driver", "Namespace": "gpu-operator"}
1.6564608583003638e+09  INFO    controllers.ClusterPolicy   Found Resource, updating... {"ClusterRoleBinding": "nvidia-driver", "Namespace": "gpu-operator"}
1.656460858304446e+09   INFO    controllers.ClusterPolicy   5.4.0-121-generic   {"Request.Namespace": "default", "Request.Name": "Node"}
1.656460858304628e+09   INFO    controllers.ClusterPolicy   DaemonSet not found, creating   {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Name": "nvidia-driver-daemonset"}
1.656460858309278e+09   INFO    controllers.ClusterPolicy   Couldn't create DaemonSet   {"DaemonSet": "nvidia-driver-daemonset", "Namespace": "gpu-operator", "Name": "nvidia-driver-daemonset", "Error": "DaemonSet.apps \"nvidia-driver-daemonset\" is invalid: [spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy, spec.template.spec.initContainers[0].securityContext.privileged: Forbidden: disallowed by cluster policy]"}
1.6564608583093338e+09  ERROR   controller.clusterpolicy-controller Reconciler error    {"name": "cluster-policy", "namespace": "", "error": "DaemonSet.apps \"nvidia-driver-daemonset\" is invalid: [spec.template.spec.containers[0].securityContext.privileged: Forbidden: disallowed by cluster policy, spec.template.spec.initContainers[0].securityContext.privileged: Forbidden: disallowed by cluster policy]"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227

The remaining files requested by the issue template do not exist on this system.

shivamerla commented 2 years ago

@gschwim It looks like the PodSecurityPolicy admission controller is enabled on your cluster. You can install with --set psp.enabled=true so that we create and use appropriate PSPs with the required permissions.
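For example (all flags other than --set psp.enabled=true are just the standard install options):

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --set psp.enabled=true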

gschwim commented 2 years ago

Hi @shivamerla - Thanks for the reply. I did try --set psp.enabled=true in several of my test iterations, but it didn't appear to make any difference. Is there something that needs to be done in addition to this to take advantage of it?

shivamerla commented 2 years ago

@gschwim Can you run kubectl get psp and confirm that the PSPs are created by the GPU Operator? The nvidia-driver ServiceAccount is bound to the gpu-operator-privileged PSP, which should allow this. Can you also paste the error again with PSP enabled?
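For example (the PSP and ServiceAccount names below are the ones referenced above):

kubectl get psp
# verify the nvidia-driver ServiceAccount is allowed to use the privileged PSP
kubectl auth can-i use podsecuritypolicies/gpu-operator-privileged \
    --as=system:serviceaccount:gpu-operator:nvidia-driver -n gpu-operator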