NVIDIA / k8s-dra-driver

Dynamic Resource Allocation (DRA) for NVIDIA GPUs in Kubernetes

DRA driver is not able to allocate gpu #46

Closed parthyadav3105 closed 7 months ago

parthyadav3105 commented 7 months ago

I am working with a kubeadm cluster on a bare-metal server with three NVIDIA A40 GPUs.

Problem: the gpu.nvidia.com driver is not able to allocate a GPU.

My deployment:

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaimTemplate
metadata:
  name: random
spec:
  spec:
    resourceClassName: gpu.nvidia.com

---
apiVersion: v1
kind: Pod
metadata:
  name: dra-sample
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["sleep", "10000"]
    resources:
      claims:
      - name: gpu
  resourceClaims:
  - name: gpu
    source:
      resourceClaimTemplateName: random

Here are the outputs for debugging:

$ kubectl describe pod dra-sample

Name:             dra-sample
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Containers:
  ctr:
    Image:      ubuntu:22.04
    Port:       <none>
    Host Port:  <none>
    Command:
      sleep
      10000
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-x52xb (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-x52xb:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  25s   default-scheduler  0/1 nodes are available: 1 waiting for resource driver to allocate resource.
  Warning  FailedScheduling  23s   default-scheduler  0/1 nodes are available: 1 waiting for resource driver to provide information.

$ kubectl logs -n gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s

I0103 07:33:19.351405       1 controller.go:295] "resource controller: Starting" driver="gpu.resource.nvidia.com"
I0103 07:33:19.351515       1 reflector.go:287] Starting reflector *v1alpha2.ResourceClaim (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351519       1 reflector.go:287] Starting reflector *v1alpha2.PodSchedulingContext (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351538       1 reflector.go:323] Listing and watching *v1alpha2.ResourceClaim from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351544       1 reflector.go:323] Listing and watching *v1alpha2.PodSchedulingContext from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351555       1 reflector.go:287] Starting reflector *v1alpha2.ResourceClass (0s) from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.351572       1 reflector.go:323] Listing and watching *v1alpha2.ResourceClass from k8s.io/client-go/informers/factory.go:150
I0103 07:33:19.357423       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.357445       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.357459       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?limit=500&resourceVersion=0 200 OK in 5 milliseconds
I0103 07:33:19.358836       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/podschedulingcontexts?allowWatchBookmarks=true&resourceVersion=1463537&timeout=7m2s&timeoutSeconds=422&watch=true 200 OK in 0 milliseconds
I0103 07:33:19.358847       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclaims?allowWatchBookmarks=true&resourceVersion=1463541&timeout=5m15s&timeoutSeconds=315&watch=true 200 OK in 1 milliseconds
I0103 07:33:19.359089       1 round_trippers.go:553] GET https://10.96.0.1:443/apis/resource.k8s.io/v1alpha2/resourceclasses?allowWatchBookmarks=true&resourceVersion=1463537&timeout=9m53s&timeoutSeconds=593&watch=true 200 OK in 1 milliseconds
I0103 07:33:19.452100       1 shared_informer.go:344] caches populated
I0103 07:33:27.927510       1 controller.go:241] "resource controller: new object" type="ResourceClaim" content="{\"metadata\":{\"name\":\"dra-sample-gpu-vgbkc\",\"generateName\":\"dra-sample-gpu-\",\"namespace\":\"default\",\"uid\":\"cf3f2558-4f63-402a-a7c3-333b7d928885\",\"resourceVersion\":\"1463627\",\"creationTimestamp\":\"2024-01-03T07:33:27Z\",\"annotations\":{\"resource.kubernetes.io/pod-claim-name\":\"gpu\"},\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"dra-sample\",\"uid\":\"803e19c7-41d6-4331-94a2-e86b3c32b177\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-controller-manager\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2024-01-03T07:33:27Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:resource.kubernetes.io/pod-claim-name\":{}},\"f:generateName\":{},\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"803e19c7-41d6-4331-94a2-e86b3c32b177\\\"}\":{}}},\"f:spec\":{\"f:allocationMode\":{},\"f:resourceClassName\":{}}}}]},\"spec\":{\"resourceClassName\":\"gpu.nvidia.com\",\"allocationMode\":\"WaitForFirstConsumer\"},\"status\":{}}"
I0103 07:33:27.927541       1 controller.go:260] "resource controller: Adding new work item" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927583       1 controller.go:332] "resource controller: processing" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927603       1 controller.go:476] "resource controller: ResourceClaim waiting for first consumer" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.927613       1 controller.go:336] "resource controller: completed" key="claim:default/dra-sample-gpu-vgbkc"
I0103 07:33:27.935496       1 controller.go:241] "resource controller: new object" type="PodSchedulingContext" content="{\"metadata\":{\"name\":\"dra-sample\",\"namespace\":\"default\",\"uid\":\"e8b1fa44-574f-4a92-bdd9-26e33f816689\",\"resourceVersion\":\"1463629\",\"creationTimestamp\":\"2024-01-03T07:33:27Z\",\"ownerReferences\":[{\"apiVersion\":\"v1\",\"kind\":\"Pod\",\"name\":\"dra-sample\",\"uid\":\"803e19c7-41d6-4331-94a2-e86b3c32b177\",\"controller\":true,\"blockOwnerDeletion\":true}],\"managedFields\":[{\"manager\":\"kube-scheduler\",\"operation\":\"Update\",\"apiVersion\":\"resource.k8s.io/v1alpha2\",\"time\":\"2024-01-03T07:33:27Z\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:ownerReferences\":{\".\":{},\"k:{\\\"uid\\\":\\\"803e19c7-41d6-4331-94a2-e86b3c32b177\\\"}\":{}}},\"f:spec\":{\"f:potentialNodes\":{},\"f:selectedNode\":{}}}}]},\"spec\":{\"selectedNode\":\"nm-shakti-worker6\",\"potentialNodes\":[\"nm-shakti-worker6\"]},\"status\":{}}"
I0103 07:33:27.935520       1 controller.go:260] "resource controller: Adding new work item" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.935547       1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.937738       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 2 milliseconds
I0103 07:33:27.939722       1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.939739       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:33:27.939747       1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.940731       1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.944575       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 3 milliseconds
I0103 07:33:57.945164       1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.945186       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:33:57.945198       1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.946163       1 controller.go:332] "resource controller: processing" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.950857       1 round_trippers.go:553] GET https://10.96.0.1:443/api/v1/namespaces/default/pods/dra-sample 200 OK in 4 milliseconds
I0103 07:34:27.951558       1 controller.go:390] "resource controller: ResourceClaim not found, no need to process it" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.951578       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"
I0103 07:34:27.951589       1 controller.go:342] "resource controller: recheck periodically" key="schedulingCtx:default/dra-sample"

$ kubectl logs -n gpu-operator nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm

Defaulted container "plugin" out of: plugin, init (init)
I0103 07:33:20.280685       1 device_state.go:142] using devRoot=/driver-root
I0103 07:33:20.291222       1 nonblockinggrpcserver.go:107] "dra: GRPC server started"
I0103 07:33:20.291293       1 nonblockinggrpcserver.go:107] "registrar: GRPC server started"

We can see that the node does not report any allocatable GPU resources: $ kubectl describe node nm-shakti-worker6

Name:               nm-shakti-worker6
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.CETSS=true
                    feature.node.kubernetes.io/cpu-cpuid.CLZERO=true
                    feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
                    feature.node.kubernetes.io/cpu-cpuid.CPBOOST=true
                    feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.FP256=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSR=true
                    feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBPB=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED=true
                    feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSFFV=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT=true
                    feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE=true
                    feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST=true
                    feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.INVLPGB=true
                    feature.node.kubernetes.io/cpu-cpuid.LAHF=true
                    feature.node.kubernetes.io/cpu-cpuid.LBRVIRT=true
                    feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW=true
                    feature.node.kubernetes.io/cpu-cpuid.MCOMMIT=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
                    feature.node.kubernetes.io/cpu-cpuid.MOVU=true
                    feature.node.kubernetes.io/cpu-cpuid.MSRIRC=true
                    feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH=true
                    feature.node.kubernetes.io/cpu-cpuid.NRIPS=true
                    feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.PPIN=true
                    feature.node.kubernetes.io/cpu-cpuid.PSFD=true
                    feature.node.kubernetes.io/cpu-cpuid.RDPRU=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_ES=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED=true
                    feature.node.kubernetes.io/cpu-cpuid.SEV_SNP=true
                    feature.node.kubernetes.io/cpu-cpuid.SHA=true
                    feature.node.kubernetes.io/cpu-cpuid.SME=true
                    feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT=true
                    feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
                    feature.node.kubernetes.io/cpu-cpuid.SSE4A=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP=true
                    feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON=true
                    feature.node.kubernetes.io/cpu-cpuid.SUCCOR=true
                    feature.node.kubernetes.io/cpu-cpuid.SVM=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMDA=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMFBASID=true
                    feature.node.kubernetes.io/cpu-cpuid.SVML=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMNP=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMPF=true
                    feature.node.kubernetes.io/cpu-cpuid.SVMPFT=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
                    feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
                    feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED=true
                    feature.node.kubernetes.io/cpu-cpuid.TOPEXT=true
                    feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR=true
                    feature.node.kubernetes.io/cpu-cpuid.VAES=true
                    feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN=true
                    feature.node.kubernetes.io/cpu-cpuid.VMPL=true
                    feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT=true
                    feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
                    feature.node.kubernetes.io/cpu-cpuid.VTE=true
                    feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
                    feature.node.kubernetes.io/cpu-cpuid.X87=true
                    feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
                    feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/cpu-model.family=25
                    feature.node.kubernetes.io/cpu-model.id=1
                    feature.node.kubernetes.io/cpu-model.vendor_id=AMD
                    feature.node.kubernetes.io/cpu-rdt.RDTCMT=true
                    feature.node.kubernetes.io/cpu-rdt.RDTL3CA=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMBM=true
                    feature.node.kubernetes.io/cpu-rdt.RDTMON=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-config.PREEMPT=true
                    feature.node.kubernetes.io/kernel-version.full=5.15.0-91-lowlatency
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=15
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/memory-numa=true
                    feature.node.kubernetes.io/network-sriov.capable=true
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-10de.sriov.capable=true
                    feature.node.kubernetes.io/pci-1a03.present=true
                    feature.node.kubernetes.io/pci-8086.present=true
                    feature.node.kubernetes.io/pci-8086.sriov.capable=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    feature.node.kubernetes.io/usb-ef_0b1f_03ee.present=true
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=nm-shakti-worker6
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
                    nvidia.com/cuda.driver.major=525
                    nvidia.com/cuda.driver.minor=147
                    nvidia.com/cuda.driver.rev=05
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=0
                    nvidia.com/dra.controller=true
                    nvidia.com/dra.kubelet-plugin=true
                    nvidia.com/gfd.timestamp=1704222592
                    nvidia.com/gpu-driver-upgrade-state=upgrade-done
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=6
                    nvidia.com/gpu.count=3
                    nvidia.com/gpu.deploy.container-toolkit=true
                    nvidia.com/gpu.deploy.dcgm=true
                    nvidia.com/gpu.deploy.dcgm-exporter=true
                    nvidia.com/gpu.deploy.device-plugin=true
                    nvidia.com/gpu.deploy.driver=pre-installed
                    nvidia.com/gpu.deploy.gpu-feature-discovery=true
                    nvidia.com/gpu.deploy.node-status-exporter=true
                    nvidia.com/gpu.deploy.operator-validator=true
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=AS--4124GS-TNR
                    nvidia.com/gpu.memory=49140
                    nvidia.com/gpu.present=true
                    nvidia.com/gpu.product=NVIDIA-A40
                    nvidia.com/gpu.replicas=1
                    nvidia.com/mig.capable=false
                    nvidia.com/mig.strategy=single
Annotations:        csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"nm-shakti-worker6"}
                    flannel.alpha.coreos.com/backend-data: {"VNI":1,"VtepMAC":"42:b2:da:55:59:b8"}
                    flannel.alpha.coreos.com/backend-type: vxlan
                    flannel.alpha.coreos.com/kube-subnet-manager: true
                    flannel.alpha.coreos.com/public-ip: 10.82.14.19
                    kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.CETSS,cpu-cpuid.CLZERO,cpu-cpuid.CMPXCHG8,cpu-cpuid.CPBOOST,cpu-cpuid...
                    nfd.node.kubernetes.io/master.version: v0.14.2
                    nfd.node.kubernetes.io/worker.version: v0.14.2
                    node.alpha.kubernetes.io/ttl: 0
                    nvidia.com/gpu-driver-upgrade-enabled: true
                    projectcalico.org/IPv4Address: 10.82.14.19/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.244.170.0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Tue, 26 Dec 2023 10:42:19 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  nm-shakti-worker6
  AcquireTime:     <unset>
  RenewTime:       Wed, 03 Jan 2024 07:36:56 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Tue, 02 Jan 2024 19:08:58 +0000   Tue, 02 Jan 2024 19:08:58 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Wed, 03 Jan 2024 07:34:36 +0000   Tue, 26 Dec 2023 10:42:17 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Wed, 03 Jan 2024 07:34:36 +0000   Tue, 26 Dec 2023 10:42:17 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Wed, 03 Jan 2024 07:34:36 +0000   Tue, 26 Dec 2023 10:42:17 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Wed, 03 Jan 2024 07:34:36 +0000   Tue, 02 Jan 2024 17:18:27 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  10.82.14.19
  Hostname:    nm-shakti-worker6
Capacity:
  cpu:                256
  ephemeral-storage:  459778128Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528139536Ki
  nvidia.com/gpu:     0
  pods:               110
Allocatable:
  cpu:                256
  ephemeral-storage:  423731522064
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             528037136Ki
  nvidia.com/gpu:     0
  pods:               110
System Info:
  Machine ID:                 604c16e1b6dc4bf182c91ec14fcce1bd
  System UUID:                7f12d000-0f90-11ed-8000-3cecefeab242
  Boot ID:                    29e635b2-b1cc-491b-af4a-3930fc2bd8d8
  Kernel Version:             5.15.0-91-lowlatency
  OS Image:                   Ubuntu 22.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.14
  Kubelet Version:            v1.29.0
  Kube-Proxy Version:         v1.29.0
PodCIDR:                      10.244.0.0/24
PodCIDRs:                     10.244.0.0/24
Non-terminated Pods:          (24 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  calico-apiserver            calico-apiserver-6598988b78-8bl8p                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-apiserver            calico-apiserver-6598988b78-zq5lh                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               calico-kube-controllers-779fc55954-hbvlw                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               calico-node-hzpnc                                                  0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               calico-typha-6fd5cc6495-wr86s                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  calico-system               csi-node-driver-nqdwf                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
  gpu-operator                gpu-feature-discovery-cdp92                                        0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-1703607135-node-feature-discovery-gc-796f559d2vwht    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-1703607135-node-feature-discovery-master-59bf42qtd    0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-1703607135-node-feature-discovery-worker-qck97        0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                gpu-operator-99d8f4cd7-vn45x                                       200m (0%)     500m (0%)   100Mi (0%)       350Mi (0%)     7d15h
  gpu-operator                nvidia-container-toolkit-daemonset-jqr8v                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                nvidia-dcgm-exporter-khgjc                                         0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d15h
  gpu-operator                nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s    0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m46s
  gpu-operator                nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm       0 (0%)        0 (0%)      0 (0%)           0 (0%)         3m45s
  gpu-operator                nvidia-operator-validator-dtxs2                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         9h
  kube-system                 coredns-76f75df574-lqdtc                                           100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     7d15h
  kube-system                 coredns-76f75df574-xzmqw                                           100m (0%)     0 (0%)      70Mi (0%)        170Mi (0%)     7d15h
  kube-system                 etcd-nm-shakti-worker6                                             100m (0%)     0 (0%)      100Mi (0%)       0 (0%)         7d20h
  kube-system                 kube-apiserver-nm-shakti-worker6                                   250m (0%)     0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 kube-controller-manager-nm-shakti-worker6                          200m (0%)     0 (0%)      0 (0%)           0 (0%)         14h
  kube-system                 kube-proxy-t7gxs                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         7d20h
  kube-system                 kube-scheduler-nm-shakti-worker6                                   100m (0%)     0 (0%)      0 (0%)           0 (0%)         14h
  tigera-operator             tigera-operator-55585899bf-lcljn                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         5d19h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1050m (0%)  500m (0%)
  memory             340Mi (0%)  690Mi (0%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
  nvidia.com/gpu     0           0
Events:              <none>

$ kubectl describe -n gpu-operator nodeallocationstates.nas.gpu.resource.nvidia.com nm-shakti-worker6

Name:         nm-shakti-worker6
Namespace:    gpu-operator
Labels:       <none>
Annotations:  <none>
API Version:  nas.gpu.resource.nvidia.com/v1alpha1
Kind:         NodeAllocationState
Metadata:
  Creation Timestamp:  2024-01-02T21:04:04Z
  Generation:          15
  Owner References:
    API Version:     v1
    Kind:            Node
    Name:            nm-shakti-worker6
    UID:             7195b117-ec11-443e-ba2e-c84bb407708e
  Resource Version:  1463603
  UID:               39a10030-b2bd-4bd4-8a8b-4c5bb7c757c7
Spec:
  Allocatable Devices:
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.6
      Index:                    0
      Memory Bytes:             51527024640
      Mig Enabled:              false
      Product Name:             NVIDIA A40
      Uuid:                     GPU-e060a342-afa6-2f46-7342-52ab49773d47
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.6
      Index:                    1
      Memory Bytes:             51527024640
      Mig Enabled:              false
      Product Name:             NVIDIA A40
      Uuid:                     GPU-e72a049f-0e52-1f3a-4e93-fac0c3ecfe50
    Gpu:
      Architecture:             Ampere
      Brand:                    Nvidia
      Cuda Compute Capability:  8.6
      Index:                    2
      Memory Bytes:             51527024640
      Mig Enabled:              false
      Product Name:             NVIDIA A40
      Uuid:                     GPU-3e9c1e0e-50e4-5121-70dc-438753eeaa1c
Status:                         Ready
Events:                         <none>

How did I install k8s-dra-driver?

To fulfill the prerequisites, I simply used the gpu-operator, which I already had working in the cluster, and just disabled its devicePlugin:

helm upgrade --install gpu-operator-1703607135 --namespace gpu-operator nvidia/gpu-operator --set devicePlugin.enabled=false

To install NVIDIA/k8s-dra-driver, I followed the script (demo/clusters/kind/scripts/build-driver-image.sh) to build the image and then ran:

helm upgrade --install nvidia-dra-device-plugin --namespace gpu-operator k8s-dra-driver/

This gave me the following: $ kubectl get pods -n gpu-operator

NAME                                                              READY   STATUS      RESTARTS      AGE
gpu-feature-discovery-cdp92                                       1/1     Running     1 (12h ago)   7d15h
gpu-operator-1703607135-node-feature-discovery-gc-796f559d2vwht   1/1     Running     2 (12h ago)   7d15h
gpu-operator-1703607135-node-feature-discovery-master-59bf42qtd   1/1     Running     2 (12h ago)   7d15h
gpu-operator-1703607135-node-feature-discovery-worker-qck97       1/1     Running     3 (12h ago)   7d15h
gpu-operator-99d8f4cd7-vn45x                                      1/1     Running     4 (12h ago)   7d15h
nvidia-container-toolkit-daemonset-jqr8v                          1/1     Running     2 (12h ago)   7d15h
nvidia-cuda-validator-n2ltp                                       0/1     Completed   0             9h
nvidia-dcgm-exporter-khgjc                                        1/1     Running     1 (12h ago)   7d15h
nvidia-dra-device-plugin-k8s-dra-driver-controller-c8d57bb4fw6s   1/1     Running     0             7m33s
nvidia-dra-device-plugin-k8s-dra-driver-kubelet-plugin-7txkm      1/1     Running     0             7m32s
nvidia-operator-validator-dtxs2                                   1/1     Running     0             9h

The setup works fine with the default driver from the gpu-operator, but it fails to find and allocate GPUs with NVIDIA/k8s-dra-driver. My installation is probably wrong. What did I miss?

Thanks!

parthyadav3105 commented 7 months ago

I am using Kubernetes v1.29.0.

I0103 07:34:27.951578       1 controller.go:658] "resource controller: Found no pending pod claims" key="schedulingCtx:default/dra-sample"

The above log from the nvidia-dra-controller occurs because, from Kubernetes v1.28 onwards, ResourceClaims created from a ResourceClaimTemplate are named using metadata.generateName (113722, 117351) to prevent name conflicts.
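
For reference, here is roughly the ResourceClaim that gets generated from the template in my cluster, reconstructed from the controller log above and trimmed to the relevant fields. The claim's actual name carries a random suffix, which is presumably why the controller keeps logging "ResourceClaim not found":

apiVersion: resource.k8s.io/v1alpha2
kind: ResourceClaim
metadata:
  name: dra-sample-gpu-vgbkc        # generated via generateName, not a fixed templated name
  generateName: dra-sample-gpu-
  namespace: default
  annotations:
    resource.kubernetes.io/pod-claim-name: gpu   # links the claim back to the pod's claim entry
spec:
  resourceClassName: gpu.nvidia.com
  allocationMode: WaitForFirstConsumer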

The NVIDIA/k8s-dra-driver repository currently uses the older v0.27.0-beta.0 release of the k8s.io/dynamic-resource-allocation package, which does not handle this. Updating the driver to use v0.28.0+ of k8s.io/dynamic-resource-allocation will fix the issue.
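
A minimal sketch of the kind of dependency bump I mean, assuming the driver pulls the helper library in via Go modules (the exact v0.28.x version to target is an assumption):

# run inside the k8s-dra-driver repository; any v0.28.0+ tag of the
# helper library should handle the generateName-based claim names
go get k8s.io/dynamic-resource-allocation@v0.28.0
go mod tidy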

Hence closing.

parthyadav3105 commented 7 months ago

@klueska @elezar I can add this fix, along with support for batch allocation, to the driver if that does not affect any ongoing roadmaps.