AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

not able to find dev with index #175

Open southquist opened 2 years ago

southquist commented 2 years ago

Hello everyone,

I'm using the device plugin on a machine that has 8 GPUs. I am able to use GPUs 0-3 without any issue, but pods scheduled to GPU indexes 4-7 fail to start.

Some more info on my setup.
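This is roughly how the deployment requests GPU memory (trimmed to the relevant parts, with the second container and the startup command left out):

$ cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tensorflow-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: carla-server
  template:
    metadata:
      labels:
        app: carla-server
    spec:
      nodeSelector:
        kubernetes.io/hostname: gpu-node1
      containers:
      - name: carla
        image: example.com:7586/carla/carla-u20r12-master:123
        resources:
          limits:
            aliyun.com/gpu-mem: 7   # GPU memory in GiB, as exposed by the gpushare device plugin
EOF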

Error from kubectl describe pod

  Warning  Failed     0s (x3 over 19s)   kubelet            Error: failed to create containerd task: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init
caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: no-gpu-has-7MiB-to-run: unknown device: unknown
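Since the failure comes from nvidia-container-cli, I assume the first thing worth checking on the node is whether that CLI itself (the same one the prestart hook runs) sees all eight devices:

$ nvidia-container-cli info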

And this is the error from the gpushare-device-plugin logs.

W0503 15:04:17.958722       1 allocate.go:101] Failed to find the dev for pod &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:tensorflow-deployment-5df95f79df-bpsbb,GenerateName:tensorflow-deployment-5df
95f79df-,Namespace:default,SelfLink:,UID:4783f5d8-0300-4561-88de-278f2ce748e9,ResourceVersion:208676685,Generation:0,CreationTimestamp:2022-05-03 15:04:14 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labe
ls:map[string]string{app: carla-server,pod-template-hash: 5df95f79df,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: false,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1651590257533230558,ALIYUN_COM_GPU_MEM_DEV: 22,ALIYUN_CO
M_GPU_MEM_IDX: 4,ALIYUN_COM_GPU_MEM_POD: 7,kubernetes.io/psp: global-unrestricted-psp,},OwnerReferences:[{apps/v1 ReplicaSet tensorflow-deployment-5df95f79df 92bc5776-a51e-4066-949d-7c183b5d2d57 0xc42080fcc9 0xc42080fcca}],F
inalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{kube-api-access-hhhl6 {nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil ProjectedVolumeSource{Sources:[{nil nil n
il ServiceAccountTokenProjection{Audience:,ExpirationSeconds:*3607,Path:token,}} {nil nil &ConfigMapProjection{LocalObjectReference:LocalObjectReference{Name:kube-root-ca.crt,},Items:[{ca.crt ca.crt <nil>}],Optional:nil,} ni
l} {nil &DownwardAPIProjection{Items:[{namespace ObjectFieldSelector{APIVersion:v1,FieldPath:metadata.namespace,} nil <nil>}],} nil nil}],DefaultMode:*420,} nil nil nil}}],Containers:[{carla example.com:7586/carla/carla-u20r\
12-master:123 [/bin/bash] [-c unset SDL_VIDEODRIVER & /home/carla/CarlaUE4.sh -vulkan -RenderOffscreen -graphicsadapter=0 -nosound -carla-rpc-port=2000]  [{ 0 2000 TCP }
 { 0 2001 TCP }] [] [] {map[aliyun.com/gpu-mem:{{7 0} {<nil>} 7 DecimalSI}] map[aliyun.com/gpu-mem:{{7 0} {<nil>} 7 DecimalSI}]} [{kube-api-access-hhhl6 true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil
nil /dev/termination-log File IfNotPresent nil false false false} {bridge example.com:7586/kubernetes/carla-bridge-small:0.1 [sleep] [infinity]  [{ 0 63136 TCP } { 0 63137 TCP }] [] [{USER carla nil} {CA
RLA_SERVER_HOST localhost nil} {CARLA_SERVER_PORT 2000 nil}] {map[] map[]} [{kube-api-access-hhhl6 true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false f
alse false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{kubernetes.io/hostname: gpu-node1,},ServiceAccountName:default,DeprecatedSe
rviceAccount:default,NodeName:gpu-node1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sy
sctls:[],},ImagePullSecrets:[{gitlab-credentials}],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{nvidia.com/gpu Exists  NoSchedule <nil>} {
node.kubernetes.io/not-ready Exists  NoExecute 0xc4209fe360} {node.kubernetes.io/unreachable Exists  NoExecute 0xc4209fe380}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGat
es:[],},Status:PodStatus{Phase:Pending,Conditions:[{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-05-03 15:04:17 +0000 UTC  }],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,I
nitContainerStatuses:[],NominatedNodeName:,},} because it's not able to find dev with index 4
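So the scheduler extender does assign the pod to GPU index 4; the annotations it writes can also be read straight off the pod (the values here are just the ones from the log above):

$ kubectl get pod tensorflow-deployment-5df95f79df-bpsbb -o yaml | grep ALIYUN_COM_GPU_MEM
    ALIYUN_COM_GPU_MEM_ASSIGNED: "false"
    ALIYUN_COM_GPU_MEM_ASSUME_TIME: "1651590257533230558"
    ALIYUN_COM_GPU_MEM_DEV: "22"
    ALIYUN_COM_GPU_MEM_IDX: "4"
    ALIYUN_COM_GPU_MEM_POD: "7"

It's the device plugin on the node that then can't map index 4 to an actual device when the container is started.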

But I do see all the GPUs just fine with kubectl-inspect-gpushare

NAME        IPADDRESS      GPU0(Allocated/Total)  GPU1(Allocated/Total)  GPU2(Allocated/Total)  GPU3(Allocated/Total)  GPU4(Allocated/Total)  GPU5(Allocated/Total)  GPU6(Allocated/Total)  GPU7(Allocated/Total)  GPU Memory(GiB)
gpu-node1   ***.**.***.**  21/22                  21/22                  21/22                  21/22                  21/22                  21/22                  21/22                  21/22                  168/176

Has anyone else seen this issue, or any idea what might be causing it?

southquist commented 2 years ago

I've done some further testing on a second machine that also has 8 GPUs, but of a different model, and I can use all 8 cards there without any issue.

These are the cards on the machine that works:

$ lspci | grep NVIDIA
2d:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
32:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
5b:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
5f:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
b5:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
be:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
df:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)
e7:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 32GB] (rev a1)

And these are the cards on the machine where I can only use the first 4 GPUs:

$ lspci | grep NVIDIA
1d:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
23:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
43:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
49:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
b4:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
ba:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
e0:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
e6:00.0 3D controller: NVIDIA Corporation TU102GL (rev a1)
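If it helps with debugging, the indexes and UUIDs the driver reports (and which the device plugin should be enumerating) can be compared between the two machines with:

$ nvidia-smi --query-gpu=index,name,uuid,memory.total --format=csv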