Hi! I've installed all the software from the docs https://github.com/AliyunContainerService/gpushare-scheduler-extender/blob/master/docs/install.md

I've configured all the docker/k8s components, but the scheduler still can't assign the pod to a node. Scheduler output:

0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.

Everything seems to be running correctly on my nodes. My typical gpu-node outputs:

node-gpu01 kubelet[69306]: I0804 17:53:28.207639 69306 setters.go:283] Update capacity for aliyun.com/gpu-mem to 31
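(A quick way to read that advertised capacity back from the node object, using the node name from the kubelet line above:

kubectl describe node node-gpu01 | grep gpu-mem

It reports the same value of 31 under both Capacity and Allocatable.)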
docker nvidia-smi:
node-gpu01:~# docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
Unable to find image 'nvidia/cuda:10.0-base' locally
10.0-base: Pulling from nvidia/cuda
7ddbc47eeb70: Pull complete
c1bbdc448b72: Pull complete
8c3b70e39044: Pull complete
45d437916d57: Pull complete
d8f1569ddae6: Pull complete
de5a2c57c41d: Pull complete
ea6f04a00543: Pull complete
Digest: sha256:e6e1001f286d084f8a3aea991afbcfe92cd389ad1f4883491d43631f152f175e
Status: Downloaded newer image for nvidia/cuda:10.0-base
Tue Aug  4 14:08:26 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:3B:00.0 Off |                    0 |
| N/A   32C    P0    25W / 250W |     12MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
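For completeness, docker on the gpu-nodes uses nvidia as the default runtime, set up per the install docs; my /etc/docker/daemon.json is essentially the following (the runtime binary path may differ on your distro):

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}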
So, here is the Pod gpu-player, using the exact same image as in the demo video, which can't be scheduled due to an Insufficient aliyun.com/gpu-mem error:
kubectl -n gpu-test describe pod gpu-player-f576f5dd4-njhrs
Name:               gpu-player-f576f5dd4-njhrs
Namespace:          gpu-test
Priority:           100
PriorityClassName:  default-priority
Node:               <none>
Labels:             app=gpu-player
                    pod-template-hash=f576f5dd4
Annotations:        <none>
Status:             Pending
IP:
Controlled By:      ReplicaSet/gpu-player-f576f5dd4
Containers:
  gpu-player:
    Image:      cheyang/gpu-player
    Port:       <none>
    Host Port:  <none>
    Limits:
      aliyun.com/gpu-mem:  512
    Requests:
      aliyun.com/gpu-mem:  512
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-mjdsm (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  default-token-mjdsm:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-mjdsm
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 pool=automated-moderation:NoSchedule
Events:
  Type     Reason            Age                    From               Message
  ----     ------            ----                   ----               -------
  Warning  FailedScheduling  3m15s (x895 over 17h)  default-scheduler  0/72 nodes are available: 72 Insufficient aliyun.com/gpu-mem.
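For reference, the deployment is essentially the one from the demo; this is a sketch reconstructed from the describe output above (name, labels, namespace and resource limit as shown there; the remaining fields are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-player
  namespace: gpu-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gpu-player
  template:
    metadata:
      labels:
        app: gpu-player
    spec:
      containers:
      - name: gpu-player
        image: cheyang/gpu-player
        resources:
          limits:
            aliyun.com/gpu-mem: 512   # 512 units of gpu-mem, as in the demo

(For extended resources like aliyun.com/gpu-mem, setting only the limit also sets the request to the same value, which matches the describe output.)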
I didn't find any errors in the logs, but I'm ready to post any logs or versions if necessary. What's wrong?
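P.S. In case it matters: the kube-scheduler is configured with the extender policy from the install docs; my scheduler-policy-config.json is roughly the following (the urlPrefix points at wherever the extender service is exposed in your cluster):

{
  "kind": "Policy",
  "apiVersion": "v1",
  "extenders": [
    {
      "urlPrefix": "http://127.0.0.1:32766/gpushare-scheduler",
      "filterVerb": "filter",
      "bindVerb": "bind",
      "enableHttps": false,
      "nodeCacheCapable": true,
      "managedResources": [
        {
          "name": "aliyun.com/gpu-mem",
          "ignoredByScheduler": false
        }
      ],
      "ignorable": false
    }
  ]
}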