Please provide an in-depth description of the question you have: I created two Pods, each configured in its YAML with nvidia.com/gpu: 1 and nvidia.com/gpumem: 10000. Pod-1 starts normally and the GPU resources inside the container match expectations, but Pod-2's Describe shows that hami-scheduler failed to schedule it: "no available node, all node scores do not meet".

Node metrics: (screenshot in the original post did not upload)
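For context, the resource section of each Pod looked roughly like this (a reconstruction from the values quoted above; the Pod name, container name, and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod-1            # placeholder name
spec:
  containers:
    - name: cuda-test        # placeholder container
      image: nvidia/cuda:12.2.0-base-ubuntu22.04   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1        # one vGPU slice
          nvidia.com/gpumem: 10000 # GPU memory in MB (HAMi's gpumem resource)
```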
Run describe node on the GPU node and check what nvidia.com/gpu is set to.
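A minimal way to do that (replace <node-name> with the GPU node's name):

```bash
kubectl describe node <node-name> | grep -i nvidia.com/gpu
```

On this node, the relevant sections show: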
Capacity:
  nvidia.com/gpu:  1
Allocatable:
  nvidia.com/gpu:  1
Allocated resources:
  Resource        Requests  Limits
  --------        --------  ------
  nvidia.com/gpu  1         1
PhilixHe: @Nimbus318 Here are the GPU node's annotations:
Annotations: hami.io/node-handshake: Requesting_2024.11.04 03:46:54
hami.io/node-handshake-dcu: Deleted_2024.10.29 07:16:59
hami.io/node-nvidia-register: GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806,1,24576,100,NVIDIA-NVIDIA GeForce RTX 3090,0,true:
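The second field of hami.io/node-nvidia-register is worth a close look. Assuming HAMi's comma-separated per-device encoding (UUID, split count, memory in MB, core percentage, device type, NUMA node, health), the string decodes as:

```
GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806   # device UUID
1       # split count: at most 1 container may share this card
24576   # device memory in MB (24 GB)
100     # core percentage
NVIDIA-NVIDIA GeForce RTX 3090   # device type
0       # NUMA node
true    # healthy
```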
And the describe output of the hami-device-plugin Pod:
Name: hami-device-plugin-zjpbr
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Service Account: hami-device-plugin
Node: 10.35.211.17/10.35.211.17
Start Time: Thu, 31 Oct 2024 20:49:02 +0800
Labels: app.kubernetes.io/component=hami-device-plugin
app.kubernetes.io/instance=hami
app.kubernetes.io/name=hami
controller-revision-hash=6896d767b5
hami.io/webhook=ignore
pod-template-generation=2
Annotations: kubernetes.io/limit-ranger:
LimitRanger plugin set: cpu, memory request for container device-plugin; cpu, memory limit for container device-plugin; cpu, memory reques...
Status: Running
IP: 10.35.211.17
IPs:
IP: 10.35.211.17
Controlled By: DaemonSet/hami-device-plugin
Containers:
device-plugin:
Container ID: containerd://660f23ad5d2e9f74c43a7cef33faa700c3649e40f787931453c437b48ade4fdd
Image: projecthami/hami:v2.4.0
Image ID: projecthami/hami@sha256:b0c27ece46b20d307f858c6dd385276fb094d058b151df9533143a0db71bd574
Port: <none>
Host Port: <none>
Command:
nvidia-device-plugin
--resource-name=nvidia.com/gpu
--mig-strategy=none
--device-memory-scaling=1
--device-cores-scaling=1
--device-split-count=1
--disable-core-limit=false
-v=false
State: Running
Started: Thu, 31 Oct 2024 20:49:08 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 500m
memory: 1Gi
Environment:
NODE_NAME: (v1:spec.nodeName)
NVIDIA_MIG_MONITOR_DEVICES: all
HOOK_PATH: /usr/local
Mounts:
/config from deviceconfig (rw)
/tmp from hosttmp (rw)
/usr/local/vgpu from lib (rw)
/usrbin from usrbin (rw)
/var/lib/kubelet/device-plugins from device-plugin (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z82bt (ro)
vgpu-monitor:
Container ID: containerd://9e9fc50a8f08051741e1139c19c85cb0c186ee1f6cc88be09106e0b67f161544
Image: projecthami/hami:v2.4.0
Image ID: projecthami/hami@sha256:b0c27ece46b20d307f858c6dd385276fb094d058b151df9533143a0db71bd574
Port: <none>
Host Port: <none>
Command:
vGPUmonitor
State: Running
Started: Thu, 31 Oct 2024 20:49:13 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 4Gi
Requests:
cpu: 500m
memory: 1Gi
Environment:
NVIDIA_VISIBLE_DEVICES: all
NVIDIA_MIG_MONITOR_DEVICES: all
HOOK_PATH: /usr/local/vgpu
Mounts:
/hostvar from hostvar (rw)
/run/containerd from containerds (rw)
/run/docker from dockers (rw)
/sysinfo from sysinfo (rw)
/usr/local/vgpu/containers from ctrs (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-z82bt (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
ctrs:
Type: HostPath (bare host directory volume)
Path: /usr/local/vgpu/containers
HostPathType:
hosttmp:
Type: HostPath (bare host directory volume)
Path: /tmp
HostPathType:
dockers:
Type: HostPath (bare host directory volume)
Path: /run/docker
HostPathType:
containerds:
Type: HostPath (bare host directory volume)
Path: /run/containerd
HostPathType:
device-plugin:
Type: HostPath (bare host directory volume)
Path: /var/lib/kubelet/device-plugins
HostPathType:
lib:
Type: HostPath (bare host directory volume)
Path: /usr/local/vgpu
HostPathType:
usrbin:
Type: HostPath (bare host directory volume)
Path: /usr/bin
HostPathType:
sysinfo:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType:
hostvar:
Type: HostPath (bare host directory volume)
Path: /var
HostPathType:
deviceconfig:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: hami-device-plugin
Optional: false
kube-api-access-z82bt:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: gpu=on
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/network-unavailable:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists
node.kubernetes.io/pid-pressure:NoSchedule op=Exists
node.kubernetes.io/unreachable:NoExecute op=Exists
node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events: <none>
When you deployed, you most likely set deviceSplitCount to 1 in the chart's values (you can see it land in the device-plugin args above as --device-split-count=1). This setting controls how many containers can share a single card at the same time; the default should be 10. HAMi registers the node's nvidia.com/gpu capacity as physical GPUs × deviceSplitCount, which is why Capacity shows 1 above. With it set to 1, the card can only be used by one Pod, so the second Pod can only stay Pending. Raise the value and the second Pod should schedule normally.
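A sketch of the relevant values.yaml setting, assuming the HAMi v2.4.0 chart layout in which devicePlugin.deviceSplitCount feeds the --device-split-count flag shown above:

```yaml
devicePlugin:
  deviceSplitCount: 10   # max containers that may share one physical GPU; chart default is 10
```

Applied with a helm upgrade (release name and namespace taken from the describe output above; the hami-charts repo alias is an assumption based on HAMi's install docs):

```bash
helm upgrade hami hami-charts/hami \
  --set devicePlugin.deviceSplitCount=10 \
  -n kube-system
```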
@Nimbus318 Got it, I'll give that a try.