jxfruit opened this issue 2 weeks ago
Thanks, we will fix that soon.
@archlitchi are you working on fixing this issue?
@jxfruit can you try again using the YAML below to test?
apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
          xx/vcuda-core: 1
@lengrongfu it still failed when creating the pod, with the error: Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0
BTW, the doc (https://github.com/Project-HAMi/HAMi/blob/master/docs/config.md) says that the install parameter 'devicePlugin.deviceSplitCount' will generate N vGPUs, but after setting it, creating a pod that uses the vGPU still fails.
So is there something I missed?
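For reference, one way to check whether deviceSplitCount took effect is to look at the node's Capacity and Allocatable in kubectl describe node. With deviceSplitCount: 10 on a single-GPU node and the custom resource names from the install command above, the section should look roughly like the sketch below; the numbers are illustrative, not taken from this cluster.

Capacity:
  nvidia.com/gpu:   10     # deviceSplitCount (10) x physical GPUs (1), assuming the default resourceName nvidia.com/gpu
  xx/vcuda-core:    100    # should be non-zero once the device plugin has registered the custom resource
  xx/vcuda-memory:  32768  # illustrative value
Allocatable:
  nvidia.com/gpu:   10
  xx/vcuda-core:    100
  xx/vcuda-memory:  32768

The reported capacity: 0 for xx/vcuda-core in the error above suggests the device plugin never advertised that renamed resource on the node.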
Hope you can provide the info below:
Is nvidia.com/gpu=10 added to the node's allocatable and capacity?
Which component reports the error "Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0"?
apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
hami image version: tried v2.3.12 and the latest, both failed.
I am confused. I installed hami with the command:
helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system
And I created the pod with this YAML:
apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
The pod describe output shows the error (kubectl describe pod/xlabfe73ef20d3cc329522779f35dd1ebaa4):
Name: xlabfe73ef20d3cc329522779f35dd1ebaa4
Namespace: default
Priority: 0
Node: inter-app2/
Start Time: Tue, 18 Jun 2024 15:36:08 +0800
Labels:
The message changed after I tried a few times; I got: Warning FailedScheduling 25s hami-scheduler binding rejected: node inter-app2 has been locked within 5 minutes
What does the nvidia.com/gpu: 10 field in the limits mean here? I see one pod with the value 10 and another pod with the value 1.
The value of this field in the limits cannot be more than the number of native devices on the node.
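As a concrete illustration of that constraint (a sketch assuming the custom resource names from the install command and a node with a single physical GPU; the core and memory numbers are only examples), a limits block like the one below stays within the node's physical device count, whereas nvidia.com/gpu: 10 on the same node would ask for more vGPUs in one container than there are physical GPUs:

resources:
  limits:
    nvidia.com/gpu: 1       # number of vGPUs for this container; not more than the physical GPUs on the node
    xx/vcuda-core: 50       # core share, resource name set via --set resourceCores=xx/vcuda-core (example value)
    xx/vcuda-memory: 4680   # device memory, resource name set via --set resourceMem=xx/vcuda-memory (example value)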
@lengrongfu the two pods are different situations; I was just responding to your questions. When only 'nvidia.com/gpu: 10' is used, the pod stays Pending because not enough resources can be allocated. When 'xx/vcuda-core' is added, the pod throws the error: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected.
I tested this case and could not reproduce it. Here is what I used:
kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'
Can you share your config.yaml or install command? I will try it again.
hami:
  ascendResourceMem: huawei.com/Ascend910-memory
  ascendResourceName: huawei.com/Ascend910
  dcuResourceCores: hygon.com/dcucores
  dcuResourceMem: hygon.com/dcumem
  dcuResourceName: hygon.com/dcunum
  devicePlugin:
    deviceCoreScaling: 1
    deviceMemoryScaling: 1
    deviceSplitCount: 10
    disablecorelimit: "false"
    extraArgs:
      - -v=false
    hygonImageRepository: 4pdosc/vdcu-device-plugin
    hygonImageTag: v1.0
    hygondriver: /root/dcu-driver/dtk-22.10.1-vdcu
    hygonimage: 4pdosc/vdcu-device-plugin:v1.0
    hygonnodeSelector:
      dcu: "on"
    imagePullPolicy: IfNotPresent
    libPath: /usr/local/vgpu
    migStrategy: none
    mlunodeSelector:
      mlu: "on"
    monitorctrPath: /usr/local/vgpu/containers
    monitorimage: projecthami/hami
    nvidianodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: "true"
      nvidia.com/vgpu.deploy.device-plugin: "true"
    pluginPath: /var/lib/kubelet/device-plugins
    podAnnotations: {}
    registry: docker.m.daocloud.io
    repository: projecthami/hami
    runtimeClassName: ""
    service:
      httpPort: 31992
    tolerations: []
  fullnameOverride: ""
  global:
    annotations: {}
    gpuHookPath: /usr/local
    labels: {}
  iluvatarResourceCore: iluvatar.ai/vcuda-core
  iluvatarResourceMem: iluvatar.ai/vcuda-memory
  iluvatarResourceName: iluvatar.ai/vgpu
  imagePullSecrets: []
  mluResourceCores: cambricon.com/mlu.smlu.vcore
  mluResourceMem: cambricon.com/mlu.smlu.vmemory
  mluResourceName: cambricon.com/vmlu
  nameOverride: ""
  podSecurityPolicy:
    enabled: false
  resourceCores: test/vcuda-core
  resourceMem: test/vcuda-memory
  resourceMemPercentage: nvidia.com/gpumem-percentage
  resourceName: nvidia.com/vgpu
  resourcePriority: nvidia.com/priority
  resources:
    limits:
      cpu: 500m
      memory: 720Mi
    requests:
      cpu: 100m
      memory: 128Mi
  scheduler:
    customWebhook:
      enabled: false
      host: 127.0.0.1
      path: /webhook
      port: 31998
    defaultCores: 0
    defaultGPUNum: 1
    defaultMem: 0
    defaultSchedulerPolicy:
      gpuSchedulerPolicy: spread
      nodeSchedulerPolicy: binpack
    extender:
      extraArgs:
        - --debug
        - -v=4
      imagePullPolicy: IfNotPresent
      registry: docker.m.daocloud.io
      repository: projecthami/hami
    kubeScheduler:
      enabled: true
      extraArgs:
        - --policy-config-file=/config/config.json
        - -v=4
      extraNewArgs:
        - --config=/config/config.yaml
        - -v=4
      imagePullPolicy: IfNotPresent
      imageTag: v1.24.0
      registry: k8s-gcr.m.daocloud.io
      repository: kubernetes/kube-scheduler
    leaderElect: true
    metricsBindAddress: ":9395"
    mutatingWebhookConfiguration:
      failurePolicy: Ignore
    nodeName: ""
    nodeSelector:
      nvidia.com/gpu.deploy.container-toolkit: "true"
    patch:
      imagePullPolicy: IfNotPresent
      newRepository: liangjw/kube-webhook-certgen
      newTag: v1.1.1
      nodeSelector: {}
      podAnnotations: {}
      priorityClassName: ""
      registry: docker.io
      repository: jettech/kube-webhook-certgen
      runAsUser: 2000
      tag: v1.5.2
      tolerations: []
    podAnnotations: {}
    service:
      annotations: {}
      httpPort: 443
      labels: {}
      monitorPort: 31993
      schedulerPort: 31998
    serviceMonitor:
      enable: false
    tolerations: []
  schedulerName: hami-scheduler
  version: v2.3.11
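For comparison, the --set flags in the reporter's install command translate directly into the following minimal values override; this is only a mechanical translation of the flags, not a complete or verified configuration:

resourceCores: xx/vcuda-core
resourceMem: xx/vcuda-memory
scheduler:
  kubeScheduler:
    imageTag: v1.22.8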
Sorry, I failed to deploy using your values.yaml; the chart seems incomplete. Can you share the full Helm chart you used for the install?
BTW, in my testing, the resourceName field does work with a custom value. However, it cannot allocate more vGPUs than the number of physical GPUs on the current node.
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Issue or feature description
As the title says, none of the self-defined resource values work.
2. Steps to reproduce the issue
Deployed with the command: helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system, then created the pod with this yaml:
The task above got the error: Error: endpoint not found in cache for a registered resource: xx/vcuda-core, and Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected.
The hami-scheduler-newversion ConfigMap:
3. Information to attach (optional if deemed irrelevant)
hami image version: v2.3.12 and latest, both failed

Common error checking:
- nvidia-smi -a on your host
- /etc/docker/daemon.json
- sudo journalctl -r -u kubelet

Additional information that might help better understand your environment and reproduce the bug:
- docker version
- uname -a
- dmesg