the parameters resourceCores/resourceMem/resourceName cannot work

jxfruit commented 2 weeks ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

as the title described, all self-defined values cannot work

2. Steps to reproduce the issue

deployed with cmd: helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system yaml:

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          #nvidia.com/gpu: 2
          #nvidia.com/gpumem: 3000
          #nvidia.com/gpucores: 33
          #xx/vcuda-memory: 1
          xx/vcuda-core: 1
          #nvidia.com/gpumem: 3000

above the task got error: Error: endpoint not found in cache for a registered resource: xx/vcuda-core Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

cm about hami-scheduler-newversion:

apiVersion: v1
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: xx/vcuda-memory
        ignoredByScheduler: true
      - name: xx/vcuda-core
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
      - name: huawei.com/Ascend910-memory
        ignoredByScheduler: true
      - name: huawei.com/Ascend910
        ignoredByScheduler: true
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-06-16T12:41:49Z"
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 0.0.2
    helm.sh/chart: hami-2.0.0
  name: hami-scheduler-newversion
  namespace: kube-system
  resourceVersion: "72005391"
  uid: 7c8697ce-114c-4e16-9732-d4ccf7290e6b

3. Information to attach (optional if deemed irrelevant)

hami image version: v2.3.12 and latest all failed Common error checking:

[ ] The output of nvidia-smi -a on your host
[ ] Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
[ ] The vgpu-device-plugin container logs
[ ] The vgpu-scheduler container logs
[ ] The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

[ ] Docker version from docker version
[ ] Docker command, image and tag used
[ ] Kernel version from uname -a
[ ] Any relevant kernel output lines from dmesg

archlitchi commented 2 weeks ago

thanks ,we will fix that soon

lengrongfu commented 2 weeks ago

@archlitchi are you in the fix this issue?

lengrongfu commented 2 weeks ago

@jxfruit can you try again using below yaml content to test.

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
          xx/vcuda-core: 1

jxfruit commented 2 weeks ago

@lengrongfu still failed when create pod, got error: Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0

BTW, the doc(https://github.com/Project-HAMi/HAMi/blob/master/docs/config.md) said that the install parameter 'devicePlugin.deviceSplitCount' will generate N vgpu, but after setting, I create pod use the vgpu, which will failed.

So is there something I missed?

lengrongfu commented 2 weeks ago

Hope to provides below info:

[ ] provide node yaml content, node having registry nvidia.com/gpu=10 to allocatable and capacity?
[ ] provide you used hami version info?
[ ] Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0 This error is what components appear?

jxfruit commented 2 weeks ago

[ ] apiVersion: v1 kind: Pod metadata: name: xlabfe73ef20d3cc329522779f35dd1ebaa4 spec: restartPolicy: OnFailure containers:
- name: xlabfe73ef20d3cc329522779f35dd1ebaa4 image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0" command: ["bash", "-c", "sleep 100000"] resources: limits: nvidia.com/gpu: 10
[ ] hami image version: v2.3.12 and latest all failed
[ ] I am confused. I install hami with the command: helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system And I create pod with yaml: apiVersion: v1 kind: Pod metadata: name: xlabfe73ef20d3cc329522779f35dd1ebaa4 spec: restartPolicy: OnFailure containers:
- name: xlabfe73ef20d3cc329522779f35dd1ebaa4 image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0" command: ["bash", "-c", "sleep 100000"] resources: limits: nvidia.com/gpu: 1 xx/vcuda-core: 1

the pod describe info get the error info with command: kubectl describe pod/xlabfe73ef20d3cc329522779f35dd1ebaa4 Name: xlabfe73ef20d3cc329522779f35dd1ebaa4 Namespace: default Priority: 0 Node: inter-app2/ Start Time: Tue, 18 Jun 2024 15:36:08 +0800 Labels: Annotations: hami.io/bind-phase: allocating hami.io/bind-time: 1718696168 hami.io/vgpu-devices-allocated: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:; hami.io/vgpu-devices-to-allocate: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:; hami.io/vgpu-node: inter-app2 hami.io/vgpu-time: 1718696168 Status: Failed Reason: UnexpectedAdmissionError Message: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

the message changed.... After I tried a few times, I got: Warning FailedScheduling 25s hami-scheduler binding rejected: node inter-app2 has been locked within 5 minutes

lengrongfu commented 2 weeks ago

nvidia.com/gpu: 10 this field in the limits config value is what？ because i look have one pod value is 10， another pod value is 1.

this field in the limits value cannot more then node native device number.

jxfruit commented 2 weeks ago

@lengrongfu the two pods are different situations. I just make a response for your questions. When only use ‘nvidia.com/gpu: 10’, the pod will be pending for the sake of no enough resource can be allocated. When add 'xx/vcuda-core' , the pod will throw the error: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

lengrongfu commented 1 week ago

I test this case, don't recurrence?

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'

jxfruit commented 1 week ago

I test this case, don't recurrence?

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'

can u share config.yaml or install command ? I try it again

lengrongfu commented 1 week ago

helm -n nvidia-vgpu get values nvidia-vgpu; you can refer to below value yaml, i am not found helm install command.

hami:
ascendResourceMem: huawei.com/Ascend910-memory
ascendResourceName: huawei.com/Ascend910
dcuResourceCores: hygon.com/dcucores
dcuResourceMem: hygon.com/dcumem
dcuResourceName: hygon.com/dcunum
devicePlugin:
deviceCoreScaling: 1
deviceMemoryScaling: 1
deviceSplitCount: 10
disablecorelimit: "false"
extraArgs:
- -v=false
hygonImageRepository: 4pdosc/vdcu-device-plugin
hygonImageTag: v1.0
hygondriver: /root/dcu-driver/dtk-22.10.1-vdcu
hygonimage: 4pdosc/vdcu-device-plugin:v1.0
hygonnodeSelector:
  dcu: "on"
imagePullPolicy: IfNotPresent
libPath: /usr/local/vgpu
migStrategy: none
mlunodeSelector:
  mlu: "on"
monitorctrPath: /usr/local/vgpu/containers
monitorimage: projecthami/hami
nvidianodeSelector:
  nvidia.com/gpu.deploy.container-toolkit: "true"
  nvidia.com/vgpu.deploy.device-plugin: "true"
pluginPath: /var/lib/kubelet/device-plugins
podAnnotations: {}
registry: docker.m.daocloud.io
repository: projecthami/hami
runtimeClassName: ""
service:
  httpPort: 31992
tolerations: []
fullnameOverride: ""
global:
annotations: {}
gpuHookPath: /usr/local
labels: {}
iluvatarResourceCore: iluvatar.ai/vcuda-core
iluvatarResourceMem: iluvatar.ai/vcuda-memory
iluvatarResourceName: iluvatar.ai/vgpu
imagePullSecrets: []
mluResourceCores: cambricon.com/mlu.smlu.vcore
mluResourceMem: cambricon.com/mlu.smlu.vmemory
mluResourceName: cambricon.com/vmlu
nameOverride: ""
podSecurityPolicy:
enabled: false
resourceCores: test/vcuda-core
resourceMem: test/vcuda-memory
resourceMemPercentage: nvidia.com/gpumem-percentage
resourceName: nvidia.com/vgpu
resourcePriority: nvidia.com/priority
resources:
limits:
  cpu: 500m
  memory: 720Mi
requests:
  cpu: 100m
  memory: 128Mi
scheduler:
customWebhook:
  enabled: false
  host: 127.0.0.1
  path: /webhook
  port: 31998
defaultCores: 0
defaultGPUNum: 1
defaultMem: 0
defaultSchedulerPolicy:
  gpuSchedulerPolicy: spread
  nodeSchedulerPolicy: binpack
extender:
  extraArgs:
  - --debug
  - -v=4
  imagePullPolicy: IfNotPresent
  registry: docker.m.daocloud.io
  repository: projecthami/hami
kubeScheduler:
  enabled: true
  extraArgs:
  - --policy-config-file=/config/config.json
  - -v=4
  extraNewArgs:
  - --config=/config/config.yaml
  - -v=4
  imagePullPolicy: IfNotPresent
  imageTag: v1.24.0
  registry: k8s-gcr.m.daocloud.io
  repository: kubernetes/kube-scheduler
leaderElect: true
metricsBindAddress: :9395
mutatingWebhookConfiguration:
  failurePolicy: Ignore
nodeName: ""
nodeSelector:
  nvidia.com/gpu.deploy.container-toolkit: "true"
patch:
  imagePullPolicy: IfNotPresent
  newRepository: liangjw/kube-webhook-certgen
  newTag: v1.1.1
  nodeSelector: {}
  podAnnotations: {}
  priorityClassName: ""
  registry: docker.io
  repository: jettech/kube-webhook-certgen
  runAsUser: 2000
  tag: v1.5.2
  tolerations: []
podAnnotations: {}
service:
  annotations: {}
  httpPort: 443
  labels: {}
  monitorPort: 31993
  schedulerPort: 31998
serviceMonitor:
  enable: false
tolerations: []
schedulerName: hami-scheduler
version: v2.3.11

jxfruit commented 1 week ago

Sorry, failed to use your values.yaml to deploy. The chart is incomplete. Can u share the total installing helm charts?

BTW, in my testing, field resourceName works when custom value. However, it cannot allocate the vgpu which is greater than the current node.

Project-HAMi / HAMi