Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

The parameters resourceCores/resourceMem/resourceName do not work #356

Open jxfruit opened 2 weeks ago

jxfruit commented 2 weeks ago


1. Issue or feature description

As the title describes, none of the self-defined resource values take effect.

2. Steps to reproduce the issue

Deployed with the command:

helm install hami hami-charts/hami --set resourceCores=xx/vcuda-core --set resourceMem=xx/vcuda-memory --set scheduler.kubeScheduler.imageTag=v1.22.8 -n kube-system

Pod YAML:

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          #nvidia.com/gpu: 2
          #nvidia.com/gpumem: 3000
          #nvidia.com/gpucores: 33
          #xx/vcuda-memory: 1
          xx/vcuda-core: 1
          #nvidia.com/gpumem: 3000

The task above fails with the error:

Error: endpoint not found in cache for a registered resource: xx/vcuda-core
Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected
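One way to check whether the device plugin actually registered the renamed resources is to inspect the node's allocatable list (the node name is a placeholder; the expected entries follow the install flags above):

kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
# if xx/vcuda-core and xx/vcuda-memory are absent while nvidia.com/* entries
# are present, the device plugin is still advertising the default names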

The ConfigMap for hami-scheduler-newversion:

apiVersion: v1
data:
  config.yaml: |
    apiVersion: kubescheduler.config.k8s.io/v1beta2
    kind: KubeSchedulerConfiguration
    leaderElection:
      leaderElect: false
    profiles:
    - schedulerName: hami-scheduler
    extenders:
    - urlPrefix: "https://127.0.0.1:443"
      filterVerb: filter
      bindVerb: bind
      nodeCacheCapable: true
      weight: 1
      httpTimeout: 30s
      enableHTTPS: true
      tlsConfig:
        insecure: true
      managedResources:
      - name: nvidia.com/gpu
        ignoredByScheduler: true
      - name: xx/vcuda-memory
        ignoredByScheduler: true
      - name: xx/vcuda-core
        ignoredByScheduler: true
      - name: nvidia.com/gpumem-percentage
        ignoredByScheduler: true
      - name: nvidia.com/priority
        ignoredByScheduler: true
      - name: cambricon.com/vmlu
        ignoredByScheduler: true
      - name: hygon.com/dcunum
        ignoredByScheduler: true
      - name: hygon.com/dcumem
        ignoredByScheduler: true
      - name: hygon.com/dcucores
        ignoredByScheduler: true
      - name: iluvatar.ai/vgpu
        ignoredByScheduler: true
      - name: huawei.com/Ascend910-memory
        ignoredByScheduler: true
      - name: huawei.com/Ascend910
        ignoredByScheduler: true
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: hami
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2024-06-16T12:41:49Z"
  labels:
    app.kubernetes.io/component: hami-scheduler
    app.kubernetes.io/instance: hami
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: hami
    app.kubernetes.io/version: 0.0.2
    helm.sh/chart: hami-2.0.0
  name: hami-scheduler-newversion
  namespace: kube-system
  resourceVersion: "72005391"
  uid: 7c8697ce-114c-4e16-9732-d4ccf7290e6b
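The custom names do appear under managedResources above, so the extender configuration itself looks consistent; one possibility is that the running scheduler pod never reloaded it. A minimal sketch, assuming the scheduler runs as a Deployment named hami-scheduler in kube-system (name inferred from the chart labels, not confirmed in this thread):

kubectl -n kube-system rollout restart deployment/hami-scheduler
# forces the scheduler pod to restart and re-read the mounted ConfigMap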

3. Information to attach (optional if deemed irrelevant)

HAMi image version: v2.3.12 and latest; both failed.


archlitchi commented 2 weeks ago

Thanks, we will fix that soon.

lengrongfu commented 2 weeks ago

@archlitchi are you working on fixing this issue?

lengrongfu commented 2 weeks ago

@jxfruit can you try again with the YAML below?

apiVersion: v1
kind: Pod
metadata:
  name: xlabfe73ef20d3cc329522779f35dd1ebaa4
spec:
  restartPolicy: OnFailure
  containers:
    - name: xlabfe73ef20d3cc329522779f35dd1ebaa4
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 1
          xx/vcuda-core: 1
jxfruit commented 2 weeks ago

@lengrongfu it still fails when creating the pod, with the error: Node didn't have enough resource: xx/vcuda-core, requested: 1, used: 0, capacity: 0

BTW, the docs (https://github.com/Project-HAMi/HAMi/blob/master/docs/config.md) say the install parameter 'devicePlugin.deviceSplitCount' makes each physical GPU be advertised as N vGPUs, but after setting it, pods that request those vGPUs still fail.

So is there something I missed?
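For reference, a minimal sketch of setting the documented devicePlugin.deviceSplitCount parameter on an existing release (the count 10 is illustrative, not from this thread):

helm upgrade hami hami-charts/hami -n kube-system --reuse-values \
  --set devicePlugin.deviceSplitCount=10
# each physical GPU should then be advertised as 10 schedulable vGPUs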

lengrongfu commented 2 weeks ago

Could you provide the info below:

jxfruit commented 2 weeks ago

The pod describe info shows the error. Command: kubectl describe pod/xlabfe73ef20d3cc329522779f35dd1ebaa4

Name:          xlabfe73ef20d3cc329522779f35dd1ebaa4
Namespace:     default
Priority:      0
Node:          inter-app2/
Start Time:    Tue, 18 Jun 2024 15:36:08 +0800
Labels:
Annotations:   hami.io/bind-phase: allocating
               hami.io/bind-time: 1718696168
               hami.io/vgpu-devices-allocated: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:;
               hami.io/vgpu-devices-to-allocate: GPU-5a6f5eeb-ad9e-7ac9-a40f-30ba7bbc5713,NVIDIA,15109,1:;
               hami.io/vgpu-node: inter-app2
               hami.io/vgpu-time: 1718696168
Status:        Failed
Reason:        UnexpectedAdmissionError
Message:       Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected
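The hami.io/* annotations in this output are where the scheduler records its allocation decision; they can also be dumped directly (pod name as above):

kubectl get pod xlabfe73ef20d3cc329522779f35dd1ebaa4 -o jsonpath='{.metadata.annotations}'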

The message changed... after I tried a few more times, I got: Warning FailedScheduling 25s hami-scheduler binding rejected: node inter-app2 has been locked within 5 minutes
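The lock mentioned in the warning is held on the node while a binding is in flight; if a previous attempt failed, a stale lock may have to expire before scheduling can succeed. One way to look for it is to dump the node's annotations (the exact annotation key HAMi uses is not shown in this thread):

kubectl get node inter-app2 -o jsonpath='{.metadata.annotations}'
# look for any hami.io/* entry left over from the failed binding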

lengrongfu commented 2 weeks ago

What is the value of the nvidia.com/gpu field in your limits? I see one pod with a value of 10 and another with a value of 1.

The value of this field in the limits cannot exceed the number of native devices on the node.
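As a minimal sketch of a request that satisfies this constraint on a node with two physical GPUs (the pod name and count are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-example
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-limit-example
      image: "swr.cn-east-3.myhuaweicloud.com/senseyu/exam-gpu:dev_1.0"
      command: ["bash", "-c", "sleep 100000"]
      resources:
        limits:
          nvidia.com/gpu: 2  # must not exceed the node's device count (times deviceSplitCount, if set)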

jxfruit commented 2 weeks ago

@lengrongfu the two pods are different situations; I was just responding to your questions. When only 'nvidia.com/gpu: 10' is used, the pod stays Pending because not enough resources can be allocated. When 'xx/vcuda-core' is added, the pod throws the error: Pod Allocate failed due to can't allocate unregistered device xx/vcuda-core, which is unexpected

lengrongfu commented 1 week ago

I tested this case and could not reproduce it:

kind: Deployment
apiVersion: apps/v1
metadata:
  name: vgpu-deployment
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: vgpu-test
          image: chrstnhntschl/gpu_burn
          args:
            - '6000'
          resources:
            limits:
              nvidia.com/vgpu: '1'
              test/vcuda-core: '50'
              test/vcuda-memory: '4680'
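If this deployment schedules successfully, one way to check that the limits are enforced inside the container (assuming HAMi's in-container limiting is active) is:

kubectl exec deploy/vgpu-deployment -- nvidia-smi
# the reported memory should reflect the test/vcuda-memory limit
# rather than the device's full capacity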
jxfruit commented 1 week ago

Can you share your config.yaml or install command? I'll try it again.
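One convenient way to share the install configuration is to have Helm print the user-supplied overrides for the release:

helm get values hami -n kube-system
# prints only values overridden at install/upgrade time;
# add --all to include chart defaults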

lengrongfu commented 1 week ago
jxfruit commented 1 week ago

Sorry, deploying with your values.yaml failed; the chart is incomplete. Can you share the complete Helm chart you used for installation?

BTW, in my testing the resourceName field does work with a custom value. However, it still cannot allocate more vGPUs than the node physically has.
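A sketch combining the two observations, renaming the resources and raising the split count in one upgrade so that more vGPUs than physical devices can be requested (parameter names from docs/config.md as cited above; the xx/vgpu name and count 10 are illustrative):

helm upgrade hami hami-charts/hami -n kube-system --reuse-values \
  --set resourceName=xx/vgpu \
  --set resourceCores=xx/vcuda-core \
  --set resourceMem=xx/vcuda-memory \
  --set devicePlugin.deviceSplitCount=10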