NTHU-LSALAB / KubeShare

Share GPU between Pods in Kubernetes
Apache License 2.0

How to configure kubeshare-config.yaml? #22

Closed jungyh0218 closed 1 year ago

jungyh0218 commented 1 year ago

Hello. I recently migrated from KubeShare 1.0 to KubeShare 2.0, and it seems that many major things changed in 2.0. I am trying to use 2.0, but the pod always fails to be scheduled. I suspect the cause is a misconfigured kubeshare-config.yaml file. When I run 'kubectl describe pod' on the pod, I get an error message like this:

Warning  FailedScheduling  9s    kubeshare-scheduler  0/3 nodes are available: 1 [Filter] Node gpu01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node gpu02 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0, 1 [Filter] Node mgmt01 doesn't meet the gpu request of pod default/pod1(2c2d8706-4b74-4063-a3b4-4542ee8e7089) in Filter, totalCPUs: 0, totalMemory: 0, totalFits: 0.

I have a GPU cluster, and here is its physical structure: [image: gpu]

And here is the config file I wrote. What is wrong with it?

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02

justin0u0 commented 1 year ago

Hi @jungyh0218, I think the config file should be as follows, with isNodeLevel: true added to the cell type:

cellTypes:
  T4-NODE:
    childCellType: "Tesla-T4"
    childCellNumber: 2
    isNodeLevel: true

cells:
- cellType: T4-NODE
  cellChildren:
  - cellId: gpu01
  - cellId: gpu02

Also make sure that the childCellType name matches the GPU model name shown by the nvidia-smi -L command, with spaces converted to hyphens (-).
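
To illustrate the naming rule above, here is a minimal Python sketch that turns `nvidia-smi -L` output into candidate childCellType names. It assumes the usual line format `GPU <index>: <model> (UUID: <uuid>)`. The function name `cell_type_names` and the sample UUIDs are made up for illustration and are not part of KubeShare.

```python
import re
from typing import List

# Assumed line format of `nvidia-smi -L`: "GPU 0: Tesla T4 (UUID: GPU-...)".
_GPU_LINE = re.compile(r"^GPU \d+: (?P<model>.+?) \(UUID: .+\)\s*$")


def cell_type_names(nvidia_smi_output: str) -> List[str]:
    """Return the distinct GPU model names with spaces replaced by hyphens."""
    names: List[str] = []
    for line in nvidia_smi_output.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match is None:
            continue  # skip lines that do not describe a GPU
        name = match.group("model").strip().replace(" ", "-")
        if name not in names:
            names.append(name)
    return names


# Illustrative output only; the UUIDs below are placeholders.
sample = """GPU 0: Tesla T4 (UUID: GPU-00000000-0000-0000-0000-000000000000)
GPU 1: Tesla T4 (UUID: GPU-11111111-1111-1111-1111-111111111111)"""

print(cell_type_names(sample))  # ['Tesla-T4']
```

On a GPU node, the real output of nvidia-smi -L could be passed to the same function. The resulting string is what childCellType would need to match under the rule above.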

Ref: https://github.com/NTHU-LSALAB/KubeShare/blob/master/doc/deploy.md