Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Display the remaining number of GPUs in node resources #477

Open devenami opened 2 weeks ago

devenami commented 2 weeks ago

1. Issue or feature description

$ kubectl describe node gpu-000-001
Name:               gpu-000-001
Capacity:
  cpu:                128
  ephemeral-storage:  13119414984Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1055950504Ki
  nvidia.com/gpu:     80
  pods:               110
Allocatable:
  cpu:                124
  ephemeral-storage:  12907602632Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             1017153192Ki
  nvidia.com/gpu:     70
  pods:               110
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                4885m (3%)    75 (60%)
  memory             68663Mi (6%)  151Gi (15%)
  ephemeral-storage  0 (0%)        0 (0%)
  hugepages-1Gi      0 (0%)        0 (0%)
  hugepages-2Mi      0 (0%)        0 (0%)
  nvidia.com/gpu     5             5

In the output above, the physical node has 8 GPUs in total. The GPU resources have been expanded by a factor of 10. I deployed 5 Pods on the node, and each Pod occupies a whole GPU.
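
For context, that 10x expansion is what turns 8 physical cards into the 80 slots shown under Capacity, and it is controlled by the device-split setting of the HAMi device plugin. A minimal sketch of setting that multiplier at install time, assuming the Helm repo alias hami-charts and the value name devicePlugin.deviceSplitCount used by recent charts (check your chart version):

$ helm upgrade --install hami hami-charts/hami \
    --set devicePlugin.deviceSplitCount=10 \
    -n kube-system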

I want to see the remaining number of available GPUs on the node, along with some other related information (such as the maximum available virtual GPU memory per card and the remaining number of whole, unallocated GPUs).

That way, when a Pod is stuck in the Pending state, we can determine the reason for the Pending by checking the node information.
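
One way to approximate this today is to read the per-card registration data that HAMi writes onto the node object. A minimal sketch, assuming the annotation key hami.io/node-nvidia-register used by recent HAMi releases (the key and its encoding may differ in your version):

$ kubectl get node gpu-000-001 \
    -o jsonpath='{.metadata.annotations.hami\.io/node-nvidia-register}'

The value typically lists each physical card with its UUID, split count, and memory size; cross-checking it against the scheduler's allocation data gives the remaining whole GPUs, but there is no first-class "remaining GPUs" field in the kubectl describe node output.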

archlitchi commented 1 week ago

refer to this: https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#monitor
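
For anyone landing here later: the monitor referenced above is a Prometheus-style /metrics endpoint served by the HAMi scheduler. A quick way to inspect it, assuming the default scheduler monitor NodePort 31993 (configurable in the Helm chart; metric names vary between versions, hence the loose grep):

$ curl -s http://<scheduler-node-ip>:31993/metrics | grep -i gpu

The per-card allocation metrics exposed there can be scraped into Prometheus/Grafana to chart how many whole GPUs remain free on each node.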