NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes
Apache License 2.0

Mixed MIG strategy reported wrong labels prefix (MIG_TYPE mig-3g.21gb) #7

Closed andiariffin closed 3 years ago

andiariffin commented 3 years ago

Hi, I was deploying a few clusters with A100 GPUs using DeepOps 20.12. The mixed MIG strategy is used, as defined in the DeepOps configuration. However, I noticed that GFD reported some wrong labels, e.g.:

$ kubectl describe nodes a100-node6 | grep mig
                    nvidia.com/mig-3g.21gb.count=16
                    nvidia.com/mig-3g.21gb.engines.copy=3
                    nvidia.com/mig-3g.21gb.engines.decoder=2
                    nvidia.com/mig-3g.21gb.engines.encoder=0
                    nvidia.com/mig-3g.21gb.engines.jpeg=0
                    nvidia.com/mig-3g.21gb.engines.ofa=0
                    nvidia.com/mig-3g.21gb.memory=20096
                    nvidia.com/mig-3g.21gb.multiprocessors=42
                    nvidia.com/mig-3g.21gb.slices.ci=3
                    nvidia.com/mig-3g.21gb.slices.gi=3
                    nvidia.com/mig.strategy=mixed
  nvidia.com/mig-3g.20gb:  16
  nvidia.com/mig-3g.20gb:  16
  nvidia.com/mig-3g.20gb  0               0

The last three lines are under Capacity, Allocatable, and Allocated resources respectively, and they are already correct. However, the Labels use the wrong prefix (mig-3g.21gb should be mig-3g.20gb).

Although this issue does not seem to break any K8s functionality when deploying pods with MIG, it would be nice to have the GPU nodes properly labeled.

klueska commented 3 years ago

Hi @andiariffin. Thanks for the report.

A bug was fixed in the device plugin some time ago to divide by 1024 instead of 1000 here: https://github.com/NVIDIA/k8s-device-plugin/blob/master/mig-strategy.go#L208

It looks like that change didn't make it into gpu-feature-discovery though: https://github.com/NVIDIA/gpu-feature-discovery/blob/master/mig-strategy.go#L276

Ideally there would be one library that both of these pulled from for this, but unfortunately that is not the state of things yet. In any case, we will push a fix out for this soon. Thanks again for reporting.
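For illustration, the effect of the divisor on the profile name can be sketched as below. This is a minimal sketch, not the actual code from either repository: `migProfileName` is a hypothetical helper, and rounding the size up is an assumption, chosen because it is consistent with the values in the report (20096 truncates to 20 and 19 respectively, but rounds up to 21 and 20).

```go
package main

import "fmt"

// migProfileName is a hypothetical helper, not the GFD or device plugin code.
// It builds a name such as "mig-3g.20gb" from the GPU instance slice count
// and the memory size in MiB, rounding the size up to whole units of divisor.
// Rounding up is an assumption made for this sketch.
func migProfileName(giSlices, memoryMiB, divisor int) string {
	gb := (memoryMiB + divisor - 1) / divisor // integer ceiling division
	return fmt.Sprintf("mig-%dg.%dgb", giSlices, gb)
}

func main() {
	// Value of the nvidia.com/mig-3g.21gb.memory label in the report above.
	const memoryMiB = 20096

	fmt.Println(migProfileName(3, memoryMiB, 1000)) // mig-3g.21gb (wrong label prefix)
	fmt.Println(migProfileName(3, memoryMiB, 1024)) // mig-3g.20gb (matches the resource name)
}
```

With a divisor of 1000 the 20096 MiB instance is named as 21 GB, while 1024 yields the 20 GB name that appears under Capacity and Allocatable.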

klueska commented 3 years ago

This has now been fixed in https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/merge_requests/61 and will be part of the next GFD release. Thanks for reporting.