gpu.memory label for non mig describes size in bytes

NVIDIA / gpu-feature-discovery

GPU plugin to the node feature discovery for Kubernetes

Apache License 2.0

gpu.memory label for non mig describes size in bytes #26

Closed davidLif closed 1 year ago

davidLif commented 2 years ago

Hello,

While testing a GKE node with a 40 GB GPU, I noticed that the nvidia.com/gpu.memory label on the node had a value of "42505273344". According to the README, the label should contain "Memory of the GPU in Mb".

After looking at the code, I see that this happens because, for MIG strategy "None", the value is extracted using nvmlDeviceGetMemoryInfo. According to the docs (https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g2dfeb1db82aa1de91aa6edf941c85ca8), the function "retrieves the amount of used, free, reserved and total memory available on the device, in bytes".
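
To make the mismatch concrete, here is a minimal Python sketch of the unit conversion. It assumes that "Mb" in the README means mebibytes (1024 * 1024 bytes); the helper name `bytes_to_mib` is hypothetical and is not part of GFD.

```python
# Assumption: the README's "Mb" is read as MiB (1024 * 1024 bytes).
BYTES_PER_MIB = 1024 * 1024


def bytes_to_mib(total_bytes: int) -> int:
    """Convert a byte count to whole mebibytes, rounding down."""
    return total_bytes // BYTES_PER_MIB


# Value observed in the nvidia.com/gpu.memory label on the node.
label_value = 42505273344
print(bytes_to_mib(label_value))  # 40536
```

Under that assumption the label would be expected to read 40536 rather than 42505273344, which is consistent with a 40 GB GPU.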

elezar commented 2 years ago

@davidLif thanks for reporting this. Could you confirm the version of GFD that you are using?

davidLif commented 2 years ago

I am using GFD v0.6.1-ubi8. This version was installed by default with gpu-operator v1.11.1.

elezar commented 2 years ago

Thanks for the confirmation. I have created https://gitlab.com/nvidia/kubernetes/gpu-feature-discovery/-/merge_requests/127 to address this issue.

Is this a critical issue from your perspective, or could this wait for the next release, which should be out by the end of September?

davidLif commented 2 years ago

It can wait for the next release.

Does GFD write a label with its version on the node? I am trying to think of an easy way of handling this case for k8s operators that use the nvidia.com/gpu.memory label.
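
One possible way for a consumer of the label to cope without knowing the GFD version is a magnitude heuristic, sketched below in Python. This is only an illustration and is not something GFD provides: the helper `normalize_gpu_memory_label` and the 16 TiB plausibility threshold are assumptions chosen for the example.

```python
BYTES_PER_MIB = 1024 * 1024

# Assumption: no single GPU has more than 16 TiB of memory, so a label
# value above this many MiB is taken to be a byte count instead.
MAX_PLAUSIBLE_MIB = 16 * 1024 * 1024


def normalize_gpu_memory_label(raw: str) -> int:
    """Return the GPU memory in MiB, whether the label holds bytes or MiB."""
    value = int(raw)
    if value > MAX_PLAUSIBLE_MIB:
        # Too large to be a MiB count: treat it as bytes and convert.
        return value // BYTES_PER_MIB
    return value


print(normalize_gpu_memory_label("42505273344"))  # 40536 (label was in bytes)
print(normalize_gpu_memory_label("40536"))        # 40536 (label already in MiB)
```

The heuristic relies on the two units differing by a factor of about a million, so realistic values do not overlap; it would need revisiting if the threshold assumption stopped holding.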

elezar commented 2 years ago

GFD does not generate a label with its version, as far as I am aware.

elezar commented 1 year ago

@davidLif the fix for the issue you are seeing was already released in v0.6.2. Could you confirm that this is the case?

davidLif commented 1 year ago

@elezar The fix is working. Thanks!