NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

DCGM diagnostics in the container with less than 8 GPUs the test fails #90

Open sanghvimanan opened 1 year ago

sanghvimanan commented 1 year ago

Setup: DCGM checks fail for a Google Kubernetes pod that requests fewer than 8 GPUs. The test is executed in the init container.

As part of the preflight health checks, if we run DCGM diagnostics in a container with fewer than 8 GPUs, the test fails with the following error:

| Permissions and OS Blocks | Fail                                          |
| Error                     | File /dev/nvidia7 could not be accessed direc |
|                           | tly: Operation not permitted Check relevant p |
|                           | ermissions, access, and existence of the file |
|                           | ., File /dev/nvidia2 could not be accessed di |
|                           | rectly: Operation not permitted Check relevan |
|                           | t permissions, access, and existence of the f |
|                           | ile., File /dev/nvidia1 could not be accessed |
|                           |  directly: Operation not permitted Check rele |
|                           | vant permissions, access, and existence of th |
|                           | e file., File /dev/nvidia0 could not be acces |
|                           | sed directly: Operation not permitted Check r |
|                           | elevant permissions, access, and existence of |
|                           |  the file., The number of devices NVML return |
|                           | s is different than the number of devices in  |
|                           | /dev. Check for the presence of cgroups, oper |
|                           | ating system blocks, and or unsupported / old |
|                           | er cards                                      |

It seems that the user container has all of /dev/nvidia0, /dev/nvidia1, /dev/nvidia2, /dev/nvidia3, and so on up to /dev/nvidia7 mounted, while NVML only sees 4 GPU devices. This discrepancy between the number of devices in /dev and the number of devices seen by NVML causes the DCGM diagnostics to fail.
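To illustrate the mismatch described here, the following is a minimal Python sketch of that kind of comparison. It is not DCGM's actual check. The helper names `gpu_device_nodes` and `count_mismatch` are hypothetical, and the NVML device count is passed in as a plain number (for example, as obtained from `nvmlDeviceGetCount`) rather than queried, so the sketch runs without a GPU.

```python
import re

# Matches per-GPU device nodes such as "nvidia0" or "nvidia7".
# Control nodes like "nvidiactl" or "nvidia-uvm" are deliberately excluded.
_GPU_NODE = re.compile(r"^nvidia(\d+)$")


def gpu_device_nodes(dev_names):
    """Return the sorted GPU indices found among /dev entry names."""
    indices = []
    for name in dev_names:
        match = _GPU_NODE.match(name)
        if match:
            indices.append(int(match.group(1)))
    return sorted(indices)


def count_mismatch(dev_names, nvml_count):
    """Return (mismatch, indices): mismatch is True when the number of
    /dev/nvidiaN nodes differs from the device count reported by NVML."""
    indices = gpu_device_nodes(dev_names)
    return len(indices) != nvml_count, indices


# Values taken from the report above: 8 device nodes visible, NVML sees 4 GPUs.
# In a real container, dev_names could come from os.listdir("/dev").
dev_names = ["nvidia%d" % i for i in range(8)] + ["nvidiactl", "nvidia-uvm"]
mismatch, indices = count_mismatch(dev_names, nvml_count=4)
print(mismatch, indices)
```

With the values from this report, the sketch flags a mismatch because eight device nodes are visible while only four GPUs are reported.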

dbeer commented 1 year ago

@sanghvimanan I'm working on a fix here. Can you share details on how you created this container?

dbeer commented 1 year ago

This issue has now been fixed and will be released with DCGM 3.2.6.

sanghvimanan commented 1 year ago

Awesome! Thanks, David.