NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Removal of dependencies on cuda v10 #161

Status: Closed (mamccorm closed 3 weeks ago)

mamccorm commented 5 months ago

Hi there,

CUDA v10 has been EOL for some time; however, it appears there are still several references to it in various places in the project.

Are there any plans to remove the cuda v10 dependencies in the pipeline?

bmarchant commented 5 months ago

@mamccorm Yes, there are. The next release of DCGM will remove all references to CUDA v10. Thanks!

mamccorm commented 5 months ago

Appreciate the reply @bmarchant, that's great to hear. Any ballpark timelines for when the next release may be due to land?

bmarchant commented 5 months ago

@mamccorm We are trying to get it released in the next week or two.

nikkon-dev commented 5 months ago

@mamccorm

I need to correct @bmarchant. The next major DCGM release, which is planned for later this year, will remove Cuda10. However, the upcoming release in the DCGM 3.x branch will still provide Cuda10 plugins.

The DCGM policy is to support three major Cuda versions in each release.

Could you provide more details on the issues that you are facing with Cuda10?

mamccorm commented 5 months ago

Thanks @nikkon-dev for the follow-up, and much appreciated.

We're keen to leverage DCGM without a dependency on EOL software. Looks like CUDA v10 was added to the end-of-life section of GitLab ~9 months ago, and the binaries are also not redistributed anymore here.

Also cross-referencing some info from this doc:

nikkon-dev commented 5 months ago

Hi @mamccorm,

I just wanted to clarify that the DCGM package doesn't rely on any Cuda packages. All the necessary components are either linked in or provided by the DCGM package itself and loaded at runtime based on the detected driver. It's important to note that the DCGM 3.x branch has supported drivers from R418 onward, so even though Cuda10 is EOL, we cannot remove it from the package.
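For readers following along, the runtime selection described above can be sketched roughly as follows. This is a minimal illustration, not DCGM's actual loader: the table and function names are hypothetical, and the driver thresholds are assumptions taken from the R418 floor mentioned in this thread plus the minimum driver branches NVIDIA lists for CUDA 11.0 (R450) and CUDA 12.0 (R525).

```python
# Hypothetical sketch of "load the plugin that matches the detected driver".
# None of these names come from DCGM; they are invented for illustration.

# (CUDA major version, assumed minimum driver branch), newest first.
# 418 is the floor mentioned in the thread; 450 and 525 are assumed from
# NVIDIA's CUDA 11.0 and 12.0 minimum driver requirements.
MIN_DRIVER_BRANCH = [
    (12, 525),
    (11, 450),
    (10, 418),
]


def driver_branch(driver_version: str) -> int:
    """Return the branch number of a driver version such as '535.104.05'."""
    return int(driver_version.split(".")[0])


def select_cuda_major(driver_version: str):
    """Return the newest CUDA major version the driver can run, or None."""
    branch = driver_branch(driver_version)
    for cuda_major, min_branch in MIN_DRIVER_BRANCH:
        if branch >= min_branch:
            return cuda_major
    return None  # driver is older than anything in the table
```

Under these assumptions, a host still on an R418 driver can only be matched with the Cuda10 entry, which is consistent with the point that dropping Cuda10 would drop support for those drivers.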

mamccorm commented 5 months ago

Thanks @nikkon-dev. So no cuda10 runtime dep, but the build process requires cuda10/11/12 to enable building of each plugin from source for their respective cuda versions?

The redist packages for cuda10 are no longer published, and outside of debian/ubuntu, there may be no existing pre-built cuda10 package to pull (which is the issue we're facing when there's a build-time dep in DCGM).

In any event, this has been helpful, and we look forward to future releases.

nikkon-dev commented 3 weeks ago

Cuda10 was removed from the OSS builds.