NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

How to run the "memory bandwidth" test using nvvs/dcgmi built from the DCGM source code #67

Open ligeweiwu opened 1 year ago

ligeweiwu commented 1 year ago

Hi, I am building DCGM from source and using nvvs/dcgmi to run the diagnostic tests. I can see all the test plugins, and they are all .so files. But when I try to run the "memory bandwidth" diagnostic, I get an error:

./dcgmi diag -r "memory bandwidth" -g 2
Error: requested test "memory bandwidth" was not found among possible test choices.

In my case, all the plugin .so files are in /username/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11, and none of them is named "memory bandwidth". I also looked at the source code, and I don't think it has a test option named "memory bandwidth". It only has "memtest".

So how can I run the "memory bandwidth" test using the DCGM source code?

By the way, the memtest is OK ("./dcgmi diag -r memtest -g 2" works fine, and I also see the corresponding libMemtest.so in plugins/cuda11, and the source code has the option "memtest").

Thanks.

dbeer commented 1 year ago

Hi Ligeweiwu - the memory bandwidth test is unfortunately not yet releasable as open source. To run the test with open source, you can download a released version of DCGM that matches the open source you're building and copy the plugin libraries to your locally built plugins dir.
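
The copy step described above can be sketched as a small shell helper. This is only an illustration, not an official procedure: the function name copy_missing_plugins is hypothetical, and the source path in the usage comment is an assumption about where a released package installs its plugins, so check it on your system. It copies only the libraries that the local build does not already have, so the locally built plugins are left untouched.

```shell
#!/bin/sh
# Hypothetical helper: copy plugin libraries that exist in a released DCGM
# install but are missing from a locally built plugins directory.
copy_missing_plugins() {
    src_dir=$1   # plugins dir of the released (installed) DCGM
    dst_dir=$2   # plugins dir of the local build
    for lib in "$src_dir"/*.so*; do
        [ -e "$lib" ] || continue            # glob matched nothing
        name=$(basename "$lib")
        if [ ! -e "$dst_dir/$name" ]; then
            cp -P "$lib" "$dst_dir/$name"    # -P keeps symlinks as symlinks
            echo "copied $name"
        fi
    done
}

# Example call. The first path is an assumption about the package layout;
# the second is the build output directory mentioned earlier in this thread.
# copy_missing_plugins /usr/share/nvidia-validation-suite/plugins/cuda11 \
#     /username/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11
```

The released version should match the source version being built, as noted above, since the plugins are loaded by the locally built nvvs.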

ligeweiwu commented 1 year ago

@dbeer Thanks for your reply. I would like to confirm one more thing about how the sources in plugin_src map to tests:

memory <-> -r3 test: GPU Memory
memtest <-> -r4 test: Memory Stress
memory bandwidth <-> no source code; can only be used from the released package

Is that right?

Thanks

dbeer commented 1 year ago

That's correct, although the memory test should run with -r 2 and higher.

ligeweiwu commented 1 year ago

@dbeer Hi dbeer, thanks for your reply. I am building the DCGM source code at version 3.0.4 (commit f6fe5654b780873da528b84cb3d7de10d7abe0d1), but I cannot find a download link for this version. Could you tell me where I can download the corresponding released package? Thanks.

nikkon-dev commented 1 year ago

@ligeweiwu,

All package versions are available in the public CUDA repositories: Deb, Rpm