NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

Support on CUDA12.0 for DCGMPROFTESTER #102

yasirjamal87 closed 1 year ago

yasirjamal87 commented 1 year ago

Does dcgmproftester support CUDA 12.0?

I get the following error when I run it:

yasir@36gpu7:~$ dcgmproftester12 --no-dcgm-validation -t 1004 -d 120
CacheManager Init Failed. Error: -29

When I run the following instead, there is no output:

yasir@36gpu7:~$ !124
dcgmproftester11 --no-dcgm-validation -t 1004 -d 120

nikkon-dev commented 1 year ago

@yasirjamal87,

CUDA 12 is fully supported, so seeing the -29 error here is quite strange. That error code means root privileges are required.

Could you share the dcgmproftester debug logs?
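
Since error -29 points to missing root privileges, one way to avoid hitting it is to decide up front whether the command needs a sudo prefix. The snippet below is only a minimal sketch: build_cmd is a hypothetical helper, not part of DCGM, and it prints the command it would run instead of executing it. The dcgmproftester12 flags are copied from the report above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DCGM): given an effective UID and a
# command line, print the command, prefixed with sudo unless the UID is 0.
build_cmd() {
    uid="$1"
    shift
    if [ "$uid" -eq 0 ]; then
        # Already root: the command can run as-is.
        printf '%s\n' "$*"
    else
        # Not root: the -29 error would be expected, so prefix with sudo.
        printf 'sudo %s\n' "$*"
    fi
}

# Show what would be run for the current user (nothing is executed here).
build_cmd "$(id -u)" dcgmproftester12 --no-dcgm-validation -t 1004 -d 120
```

Passing the UID as an argument, rather than reading it inside the function, keeps the check easy to try out for both the root and non-root case without switching users.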

yasirjamal87 commented 1 year ago

Looks like sudo was needed. Thanks.

yasir@36gpu7:~$ sudo dcgmproftester12 --no-dcgm-validation -d 120
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Skipping CreateDcgmGroups() since DCGM validation is disabled
Worker 0:0[1001]: GrActivity: generated 0.000/0.000, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 1.000 seconds.
Worker 1:0[1001]: GrActivity: generated 0.000/0.000, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 1.000 seconds.
Worker 0:1[1001]: GrActivity: generated 0.000/0.000, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 1.000 seconds.
Worker 1:1[1001]: GrActivity: generated 0.000/0.000, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 1.000 seconds.
Worker 0:0[1001]: GrActivity: generated 0.000/0.008, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 2.000 seconds.
Worker 1:0[1001]: GrActivity: generated 0.000/0.008, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 2.000 seconds.
Worker 0:1[1001]: GrActivity: generated 0.000/0.008, dcgm M:{ GPU: 0.000, GI: 0.000, CI: 0.000 } at 2.000 seconds.