NVIDIA/DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0
355 stars, 49 forks
issues
#include "newrandom.h"
#129
jmikedupont2
opened
8 months ago
1
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 changes
#128
jmikedupont2
closed
8 months ago
5
Old data are copied into new data in dcgmGroupSamples.GetAllSinceLastCall
#127
optyang
opened
8 months ago
0
device memory ECC Errors can not take effect
#126
xiaohai1234
closed
8 months ago
2
H100 GPU docker container exit 137
#125
nusaputra137
opened
8 months ago
0
How do I inject errors into the GPU hardware?
#124
eafayao
closed
8 months ago
1
How to use profiling from python bindings?
#123
optyang
closed
8 months ago
2
[Question]: how to detect GPUs with low compute performance
#122
dmonakhov
closed
4 months ago
9
How do you install datacenter-gpu-manager-config package?
#121
nguoido
opened
8 months ago
1
Couldn't load a definition for ShutdownPlugin in plugin libSoftware.so
#120
nguoido
opened
8 months ago
19
dcgm nvlink metrics not available on dcgm 3.1.3
#119
luccabb
opened
8 months ago
4
Missing EUD Package Link on Guide
#118
canerozer
closed
2 days ago
5
Does datacenter-gpu-manager have an ARM architecture installation package?
#117
xigang
closed
8 months ago
2
DCGM 3.2.6
#116
nikkon-dev
closed
9 months ago
0
makefile for test7 missing
#115
optyang
opened
9 months ago
2
NVSwitch power
#114
Mutinifni
opened
9 months ago
2
Support for Mariner (Azure Linux)
#113
LiquidPT
closed
8 months ago
2
how can I clear stale XID error
#112
zdyang
opened
9 months ago
0
Multi host fixes
#111
dmonakhov
closed
8 months ago
4
Added a reference implementation of DCGM + NCCL multi-node testing
#110
nikkon-dev
closed
9 months ago
0
question about : reference implementation of DCGM + NCCL multi-node testing.
#109
dmonakhov
closed
9 months ago
2
I stopped nvidia-dcgm.service and disabled auto-restart. I still see dcgmi CLI working. Will dcgmi CLI not use nvidia-dcgm.service?
#108
vinayburugu
closed
10 months ago
5
python dcgm_structs.DcgmJSONEncoder does not recursively follow arrays
#107
blackwer
opened
10 months ago
0
Can the dcgm exporter be run in two containers on a physical machine together with other programs that call the dcgm api?
#106
xcode03
closed
10 months ago
3
DCGM 3.2.5
#105
nikkon-dev
closed
10 months ago
0
Running a new dcgm wrapper by using the dcgmInit()
#104
vinayburugu
closed
10 months ago
6
How is NVLINK information obtained?
#103
irvingans
opened
10 months ago
2
Support on CUDA12.0 for DCGMPROFTESTER
#102
yasirjamal87
closed
10 months ago
2
[feature request] add support for a config file which will be applied to all SKUs
#101
dmonakhov
opened
11 months ago
0
Regarding questions in dcgm official website, can I ask here?
#100
irvingans
opened
11 months ago
0
Why is there no corresponding .cpp file for dcgm_agent.h?
#99
irvingans
opened
11 months ago
4
sm_stress test is missing from dcgm-3.2.3
#98
bsteinb
opened
11 months ago
1
Support for A30/40 and L30 GPUs?
#97
SamKG
opened
11 months ago
1
What are the detailed meanings of some test items in DCGM Diag.cpp?
#96
irvingans
opened
11 months ago
4
H100 MIG Instances failed to destroy after usage
#95
Sipondo
closed
11 months ago
7
DCGM Diag Command on Mig Instance
#94
yasirjamal87
closed
11 months ago
2
Not able to get DCGM stats for MIG partitions
#93
berhane
closed
11 months ago
4
How is NVIDIA DCGM Documentation built from sources?
#92
hwhsu1231
closed
11 months ago
1
'dcgmi discovery -c' not returning MIG instances for H100
#91
Sipondo
closed
11 months ago
2
DCGM diagnostics test fails in a container with fewer than 8 GPUs
#90
sanghvimanan
opened
12 months ago
3
Bundled CUDA libraries
#89
zzzoom
opened
1 year ago
0
PROF_PCIE_[T|R]X_BYTES is N/A in A100 mig
#88
luckqk
closed
1 year ago
2
How to monitor specific GPU
#87
LukeLIN-web
closed
1 year ago
2
Support for reporting FP8 and Transformer Engine usage on H100 GPUs
#86
hassanbabaie
opened
1 year ago
4
Docker image build fails with unprivileged user namespace isolation
#85
lahwaacz
opened
1 year ago
3
dcgm diag -r pulse_test, no parameter freq0
#84
glyangmail
opened
1 year ago
2
python binding process smutilization always returns 2147483632
#83
ytaoeer
opened
1 year ago
6
Is there a way to disallow sharing of MIG devices?
#82
starry91
opened
1 year ago
1
DCGM 3.1.8
#81
nikkon-dev
closed
1 year ago
0
DCGM_FI_DEV_GPU_UTIL with MIG devices
#80
devnjw
closed
1 year ago
4