NVIDIA/DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0
352 stars, 48 forks
issues
Error: Health watches not enabled. Please enable watches
#176
corrtia
opened
1 week ago
0
Cannot Retrieve GPU PIDs from DCGM Metrics
#175
doronkg
opened
1 week ago
0
Hello, why does the /var/log/nv-hostengine.log file have many ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches() entries?
#174
13416157913
opened
1 month ago
0
Cannot get "nvlink_flit_crc_error_count_total(409)" and "nvlink_data_crc_error_count_total(419)" in H800 System
#173
zdyang
opened
1 month ago
2
DCGM 3.3.6
#172
dshaiknvidia
closed
1 month ago
0
Incorrect values reported by dcgm stats
#171
MarcelFerrari
closed
1 month ago
3
[Question] Amount of lag expected for metrics
#170
jaywonchung
opened
1 month ago
2
[Question] Understanding multiplexing of profiling counters
#169
jaywonchung
closed
1 month ago
2
Error setting up dcgm with startHostEngine mode from a golang based container
#168
haardm
closed
1 month ago
1
Running diagnostics causes the Memory Usage of the other GPUs to increase
#167
BetaZYN
opened
1 month ago
0
Facing unknown docker flag --compress while using build.sh
#166
premalathak12
opened
2 months ago
0
Facing error in running sdk_sample DCGMReader.py
#165
premalathak12
opened
2 months ago
2
Metrics around capturing gpu FLOPS
#164
krishh85
opened
2 months ago
4
Memory usage by dcgm during runtime diagnostics
#163
BetaZYN
opened
2 months ago
2
a question about dcgm policy listening for xid
#162
BetaZYN
opened
2 months ago
2
Removal of dependencies on cuda v10
#161
mamccorm
opened
2 months ago
7
Build output does not include libnvperf_dcgm_host.so
#160
pintohutch
closed
3 months ago
13
log spam of [[NvSwitch]] Not attached to NvSwitches. Aborting in cuda-dcgm-3.1.3.1 via Bright Cluster, RHEL 8
#159
SomePersonSomeWhereInTheWorld
opened
3 months ago
8
dcgm diagnostic command exits with status 226
#158
rajeshvenkata
opened
3 months ago
1
`power_usage` vs. `power_usage_instant`?
#157
jaywonchung
closed
1 month ago
1
How to get SM Occupancy in real-time except dcgm in RTX Series?
#156
taekyounghan
closed
3 months ago
1
dcgm-exporter crashes hostengine.
#155
krono
opened
4 months ago
23
DCGM 3.3.5
#154
nikkon-dev
closed
4 months ago
0
AppArmor profile for DCGM
#153
pintohutch
opened
4 months ago
3
DCGM_FI_PROF_SM_ACTIVE is showing a value higher than 100% for MIG devices
#152
marceloamaral
opened
4 months ago
5
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error
#151
marceloamaral
closed
4 months ago
4
DCGM cannot listen on ipv6 address
#150
Pingan2017
opened
5 months ago
0
No NVLINK activity on DGX-A100 320GB
#149
itzsimpl
opened
5 months ago
6
diag --configfile option is silently ignored if --parameters option is present
#148
dmonakhov
opened
5 months ago
0
DCGM 3.3.3
#147
nikkon-dev
closed
5 months ago
0
Does DCGM support creating groups of GPUs from different hosts?
#146
deferen2
opened
5 months ago
1
Does DCGM support profiling metrics for A10?
#144
xuchenhui-5
opened
6 months ago
9
When I run diagnostics, the two GPUs in the group both get failed results.
#145
BetaZYN
opened
6 months ago
2
common/DCGMStringHelpers: use memccpy for performance & error handling
#142
ecbadeaux
opened
6 months ago
1
Errors in nv-hostengine log
#141
itzsimpl
opened
6 months ago
7
Previous profiling results are still stored in dcgmGroup.samples.GetAllSinceLastCall
#140
optyang
opened
6 months ago
0
dcgmi diag multiple tests skipped
#139
disjustin
closed
5 months ago
4
DCGM_FI_PROF_GR_ENGINE_ACTIVE and MIG
#138
neggert
opened
7 months ago
10
DCGM 3.3.1
#137
nikkon-dev
closed
7 months ago
0
dcgm diag pcie test hangs indefinitely for H100 80GB HBM3
#136
disjustin
closed
6 months ago
1
@nguoido,
#135
disjustin
closed
7 months ago
0
New segmentation fault from version v3.3.0
#134
hanwen-pcluste
opened
7 months ago
4
Can DCGM obtain GPU information from another host?
#133
jxh314
closed
2 months ago
2
How to get the module profile loaded?
#132
jxh314
opened
8 months ago
7
Feature/local cmake
#131
jmikedupont2
closed
8 months ago
0
Feature/moderncpp
#130
jmikedupont2
opened
8 months ago
0
#include "newrandom.h"
#129
jmikedupont2
opened
8 months ago
1
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 changes
#128
jmikedupont2
closed
8 months ago
5
Old data are copied into new data in dcgmGroupSamples.GetAllSinceLastCall
#127
optyang
opened
8 months ago
0
device memory ECC Errors can not take effect
#126
xiaohai1234
closed
8 months ago
2