NVIDIA/DCGM
NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0
393 stars
50 forks
issues
Fault injection in my pytorch training job
#192
hjx620
opened
3 weeks ago
1
Running separate DCGM on Kubernetes cluster
#191
ysk24ok
opened
1 month ago
5
About dcgmError_enum 94 - DCGM_FR_EUD_NON_ZERO_EXIT_CODE. Anyone can told me what is reason to this case?
#190
Pig255
closed
3 weeks ago
3
DCGM 3.3.8
#189
dshaiknvidia
closed
1 month ago
2
Prebuilt binaries in the source tree not compiled from source
#188
xnox
opened
1 month ago
0
c++: add missing includes
#187
xnox
opened
1 month ago
0
the issue of watching and querying metrics
#186
BetaZYN
opened
1 month ago
0
dcgmi policy about Reset GPU Not effective
#185
mr-j-1992
opened
1 month ago
0
DCGM Policy Violation Notification channel reporting too many PCIe violations on P5 instance type from AWS EC2 (H100)
#184
haardm
opened
2 months ago
1
H800 can not open profile feature? Help...
#183
cc8476
opened
2 months ago
1
Getting Utilization metrics
#182
apaz-cli
opened
2 months ago
0
Support for Amazon Linux 2023 (AL2023)
#181
mbacchi
opened
2 months ago
0
DCGM 3.3.7
#180
dshaiknvidia
closed
2 months ago
0
Questions about enabling remote telemetry permissions
#179
MarcelFerrari
opened
2 months ago
0
Add support for IPv6
#178
johnsushant
opened
3 months ago
9
How to get current memory usage in bytes from dmon?
#177
johnsushant
closed
3 months ago
1
Error: Health watches not enabled. Please enable watches
#176
corrtia
opened
3 months ago
0
Cannot Retrieve GPU PIDs from DCGM Metrics
#175
doronkg
opened
3 months ago
0
Hello, why /var/log/nv-hostengine.log file had many ERROR [5231:5273] [[NvSwitch]] ReadNvSwitchStatusAllSwitches()
#174
13416157913
opened
4 months ago
0
Cannot get "nvlink_flit_crc_error_count_total(409)" and "nvlink_data_crc_error_count_total(419)" in H800 System
#173
zdyang
opened
4 months ago
2
DCGM 3.3.6
#172
dshaiknvidia
closed
5 months ago
0
Incorrect values reported by dcgm stats
#171
MarcelFerrari
closed
4 months ago
3
[Question] Amount of lag expected for metrics
#170
jaywonchung
opened
5 months ago
2
[Question] Understanding multiplexing of profiling counters
#169
jaywonchung
closed
5 months ago
2
Error setting up dcgm with startHostEngine mode from a golang based container
#168
haardm
closed
5 months ago
1
Running diagnostics causes the Memory Usage of the other GPUs to increase
#167
BetaZYN
opened
5 months ago
0
Facing unknown docker flag --compress while using build.sh
#166
premalathak12
opened
5 months ago
0
Facing error in running sdk_sample DCGMReader.py
#165
premalathak12
opened
5 months ago
2
Metrics around capturing gpu FLOPS
#164
krishh85
opened
5 months ago
4
Memory usage by dcgm during runtime diagnostics
#163
BetaZYN
opened
5 months ago
2
a question about dcgm policy listening for xid
#162
BetaZYN
opened
5 months ago
2
Removal of dependencies on cuda v10
#161
mamccorm
closed
1 month ago
8
Build output does not include libnvperf_dcgm_host.so
#160
pintohutch
closed
6 months ago
13
log spam of [[NvSwitch]] Not attached to NvSwitches. Aborting in cuda-dcgm-3.1.3.1 via Bright Cluster, RHEL 8
#159
SomePersonSomeWhereInTheWorld
opened
6 months ago
8
dcgm diagnostic command exits with status 226
#158
rajeshvenkata
opened
6 months ago
1
`power_usage` vs. `power_usage_instant`?
#157
jaywonchung
closed
5 months ago
1
How to get SM Occupancy in real-time except dcgm in RTX Series?
#156
taekyounghan
closed
6 months ago
1
dcgm-exporter crashes hostengine.
#155
krono
opened
7 months ago
28
DCGM 3.3.5
#154
nikkon-dev
closed
7 months ago
0
AppArmor profile for DCGM
#153
pintohutch
opened
7 months ago
3
DCGM_FI_PROF_SM_ACTIVE is showing a value higher than 100% for MIG devices
#152
marceloamaral
opened
8 months ago
5
Error setting watches. Result: The third-party Profiling module returned an unrecoverable error
#151
marceloamaral
closed
8 months ago
4
DCGM cannot listen on ipv6 address
#150
Pingan2017
opened
8 months ago
0
No NVLINK activity on DGX-A100 320GB
#149
itzsimpl
opened
8 months ago
6
diag --configfile option is silently ignored if --parameters options is present
#148
dmonakhov
opened
8 months ago
0
DCGM 3.3.3
#147
nikkon-dev
closed
8 months ago
0
Does DCGM support creating groups of GPUs from different hosts?
#146
deferen2
opened
9 months ago
1
Does DCGM support profiling metrics for A10 ?
#144
xuchenhui-5
opened
9 months ago
9
When I run diagnostics, the two GPUs in the group both get failed results.
#145
BetaZYN
opened
9 months ago
2
common/DCGMStringHelpers: use memccpy for performance & error handling
#142
ecbadeaux
opened
9 months ago
1