NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0
355 stars 49 forks source link

Errors in nv-hostengine log #141

Open itzsimpl opened 7 months ago

itzsimpl commented 7 months ago

We use dcdm-exporter as described in https://docs.nvidia.com/datacenter/cloud-native/gpu-telemetry/latest/dcgm-exporter.html#connecting-to-an-existing-dcgm-agent. The nv-hostengine is version 3.1.8, the dcgm-exporter container is nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04.

We use a custom metrics file with the following metrics:

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
#DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# PCIe,,
DCGM_FI_PROF_PCIE_TX_BYTES,      gauge, The rate of data including both protocol headers and payload transmitted over PCIe bus (in B/s).
DCGM_FI_PROF_PCIE_RX_BYTES,      gauge, The rate of data including both protocol headers and payload received over PCIe bus (in B/s).

# NVLink,,
DCGM_FI_PROF_NVLINK_TX_BYTES,                    gauge, The rate of data not including protocol headers transmitted over NVLink (in B/s).
DCGM_FI_PROF_NVLINK_RX_BYTES,                    gauge, The rate of data not including protocol headers received over NVLink (in B/s).

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).

On a DGX-H100 system, with DGXOS6 installed and latest FW updates I have noticed the following errors in the nv-hostengine logs.

2023-12-20 06:44:42.219 ERROR [8826:13917] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
2023-12-20 06:44:42.219 ERROR [8826:13917] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]

Any ideas what these are?

In addition, if we enable DCGM_FI_DEV_XID_ERRORS then the logs get filled quite quickly by the following ERROR:

2023-12-19 13:58:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.954 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.955 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.956 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.956 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:17.956 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.954 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.956 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.957 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.957 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.957 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:58:47.958 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.955 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.956 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.956 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:17.957 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.951 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.952 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 2 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 3 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.953 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 4 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.954 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 5 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.954 ERROR [8826:8828] GetLatestSample returned No data is available for entityId 6 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 13:59:47.955 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 7 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
2023-12-19 14:00:17.953 ERROR [8826:8827] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4524] [DcgmCacheManager::GetMultipleLatestSamples]
nikkon-dev commented 7 months ago

@itzsimpl,

Can you please check the dmesg messages and confirm if you are using the GSP driver?

itzsimpl commented 7 months ago

@nikkon-dev

The installed drivers are 535.129.03, based on https://download.nvidia.com/XFree86/Linux-x86_64/535.129.03/README/gsp.html,

# nvidia-smi -q | grep -i gsp
    GSP Firmware Version                  : 535.129.03

but

# cat /proc/driver/nvidia/gpus/0000\:1b\:00.0/information 
Model:           NVIDIA H100 80GB HBM3
IRQ:             18
GPU UUID:        GPU-875f3ca0-9de4-e78c-9cea-5140b030b627
Video BIOS:      96.00.89.00.01
Bus Type:        PCIe
DMA Size:        52 bits
DMA Mask:        0xfffffffffffff
Bus Location:    0000:1b:00.0
Device Minor:    0
GPU Firmware:    535.129.03
GPU Excluded:    No

I don't see GSP mentioned in dmesg.

Could you provide more details on to what to look for in dmesg?

jfolz commented 6 months ago

We see the same errors on DGX-H100. Same nv-hostengine, driver, (GSP) firmware, etc.

ERROR [597577:597597] Got nvml st 10 from nvmlGpmSampleGet(). [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmGpmManager.cpp:147] [DcgmGpmManagerEntity::MaybeFetchNewSample]
ERROR [597577:597597] Got unexpected return -8 from m_gpmManager.GetLatestSample [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10473] [DcgmCacheManager::BufferOrCacheLatestGpuValue]

Not sure if this is relevant, but here's the metrics collected by dcgm-exporter.

DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes.
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
DCGM_FI_PROF_PCIE_TX_BYTES,      counter, The number of bytes of active pcie tx data including both header and payload.
DCGM_FI_PROF_PCIE_RX_BYTES,      counter, The number of bytes of active pcie rx data including both header and payload.

@nikkon-dev please let us know if you need additional information, and if this is a DCGM or dcgm-exporter issue.

itzsimpl commented 5 months ago

@nikkon-dev Any news on this?

Upgraded to dcgm 3.3.3 and dcgm-exporter 3.3.3-3.2.0

# dcgmi -v
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Hostengine build info:
Version : 3.3.3
Build ID : 11
Build Date : 2024-01-18
Build Type : Release
Commit ID : c3aed64480553cd5ba1a32d165c7967936446631
Branch Name : rel_dcgm_3_3
CPU Arch : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
CRC : 39ff792f5514bdc5af56ce313a8fa90e

Still seeing the errors, but found also

2024-02-01 13:34:36.214 ERROR [8878:8879] [[SysMon]] Couldn't open CPU Vendor info file file '/sys/devices/soc0/soc_id' for reading: '@֟�O^?' [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/sysmon/DcgmSystemMonitor.cpp:261] [DcgmSystemMonitor::ReadCpuVendorAndModel]
2024-02-01 13:34:36.215 ERROR [8878:8879] [[SysMon]] A runtime exception occured when creating module. Ex: Incompatible hardware vendor for sysmon. [/workspaces/dcgm-rel_dcgm_3_3-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-01 13:34:36.215 ERROR [8878:8879] Failed to load module 9 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1831] [DcgmHostEngineHandler::LoadModule]

The system has the latest DGXOS 6.1, latest fw 1.1.3, and all ubuntu packages upgrades applied; the driver is 535.154.05.

pintohutch commented 3 months ago

FWIW I'm seeing these same messages using libraries from the nvcr.io/nvidia/cloud-native/dcgm:3.3.3-1-ubuntu22.04 container.

time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR
time="2024-04-04T20:54:46Z" level=info msg="GetLatestSample returned No data is available for entityId 1 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4528] [DcgmCacheManager::GetMultipleLatestSamples]" dcgm_level=ERROR

For the GPU firmare:

cat /proc/driver/nvidia/gpus/0000:00:03.0/information
Model:       NVIDIA L4
IRQ:         11
GPU UUID:    GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3
Video BIOS:      95.04.29.00.07
Bus Type:    PCI
DMA Size:    47 bits
DMA Mask:    0x7fffffffffff
Bus Location:    0000:00:03.0
Device Minor:    0
GPU Firmware:    535.129.03
GPU Excluded:    No
pintohutch commented 3 months ago

Though I want to point out, I'm deploying dcgm-exporter along with the DCGM libraries in "embedded mode" and I can see it exposing 0-value metrics for Field 230 (DCGM_FI_DEV_XID_ERRORS):

# HELP DCGM_FI_DEV_XID_ERRORS Value of the last XID error encountered.
# TYPE DCGM_FI_DEV_XID_ERRORS gauge
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="0",UUID="GPU-7109a529-1f03-bf39-329f-0c4b56c7e6e3",device="nvidia0",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-55dd4b885c-4tgcq"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0
DCGM_FI_DEV_XID_ERRORS{gpu="1",UUID="GPU-68abb810-d9e9-268e-868a-8ad9599230b5",device="nvidia1",modelName="NVIDIA L4",Hostname="gke-danny-gpu-pool-1-ad82d1c8-trvg",DCGM_FI_DEV_ECC_CURRENT="1",container="dcgm-loadtest",namespace="default",pod="dcgm-loadtest-7475455d7-9qh22"} 0

So I dunno if dcgm-exporter is elegantly handling/defaulting here in the case of errors, or if its reporting an incorrect metric because of real errors from DCGM.

PrakChandra commented 1 month ago

I am also facing the same issue where the logs are in error state.

When I change the tag to latest for the dcgm image nvcr.io/nvidia/k8s/dcgm-exporter:latest , I see the following logs

time="2024-05-24T08:48:21Z" level=info msg="Starting dcgm-exporter" time="2024-05-24T08:48:21Z" level=info msg="DCGM successfully initialized!" time="2024-05-24T08:48:21Z" level=info msg="Collecting DCP Metrics" time="2024-05-24T08:48:21Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv" time="2024-05-24T08:48:21Z" level=info msg="Kubernetes metrics collection enabled!" time="2024-05-24T08:48:21Z" level=info msg="Pipeline starting" time="2024-05-24T08:48:21Z" level=info msg="Starting webserver"```

However, I am not getting the metrics on Grafana. I can see the nv-hostengine logs which do not look good


2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2

The .csv file is as follows

```root@dcgm-exporter-28jbg:/etc/dcgm-exporter# cat default-counters.csv
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed