Open nvvfedorov opened 3 months ago
@jacksonyi0, more info is needed to find the cause. Based on shared log messages, I suspect that the container you're running doesn't have enough permissions. Try running the container in privileged mode. Since it needs to access GPU data, it requires root-level privileges.
@nvvfedorov I am not getting the desired result here. I have installed DCGM and nv-hostengine on my GPU machine.
```
time="2024-05-24T04:53:22Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T04:53:22Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T04:53:22Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-24T04:53:22Z" level=info msg="Pipeline starting"
time="2024-05-24T04:53:22Z" level=info msg="Starting webserver"
```
```
[root@ip-10-20-61-45 dcgm]# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Loaded                                           |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
```
```
[root@ip-10-20-61-45 etc]# nv-hostengine -f host.log --log-level debug
Host engine already running with pid 1240135
```
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    27W /  70W |  11070MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     33357      C   nvidia-cuda-mps-server             23MiB |
|    0   N/A  N/A    282750    M+C   /usr/bin/python3                 1629MiB |
|    0   N/A  N/A    282754    M+C   /usr/bin/python3                 1629MiB |
|    0   N/A  N/A    886169    M+C   /usr/bin/python3                 1235MiB |
|    0   N/A  N/A    886170    M+C   /usr/bin/python3                 1237MiB |
|    0   N/A  N/A   1000687    M+C   /usr/bin/python3                 1351MiB |
|    0   N/A  N/A   1000688    M+C   /usr/bin/python3                 1349MiB |
|    0   N/A  N/A   1232400    M+C   /usr/bin/python3                 2613MiB |
```
```
[root@ip-10-20-61-45 etc]# dcgmi dmon -e 1004
#Entity   TENSO
ID
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
```
Could you please suggest what is wrong here? Why am I not able to get the profiling metrics?
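For anyone repeating this check, the table printed by `dcgmi modules -l` can be reduced to a name-to-state map with a few lines of Python. This is only a sketch: `parse_dcgmi_modules` and the embedded sample are illustrative names, not part of DCGM, and the row pattern assumes the three-column layout shown in this thread. A `Not loaded` state may simply mean that nothing has requested the module yet, so treat it as a hint rather than a diagnosis.

```python
import re

# Data rows look like "| 8 | Profiling | Not loaded |".
# Header and border rows do not start with a numeric module ID, so they are skipped.
ROW_RE = re.compile(r"^\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|$")


def parse_dcgmi_modules(text):
    """Return {module name: state} parsed from `dcgmi modules -l` output."""
    states = {}
    for line in text.splitlines():
        match = ROW_RE.match(line.strip())
        if match:
            states[match.group(2)] = match.group(3)
    return states


# Shortened copy of the table posted above (illustrative sample only).
SAMPLE = """\
| Module ID | Name | State |
| 0 | Core | Loaded |
| 7 | Diag | Loaded |
| 8 | Profiling | Not loaded |
"""

modules = parse_dcgmi_modules(SAMPLE)
print("Profiling module state:", modules.get("Profiling", "unknown"))
```

The same function should work on the full, aligned table, because the pattern tolerates any amount of padding inside each cell.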
@PrakChandra, please share the DCGM logs, which you can find here: `/var/log/nvidia-dcgm/*.log`.
@nvvfedorov I do not see the nvidia-dcgm folder in /var/log in my container.
An update: when I change the tag of the DCGM image to latest (nvcr.io/nvidia/k8s/dcgm-exporter:latest), I see the following logs:
```
time="2024-05-24T08:48:21Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T08:48:21Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T08:48:21Z" level=info msg="Collecting DCP Metrics"
time="2024-05-24T08:48:21Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-05-24T08:48:21Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-24T08:48:21Z" level=info msg="Pipeline starting"
time="2024-05-24T08:48:21Z" level=info msg="Starting webserver"
```
However, I am not getting the metrics in Grafana. The nv-hostengine logs do not look good:
```
2024-05-27 06:51:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
```
Am I missing something?
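The nv-hostengine errors above repeat every 30 seconds for the same two field IDs, so it can help to collapse them into counts before reading further. A minimal sketch, assuming the `GetLatestSample returned ... fieldId N` wording quoted in this thread; `summarize_field_errors` is a hypothetical helper, and the sample lines are shortened copies of the log with the source-location suffix trimmed.

```python
import re
from collections import Counter

# Matches the "GetLatestSample returned <reason> ... fieldId <n>" lines quoted above.
LINE_RE = re.compile(
    r"GetLatestSample returned (?P<reason>.+?) for entityId \d+ "
    r"groupId \d+ fieldId (?P<field>\d+)"
)


def summarize_field_errors(log_text):
    """Count (fieldId, reason) pairs found in nv-hostengine log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            counts[(int(match.group("field")), match.group("reason"))] += 1
    return counts


# Shortened copies of lines from the log above (source locations trimmed).
SAMPLE_LOG = """\
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449
2024-05-27 06:52:21.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449
"""

summary = summarize_field_errors(SAMPLE_LOG)
for (field_id, reason), count in sorted(summary.items()):
    print(f"fieldId {field_id}: {reason} (x{count})")
```

Grouping by field ID makes it easier to see that only fields 230 and 449 are affected in this excerpt, which narrows down which counters to look up.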
Also the .csv file shows output like this
```
# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message
# Clocks,,
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).
# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
# Power,,
DCGM_FI_DEV_POWER_USAGE, gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).
# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT, counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT, counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.
# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL, gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL, gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS, gauge, Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION, counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION, counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION, counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).
# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.
# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE, counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE, counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.
# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL, counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status
# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE, gauge, Whether remapping of rows has failed
```
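The counters file can also be checked mechanically for profiling fields. The sketch below assumes that DCGM profiling (DCP) fields are the ones whose names start with `DCGM_FI_PROF_`, and it follows the file's own convention that lines beginning with `#` are comments. The helper names `enabled_fields` and `profiling_fields` are hypothetical, and the sample is a short excerpt of the file shown above.

```python
def enabled_fields(csv_text):
    """Return the DCGM field names on non-comment lines of a counters CSV."""
    fields = []
    for raw in csv_text.splitlines():
        line = raw.strip()
        # Blank lines and lines starting with '#' are not active counters.
        if not line or line.startswith("#"):
            continue
        # The field name is the first comma-separated column.
        fields.append(line.split(",", 1)[0].strip())
    return fields


def profiling_fields(fields):
    """Keep only names with the assumed profiling prefix DCGM_FI_PROF_."""
    return [name for name in fields if name.startswith("DCGM_FI_PROF_")]


# Excerpt of the counters file shown above (illustrative sample only).
SAMPLE_CSV = """\
# DCGM FIELD, Prometheus metric type, help message
DCGM_FI_DEV_SM_CLOCK, gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_DEC_UTIL , gauge, Decoder utilization (in %).
# DCGM_FI_DEV_POWER_VIOLATION, counter, Throttling duration due to power constraints (in us).
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL, counter, Total number of NVLink bandwidth counters for all lanes
"""

active = enabled_fields(SAMPLE_CSV)
print("enabled:", active)
print("profiling fields enabled:", profiling_fields(active))
```

Run against the full file posted here, this should report no enabled `DCGM_FI_PROF_` entries, which would be consistent with profiling metrics being absent from the exporter output whether or not the GPU supports them.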
Try 'readlink --help' for more information.

After entering the container through `docker run -ti --entrypoint=/bin/sh --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04`, running `bash /usr/local/dcgm/dcgm-exporter-entrypoint.sh` still reports an error:

```
readlink: missing operand
Try 'readlink --help' for more information
```

The error `runtime/cgo: pthread_create failed: Operation not permitted` is reported by the `/usr/bin/dcgm-exporter` command:

```
SIGABRT: abort
PC=0x7f33397539fc m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: g 0: unknown pc 0x7f33397539fc
stack: frame={sp:0x7ffdbe6fa820, fp:0x0} stack=[0x7ffdbdefbda0,0x7ffdbe6fadb0)
0x00007ffdbe6fa720: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa730: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa740: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa750: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa760: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa770: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa780: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa790: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7a0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7b0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7c0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7d0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7e0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa7f0: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa800: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa810: 0x0000000000000000 0x00007f33397539ee
0x00007ffdbe6fa820: <0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa830: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa840: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa850: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa860: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa870: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa880: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa890: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa8a0: 0x0000000000000000 0xa8d8867e7227a900
0x00007ffdbe6fa8b0: 0x00007f33396ba740 0x0000000000000006
0x00007ffdbe6fa8c0: 0x0000000001d0e4f7 0x00007ffdbe6fabf0
0x00007ffdbe6fa8d0: 0x0000000002992bc0 0x00007f33396ff476
0x00007ffdbe6fa8e0: 0x00007f33398d8e90 0x00007f33396e57f3
0x00007ffdbe6fa8f0: 0x0000000000000020 0x0000000000000000
0x00007ffdbe6fa900: 0x0000000000000000 0x0000000000000000
0x00007ffdbe6fa910: 0x0000000000000000 0x0000000000000000
runtime: g 0: unknown pc 0x7f33397539fc
```

What is causing this problem? Please help.
Originally posted by @jacksonyi0 in https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-2122663789