NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0
843 stars 151 forks

Hello, I use docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 to start the container and get the error message readlink: missing operand #327

Open nvvfedorov opened 3 months ago

nvvfedorov commented 3 months ago
Hello, I use docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04 to start the container and get this error message:

readlink: missing operand
Try 'readlink --help' for more information.

Entering the container through docker run -ti --entrypoint=/bin/sh --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04 and running bash /usr/local/dcgm/dcgm-exporter-entrypoint.sh still reports the same error:

readlink: missing operand
Try 'readlink --help' for more information.

Running the /usr/bin/dcgm-exporter command reports a different error:

runtime/cgo: pthread_create failed: Operation not permitted
SIGABRT: abort
PC=0x7f33397539fc m=0 sigcode=18446744073709551610

goroutine 0 [idle]: runtime: g 0: unknown pc 0x7f33397539fc stack: frame={sp:0x7ffdbe6fa820, fp:0x0} stack=[0x7ffdbdefbda0,0x7ffdbe6fadb0) 0x00007ffdbe6fa720: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa730: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa740: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa750: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa760: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa770: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa780: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa790: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa7a0: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa7b0: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa7c0: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa7d0: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa7e0: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa7f0: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa800: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa810: 0x0000000000000000 0x00007f33397539ee 0x00007ffdbe6fa820: <0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa830: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa840: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa850: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa860: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa870: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa880: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa890: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa8a0: 0x0000000000000000 0xa8d8867e7227a900 0x00007ffdbe6fa8b0: 0x00007f33396ba740 0x0000000000000006 0x00007ffdbe6fa8c0: 0x0000000001d0e4f7 0x00007ffdbe6fabf0 0x00007ffdbe6fa8d0: 0x0000000002992bc0 0x00007f33396ff476 0x00007ffdbe6fa8e0: 0x00007f33398d8e90 0x00007f33396e57f3 0x00007ffdbe6fa8f0: 0x0000000000000020 0x0000000000000000 0x00007ffdbe6fa900: 0x0000000000000000 0x0000000000000000 0x00007ffdbe6fa910: 0x0000000000000000 0x0000000000000000 
runtime: g 0: unknown pc 0x7f33397539fc

What is causing this problem? Please help.

Originally posted by @jacksonyi0 in https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-2122663789

nvvfedorov commented 3 months ago

@jacksonyi0, more info is needed to find the cause. Based on shared log messages, I suspect that the container you're running doesn't have enough permissions. Try running the container in privileged mode. Since it needs to access GPU data, it requires root-level privileges.
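The suggestion above could be tried along the following lines. This is only a sketch: the image tag is copied from the original report, `--privileged` is Docker's standard flag for granting extended privileges, and nothing in this thread confirms that it resolves the readlink or pthread_create errors. The `IMAGE` and `CMD` variable names are illustrative.

```shell
# Image tag copied from the original report in this thread.
IMAGE="nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04"

# Same command as in the report, with --privileged added as suggested above.
CMD="docker run -d --gpus all --privileged --rm -p 9400:9400 $IMAGE"

# Print the command for review; to actually start the container: eval "$CMD"
echo "$CMD"
```

Privileged mode grants the container broad access to the host, so it is best treated as a diagnostic step rather than a permanent setting.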

PrakChandra commented 3 months ago

@nvvfedorov I am not getting the desired result here. I have installed DCGM and nv-hostengine on my GPU machine.


time="2024-05-24T04:53:22Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T04:53:22Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T04:53:22Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-24T04:53:22Z" level=info msg="Pipeline starting"
time="2024-05-24T04:53:22Z" level=info msg="Starting webserver"

```
[root@ip-10-20-61-45 dcgm]# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Loaded                                           |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
```

```
[root@ip-10-20-61-45 etc]# nv-hostengine -f host.log --log-level debug
Host engine already running with pid 1240135
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.105.01   Driver Version: 515.105.01   CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   27C    P0    27W /  70W |  11070MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     33357      C   nvidia-cuda-mps-server             23MiB |
|    0   N/A  N/A    282750    M+C   /usr/bin/python3                 1629MiB |
|    0   N/A  N/A    282754    M+C   /usr/bin/python3                 1629MiB |
|    0   N/A  N/A    886169    M+C   /usr/bin/python3                 1235MiB |
|    0   N/A  N/A    886170    M+C   /usr/bin/python3                 1237MiB |
|    0   N/A  N/A   1000687    M+C   /usr/bin/python3                 1351MiB |
|    0   N/A  N/A   1000688    M+C   /usr/bin/python3                 1349MiB |
|    0   N/A  N/A   1232400    M+C   /usr/bin/python3                 2613MiB |
```

[root@ip-10-20-61-45 etc]# dcgmi dmon -e 1004
#Entity   TENSO
ID
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000
GPU 0     0.000

Could you please suggest what is wrong here? Why am I not able to get the profiling metrics?

nvvfedorov commented 3 months ago

@PrakChandra, please share the DCGM logs, which you can find here: /var/log/nvidia-dcgm/*.log.
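A quick way to check whether those logs exist on a given machine or container is sketched below. The default directory is the one named in the comment above; the `list_dcgm_logs` helper is hypothetical, and where the logs actually end up may depend on where nv-hostengine is running.

```shell
# List DCGM log files under a directory, newest first.
# The default path is taken from the maintainer's comment in this thread.
list_dcgm_logs() {
    dir="${1:-/var/log/nvidia-dcgm}"
    if [ -d "$dir" ]; then
        # ls fails when the glob matches nothing, so fall back to a message.
        ls -t "$dir"/*.log 2>/dev/null || echo "no *.log files in $dir"
    else
        echo "directory $dir does not exist"
    fi
}

list_dcgm_logs
```

Running it both on the host and inside the container shows which side, if either, has the log directory.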

PrakChandra commented 3 months ago

@nvvfedorov I do not see the nvidia-dcgm folder in /var/log in my container. An update: when I change the image tag to latest (nvcr.io/nvidia/k8s/dcgm-exporter:latest), I see the following logs:


```
time="2024-05-24T08:48:21Z" level=info msg="Starting dcgm-exporter"
time="2024-05-24T08:48:21Z" level=info msg="DCGM successfully initialized!"
time="2024-05-24T08:48:21Z" level=info msg="Collecting DCP Metrics"
time="2024-05-24T08:48:21Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2024-05-24T08:48:21Z" level=info msg="Kubernetes metrics collection enabled!"
time="2024-05-24T08:48:21Z" level=info msg="Pipeline starting"
time="2024-05-24T08:48:21Z" level=info msg="Starting webserver"
```

However, I am not getting the metrics in Grafana. I can see the nv-hostengine logs, which do not look good:

```
2024-05-27 06:51:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:21.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.485 ERROR [1:28] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:52:51.530 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:53:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.485 ERROR [1:33] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:21.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.484 ERROR [1:30] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:4538] [DcgmCacheManager::GetMultipleLatestSamples]
2024-05-27 06:54:51.531 ERROR [1:25] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:10168] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
```
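Since the same few errors repeat every 30 seconds in the log above, it can help to collapse them into a count per field ID before looking up which metrics are affected. The following is a minimal sketch using only standard grep, sort, uniq, and awk; the `summarize_field_errors` function name is hypothetical, and the source paths in the demo lines are shortened to "[...]".

```shell
# Count ERROR lines per DCGM field ID in a host-engine log read from stdin.
# Prints one "<fieldId> <count>" pair per line.
summarize_field_errors() {
    grep ' ERROR ' \
        | grep -o 'fieldId [0-9][0-9]*' \
        | sort \
        | uniq -c \
        | awk '{ print $3, $1 }'
}

# Demo on two lines copied from the log above (source paths shortened).
summarize_field_errors <<'EOF'
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned No data is available for entityId 0 groupId 1 fieldId 230 [...]
2024-05-27 06:52:21.485 ERROR [1:31] GetLatestSample returned Feature not supported for entityId 0 groupId 1 fieldId 449 [...]
EOF
```

To run it on a real file, redirect the log into the function, for example `summarize_field_errors < host.log`, where host.log is whatever path was passed to nv-hostengine -f. Lines without a fieldId, such as the ReadAndCacheNvLinkBandwidthTotal entries, are not counted.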

Am I missing something?

PrakChandra commented 3 months ago

Also, the .csv file looks like this:


# Format,,
# If line starts with a '#' it is considered a comment,,
# DCGM FIELD, Prometheus metric type, help message

# Clocks,,
DCGM_FI_DEV_SM_CLOCK,  gauge, SM clock frequency (in MHz).
DCGM_FI_DEV_MEM_CLOCK, gauge, Memory clock frequency (in MHz).

# Temperature,,
DCGM_FI_DEV_MEMORY_TEMP, gauge, Memory temperature (in C).
DCGM_FI_DEV_GPU_TEMP,    gauge, GPU temperature (in C).

# Power,,
DCGM_FI_DEV_POWER_USAGE,              gauge, Power draw (in W).
DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION, counter, Total energy consumption since boot (in mJ).

# PCIE,,
DCGM_FI_DEV_PCIE_TX_THROUGHPUT,  counter, Total number of bytes transmitted through PCIe TX (in KB) via NVML.
DCGM_FI_DEV_PCIE_RX_THROUGHPUT,  counter, Total number of bytes received through PCIe RX (in KB) via NVML.
DCGM_FI_DEV_PCIE_REPLAY_COUNTER, counter, Total number of PCIe retries.

# Utilization (the sample period varies depending on the product),,
DCGM_FI_DEV_GPU_UTIL,      gauge, GPU utilization (in %).
DCGM_FI_DEV_MEM_COPY_UTIL, gauge, Memory utilization (in %).
DCGM_FI_DEV_ENC_UTIL,      gauge, Encoder utilization (in %).
DCGM_FI_DEV_DEC_UTIL ,     gauge, Decoder utilization (in %).

# Errors and violations,,
DCGM_FI_DEV_XID_ERRORS,            gauge,   Value of the last XID error encountered.
# DCGM_FI_DEV_POWER_VIOLATION,       counter, Throttling duration due to power constraints (in us).
# DCGM_FI_DEV_THERMAL_VIOLATION,     counter, Throttling duration due to thermal constraints (in us).
# DCGM_FI_DEV_SYNC_BOOST_VIOLATION,  counter, Throttling duration due to sync-boost constraints (in us).
# DCGM_FI_DEV_BOARD_LIMIT_VIOLATION, counter, Throttling duration due to board limit constraints (in us).
# DCGM_FI_DEV_LOW_UTIL_VIOLATION,    counter, Throttling duration due to low utilization (in us).
# DCGM_FI_DEV_RELIABILITY_VIOLATION, counter, Throttling duration due to reliability constraints (in us).

# Memory usage,,
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# ECC,,
# DCGM_FI_DEV_ECC_SBE_VOL_TOTAL, counter, Total number of single-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_DBE_VOL_TOTAL, counter, Total number of double-bit volatile ECC errors.
# DCGM_FI_DEV_ECC_SBE_AGG_TOTAL, counter, Total number of single-bit persistent ECC errors.
# DCGM_FI_DEV_ECC_DBE_AGG_TOTAL, counter, Total number of double-bit persistent ECC errors.

# Retired pages,,
# DCGM_FI_DEV_RETIRED_SBE,     counter, Total number of retired pages due to single-bit errors.
# DCGM_FI_DEV_RETIRED_DBE,     counter, Total number of retired pages due to double-bit errors.
# DCGM_FI_DEV_RETIRED_PENDING, counter, Total number of pages pending retirement.

# NVLink,,
# DCGM_FI_DEV_NVLINK_CRC_FLIT_ERROR_COUNT_TOTAL, counter, Total number of NVLink flow-control CRC errors.
# DCGM_FI_DEV_NVLINK_CRC_DATA_ERROR_COUNT_TOTAL, counter, Total number of NVLink data CRC errors.
# DCGM_FI_DEV_NVLINK_REPLAY_ERROR_COUNT_TOTAL,   counter, Total number of NVLink retries.
# DCGM_FI_DEV_NVLINK_RECOVERY_ERROR_COUNT_TOTAL, counter, Total number of NVLink recovery errors.
DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL,            counter, Total number of NVLink bandwidth counters for all lanes

# VGPU License status,,
DCGM_FI_DEV_VGPU_LICENSE_STATUS, gauge, vGPU License status

# Remapped rows,,
DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS, counter, Number of remapped rows for uncorrectable errors
DCGM_FI_DEV_CORRECTABLE_REMAPPED_ROWS,   counter, Number of remapped rows for correctable errors
DCGM_FI_DEV_ROW_REMAP_FAILURE,           gauge,   Whether remapping of rows has failed