NVIDIA / dcgm-exporter

NVIDIA GPU metrics exporter for Prometheus leveraging DCGM
Apache License 2.0

Profiling metrics not being collected #22

Closed ppreet closed 6 months ago

ppreet commented 2 years ago

Hello,

dcgmi version: 2.2.9

I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can't seem to get profiling metrics to show up, though other metrics show up fine.

root@node-0:/etc/dcgm-exporter# dcgm-exporter -f etc/dcp-metrics-included.csv  -a :9402
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file etc/dcp-metrics-included.csv
WARN[0000] Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled
WARN[0000] Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled

Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

It looks like the profiling module fails to load:

root@node-0:/etc/dcgm-exporter# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

Though I'm not sure whether this is attributable to dcgm-exporter or DCGM itself, because I can't get the metrics to load even when using dcgmi directly:

root@node-0:/home/user# dcgmi dmon -e 1010
# Entity                 PCIRX
      Id
Error setting watches. Result: This request is serviced by a module of DCGM that is not currently loaded

I followed the instructions to build dcgm-exporter from source, and the service runs inside a sidecar container that is responsible for collecting metrics.

How can I enable the collection of profiling metrics?

nikkon-dev commented 2 years ago

Hello,

The DCP metrics (field IDs 1001-1012) are supported only on Volta and newer architectures. Kepler is not supported.
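If you're unsure which architecture a GPU belongs to, you can check its compute capability with nvidia-smi. Volta corresponds to compute capability 7.0 and Kepler to 3.x, so a major version below 7 means no DCP support. This is a sketch; the `compute_cap` query field requires a reasonably recent driver, and on older drivers you'd match the GPU name against NVIDIA's product tables instead:

```shell
# Print each GPU's name and compute capability, then decide DCP support.
# Volta = compute capability 7.0; Kepler (e.g. K80) is 3.x.
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader |
  awk -F', ' '{ print $1, ($2 + 0 >= 7.0 ? "- DCP supported" : "- DCP not supported") }'
```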

WBR, Nik

babinskiy commented 2 years ago

Hello,

I have Ampere A40 GPU, but I also have the same error:

dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

What could be the reason for this?

nikkon-dev commented 2 years ago

@babinskiy,

There may be several reasons. Could you provide us with the debug logs from nv-hostengine? You can collect them with: nv-hostengine -f host.log --log-level debug

WBR, Nik

babinskiy commented 2 years ago

Hi @nikkon-dev, thanks for your response.

The only related entries I found in the log:

2022-04-26 06:20:26.707 DEBUG [22375:22377] Processing request of type 10 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Added GroupId 2 name dcgmi_22409_1 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:273] [DcgmGroupManager::AddNewGroup]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Processing request of type 47 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Got 2 entities and 1 fields [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5763] [DcgmHostEngineHandler::WatchFieldGroup]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Adding WatchInfo on entityKey 0x103e900000000 (eg 1, entityId 0, fieldId 1001) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2054] [DcgmCacheManager::GetEntityWatchInfo]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding new watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3021] [DcgmCacheManager::AddOrUpdateWatcher]
2022-04-26 06:20:26.708 DEBUG [22375:22377] UpdateWatchFromWatchers minMonitorFreqUsec 5000, minMaxAgeUsec 1000000, hsw 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3063] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-04-26 06:20:26.708 DEBUG [22375:22377] AddFieldWatch eg 1, eid 0, fieldId 1001, mfu 5000, msa 0.000000, mka 2, sfu 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3156] [DcgmCacheManager::AddEntityFieldWatch]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding WatchInfo on entityKey 0x103e900000001 (eg 1, entityId 1, fieldId 1001) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2054] [DcgmCacheManager::GetEntityWatchInfo]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding new watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3021] [DcgmCacheManager::AddOrUpdateWatcher]
2022-04-26 06:20:26.708 DEBUG [22375:22377] UpdateWatchFromWatchers minMonitorFreqUsec 5000, minMaxAgeUsec 1000000, hsw 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3063] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-04-26 06:20:26.708 DEBUG [22375:22377] AddFieldWatch eg 1, eid 1, fieldId 1001, mfu 5000, msa 0.000000, mka 2, sfu 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3156] [DcgmCacheManager::AddEntityFieldWatch]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (8, 0x7f70fb244028) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/entry_point.h:908] [dcgmModuleIdToName]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Returning 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/entry_point.h:908] [dcgmModuleIdToName]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] Initialized logging for module 8 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:91] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] Logger address 0x7f70f8294740 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:92] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] __DCGM_PROF_NO_SKU_CHECK was NOT set. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:450] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]
2022-04-26 06:20:26.722 DEBUG [22375:22377] [[Profiling]] NVPW_InitializeTarget() was successful. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1215] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2022-04-26 06:20:26.722 ERROR [22375:22377] [[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1216] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2022-04-26 06:20:26.722 ERROR [22375:22377] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:385] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2022-04-26 06:20:26.723 ERROR [22375:22377] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:148] [{anonymous}::SafeWrapper]
2022-04-26 06:20:26.723 ERROR [22375:22377] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3617] [DcgmHostEngineHandler::LoadModule]
2022-04-26 06:20:26.723 ERROR [22375:22377] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5828] [DcgmHostEngineHandler::WatchFieldGroup]
2022-04-26 06:20:26.723 DEBUG [22375:22377] Got 2 entities and 1 fields [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5870] [DcgmHostEngineHandler::UnwatchFieldGroup]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveWatcher removing existing watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2966] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveEntityFieldWatch eg 1, eid 0, nvmlFieldId 1001, clearCache 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3212] [DcgmCacheManager::RemoveEntityFieldWatch]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveWatcher removing existing watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2966] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveEntityFieldWatch eg 1, eid 1, nvmlFieldId 1001, clearCache 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3212] [DcgmCacheManager::RemoveEntityFieldWatch]
2022-04-26 06:20:26.723 WARN  [22375:22377] Skipping loading of module 8 in status 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3534] [DcgmHostEngineHandler::LoadModule]
2022-04-26 06:20:26.723 ERROR [22375:22377] DCGM_PROFILING_SR_UNWATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5914] [DcgmHostEngineHandler::UnwatchFieldGroup]
2022-04-26 06:20:44.586 DEBUG [22375:22377] Processing request of type 3 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:44.586 DEBUG [22375:22377] persistAfterDisconnect 0 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:231] [DcgmHostEngineHandler::ProcessClientLogin]
2022-04-26 06:20:44.587 DEBUG [22375:22377] Removed 0 groups for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:364] [DcgmGroupManager::RemoveAllGroupsForConnection]
2022-04-26 06:20:44.587 DEBUG [22375:22377] No field groups found for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmFieldGroup.cpp:392] [DcgmFieldGroupManager::OnConnectionRemove]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
...

Full version of log I uploaded here: https://fex.net/s/2p0p1bm

Will be grateful for any help!

nikkon-dev commented 2 years ago

@babinskiy,

Could you confirm that persistence mode is enabled on the GPU? The nvidia-smi output will show it; run nvidia-smi -pm 1 to enable it.
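A minimal check-and-enable sequence might look like the following sketch (assumes root; note that persistence mode does not survive a reboot unless something like the nvidia-persistenced service re-enables it):

```shell
# Show current persistence mode per GPU (Enabled/Disabled).
nvidia-smi --query-gpu=name,persistence_mode --format=csv,noheader

# Enable persistence mode on all GPUs (requires root).
nvidia-smi -pm 1
```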

yh0413 commented 1 year ago

Hi @nikkon-dev,

I'm currently running dcgm-exporter 2.3.5-2.6.5 without any problems except for DCP metrics under MIG. To resolve some DCGM issues with DCP metrics for MIG, I tried updating dcgm-exporter to 3.0.4-3.0.0, but the same problem occurs as described above.

Any help would be appreciated.

Env

  • Kubernetes v1.19.9
  • A30
  • NVIDIA Driver 460.73.01 (persistence mode is enabled)

Apps related NVIDIA

  • nvidia-device-plugin v0.11.0
  • nvidia-dcgm-exporter 3.0.4-3.0.0 (starts nv-hostengine as an embedded process)

dcgm-exporter log

time="2022-11-21T05:43:47Z" level=info msg="Starting dcgm-exporter"
time="2022-11-21T05:43:47Z" level=info msg="DCGM successfully initialized!"
time="2022-11-21T05:43:47Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-11-21T05:43:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2022-11-21T05:43:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-11-21T05:43:49Z" level=info msg="Starting webserver"
time="2022-11-21T05:43:49Z" level=info msg="Pipeline starting"
nikkon-dev commented 1 year ago

@yh0413,

Running nv-hostengine inside a Docker container when MIG is enabled can be tricky. nv-hostengine uses the MIG management API to get MIG profile information, which is privileged functionality, and by default a container does not have the capability required to access it. For example, this is how you could run a Docker container so that it can access the MIG API:

$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
  --gpus all \
  -e NVIDIA_VISIBLE_DEVICES=all \
  -e NVIDIA_MIG_CONFIG_DEVICES=all \
  -e NVIDIA_MIG_MONITOR_DEVICES=all \
  ...

Usually, when MIG is enabled, we recommend running nv-hostengine on bare metal and letting dcgm-exporter connect to it instead of running an embedded hostengine.
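A sketch of that setup, using the default nv-hostengine port 5555 and the exporter's -r flag that appear elsewhere in this thread (the address is illustrative):

```shell
# On the bare-metal host: start a standalone hostengine (listens on port 5555 by default).
nv-hostengine

# From the dcgm-exporter container: connect to the remote hostengine
# instead of starting an embedded one.
dcgm-exporter -r <host-ip>:5555
```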

I hope that helps.

WBR, Nik

yh0413 commented 1 year ago

It works well now. I solved the issue by connecting dcgm-exporter to an nv-hostengine running on the host.

Thank you!

wpso commented 1 year ago

Hi @yh0413, my VM with MIG hits the same problem: the DCGM profiling module fails to load. The CUDA version is 11.4 and the NVIDIA driver is 470.141.03. Do you have any suggestions?

nikkon-dev commented 1 year ago

@wpso,

Could you provide more information about your setup? Do you use passthrough or vgpu?

wpso commented 1 year ago

@nikkon-dev We use MIG vGPU for the VMs. We tried three DCGM versions (2.0.13, 2.0.15, and 2.1.5); both the host and the guest have the problem. The card is an A100 80G (20b5).

nikkon-dev commented 1 year ago

@wpso,

I'm a bit confused. vGPUs do not allow MIG configurations unless you are using the passthrough approach (i.e., granting the VM exclusive access to the whole GPU). What hypervisor are you using? In general, DCGM needs full access to the hardware, and the driver must be able to reach the MIG management API, which is usually not virtualized.

jack161641 commented 1 year ago

@nikkon-dev

Hi, I get the same error, even when I start dcgm-exporter against a standalone nv-hostengine:

root@release-name-dcgm-exporter-b2xrs:/# dcgm-exporter -r localhost:5555 -f /etc/dcgm-exporter/custom-collectors.csv -a :9401
INFO[0000] Starting dcgm-exporter
INFO[0000] Attemping to connect to remote hostengine at localhost:5555
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/custom-collectors.csv
WARN[0000] Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
WARN[0000] Skipping line 7 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
WARN[0000] Skipping line 21 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
WARN[0000] Skipping line 22 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 23 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled

Env:
DCGM Exporter version: 3.1.6-3.1.3
Driver Version: 460.91.03
CUDA Version: 11.2
Persistence-M: ON
GPU: Tesla V100-SXM2-32GB

nikkon-dev commented 1 year ago

@jack161641,

To determine the cause of the profiling module load failure, we must analyze the nv-hostengine debug logs. The reasons could be varied, ranging from unsupported GPUs to insufficient privileges.

To obtain the debug logs, you can restart the nv-hostengine with the following arguments: nv-hostengine -f /tmp/host.debug.log --log-level debug
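Once the hostengine is running with debug logging, reproducing the failure and then filtering the log for the profiling module usually surfaces the root cause. A rough sequence (paths illustrative, flags as used earlier in this thread):

```shell
# Restart the hostengine with debug logging.
nv-hostengine -f /tmp/host.debug.log --log-level debug

# Trigger the profiling module load, then pull out its error lines.
dcgmi profile -l
grep 'ERROR' /tmp/host.debug.log | grep '\[\[Profiling\]\]'
```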

chenaidong1 commented 10 months ago

I have the same problem.

Environment:

# dcgmi -v
Version        : 2.4.6
Build ID       : 11
Build Date     : 2022-07-06
Build Type     : Release
Commit ID      : b21fb88d38b2d70a5b3330e5806962ad6f207e69
Branch Name    : rel_dcgm_2_4
CPU Arch       : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64

nvidia-smi output:

Wed Nov 15 07:27:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:27:00.0 Off |                    0 |
| N/A   39C    P8              34W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|

I can't seem to get profiling metrics to show up, though other metrics show up fine.

dcgm-exporter log:

INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv
WARN[0000] Skipping line 13 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 14 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
WARN[0000] Skipping line 15 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 16 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled
WARN[0000] Skipping line 17 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
WARN[0000] Skipping line 18 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
INFO[0000] Pipeline starting
INFO[0000] Starting webserver

nv-hostengine log (/var/log/nv-hostengine.log):

2023-11-15 07:18:06.426 ERROR [104:104] [[NvSwitch]] AttachToNscq() returned -25 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/nvswitch/DcgmNvSwitchManager.cpp:317] [DcgmNs::DcgmNvSwitchManager::Init]
2023-11-15 07:18:06.426 ERROR [104:104] [[NvSwitch]] Could not initialize switch manager. Ret: DCGM library could not be found [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/nvswitch/DcgmModuleNvSwitch.cpp:34] [DcgmNs::DcgmModuleNvSwitch::DcgmModuleNvSwitch]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1353] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:481] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/DcgmModule.h:148] [{anonymous}::SafeWrapper]
2023-11-15 07:18:06.453 ERROR [104:104] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3671] [DcgmHostEngineHandler::LoadModule]
2023-11-15 07:18:06.542 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
(last line repeated several times)

Any help would be appreciated.

nikkon-dev commented 10 months ago

@chenaidong1,

In your case, you need to update the dcgm-exporter to a newer version.

You are using DCGM 2.4.6, which is quite outdated and does not support L40S GPUs. Try using dcgm-exporter based on the 3.2.x or 3.3.x releases.

chenaidong1 commented 10 months ago

@nikkon-dev
Thanks for your reply.
I tried running dcgm-exporter based on the 3.2.x/3.3.x releases on the host, and profiling metrics are collected now. However, in another environment, using dcgm-exporter based on version 3.2.5, it fails to collect profiling metrics.

The environment information is as follows:

nvidia-smi
Tue Nov 14 03:38:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:09.0 Off |                  Off |
| N/A   53C    P8    18W /  70W |     86MiB / 16384MiB |      0%      Default |

dcgmi profile -l
Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.

What is the reason for the failure? CUDA is not installed; do profiling metrics depend on CUDA?

ryan4yin commented 10 months ago

Encountered this problem on GKE's NVIDIA L4 machines; fixed by upgrading the Docker images of dcgm-exporter and dcgm to 3.3.0.

NierYYDS commented 9 months ago

Hi @nikkon-dev,

I'm currently running dcgm-exporter 2.3.5-2.6.5 without any problems except for DCP metrics under MIG. To resolve some DCGM issues with DCP metrics for MIG, I tried updating dcgm-exporter to 3.0.4-3.0.0, but the same problem occurs as described above.

Any help would be appreciated.

Env

  • Kubernetes v1.19.9
  • A30
  • NVIDIA Driver 460.73.01 (persistence mode is enabled)

Apps related NVIDIA

  • nvidia-device-plugin v0.11.0
  • nvidia-dcgm-exporter 3.0.4-3.0.0 (starts nv-hostengine as an embedded process)

dcgm-exporter log

time="2022-11-21T05:43:47Z" level=info msg="Starting dcgm-exporter"
time="2022-11-21T05:43:47Z" level=info msg="DCGM successfully initialized!"
time="2022-11-21T05:43:47Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-11-21T05:43:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2022-11-21T05:43:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-11-21T05:43:49Z" level=info msg="Starting webserver"
time="2022-11-21T05:43:49Z" level=info msg="Pipeline starting"

I've discovered some driver/CUDA compatibility issues when collecting DCP metrics. NVIDIA Driver 460.73.01, which shipped with CUDA 11.2, is not compatible with nvidia-dcgm-exporter 3.0.4-3.0.0, as that image was built on CUDA 11.7. In my case, I resolved the issue by using an older image built on CUDA 11.2.
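When in doubt, compare the highest CUDA version the host driver supports against the CUDA version the exporter image was built with (encoded in the image tag). A quick way to read the driver side (a sketch; it just scrapes the nvidia-smi banner line):

```shell
# Extract the maximum CUDA version supported by the installed driver
# from the nvidia-smi banner, e.g. "11.2".
nvidia-smi | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p'
```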

melikeiremguler commented 7 months ago

Hi, @nikkon-dev

We're using dcgm-exporter:3.1.8-3.1.5-ubuntu20.04 on Kubernetes (v1.26.6). Additionally, we are utilizing GRID-A100D-7-80C-MIG-7g.80gb. We have observed some errors in the dcgm-exporter pod logs.

time="2024-02-16T11:25:47Z" level=info msg="Starting dcgm-exporter"
time="2024-02-16T11:25:47Z" level=info msg="DCGM successfully initialized!"
time="2024-02-16T11:25:47Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-16T11:25:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"

It looks like the profiling module fails to load:

```text
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Loaded                                           |
| 8         | Profiling          | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+
```

Running `dcgmi dmon -e 1010` fails with error code -33:

```text
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi dmon -e 1010
#Entity   PCIRX
ID
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded
```

```shell
$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
    --gpus all \
    -e NVIDIA_VISIBLE_DEVICES=all \
    -e NVIDIA_MIG_CONFIG_DEVICES=all \
    -e NVIDIA_MIG_MONITOR_DEVICES=all \
    ...
```

We tried the command you shared on the dcgm-exporter DaemonSet, as shown below.

```yaml
spec:
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nvidia-dcgm-exporter
        app.kubernetes.io/managed-by: gpu-operator
        helm.sh/chart: gpu-operator-v23.6.1
    spec:
      containers:
      - env:
        - name: DCGM_EXPORTER_LISTEN
          value: :9400
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: /etc/dcgm-exporter/dcp-metrics-included.csv
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_MIG_CONFIG_DEVICES
          value: all
        - name: NVIDIA_MIG_MONITOR_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: all
        image: nvcr.io/nvidia/k8s/dcgm-exporter:3.1.8-3.1.5-ubuntu20.04
        imagePullPolicy: IfNotPresent
        name: nvidia-dcgm-exporter
        ports:
        - containerPort: 9400
          name: metrics
          protocol: TCP
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/pod-resources
          name: pod-gpu-resources
          readOnly: true
      dnsConfig:
        options:
        - name: ndots
          value: "2"
      dnsPolicy: ClusterFirst
      initContainers:
      - args:
        - until [ -f /run/nvidia/validations/toolkit-ready ]; do echo waiting for nvidia
          container stack to be setup; sleep 5; done
        command:
        - sh
        - -c
        image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.6.1
        imagePullPolicy: IfNotPresent
        name: toolkit-validation
        securityContext:
          capabilities:
            add:
            - SYS_ADMIN
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /run/nvidia
          mountPropagation: HostToContainer
          name: run-nvidia
      nodeSelector:
        nvidia.com/gpu.deploy.dcgm-exporter: "true"
      priorityClassName: system-node-critical
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      serviceAccount: nvidia-dcgm-exporter
      serviceAccountName: nvidia-dcgm-exporter
      terminationGracePeriodSeconds: 30
      tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: ""
        name: pod-gpu-resources
      - hostPath:
          path: /run/nvidia
          type: ""
        name: run-nvidia
```
We also want to share the node labels added by the gpu-operator.

```text
Labels: beta.kubernetes.io/arch=amd64
        beta.kubernetes.io/os=linux
        feature.node.kubernetes.io/cpu-cpuid.ADX=true
        feature.node.kubernetes.io/cpu-cpuid.AESNI=true
        feature.node.kubernetes.io/cpu-cpuid.AVX=true
        feature.node.kubernetes.io/cpu-cpuid.AVX2=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
        feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
        feature.node.kubernetes.io/cpu-cpuid.AVXVNNIINT8=true
        feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
        feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
        feature.node.kubernetes.io/cpu-cpuid.FMA3=true
        feature.node.kubernetes.io/cpu-cpuid.FSRM=true
        feature.node.kubernetes.io/cpu-cpuid.FXSR=true
        feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
        feature.node.kubernetes.io/cpu-cpuid.GFNI=true
        feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR=true
        feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
        feature.node.kubernetes.io/cpu-cpuid.IBPB=true
        feature.node.kubernetes.io/cpu-cpuid.LAHF=true
        feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
        feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
        feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
        feature.node.kubernetes.io/cpu-cpuid.PSFD=true
        feature.node.kubernetes.io/cpu-cpuid.SHA=true
        feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
        feature.node.kubernetes.io/cpu-cpuid.STIBP=true
        feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
        feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
        feature.node.kubernetes.io/cpu-cpuid.VAES=true
        feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
        feature.node.kubernetes.io/cpu-cpuid.WBNOINVD=true
        feature.node.kubernetes.io/cpu-cpuid.X87=true
        feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
        feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
        feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
        feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
        feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
        feature.node.kubernetes.io/cpu-hardware_multithreading=false
        feature.node.kubernetes.io/cpu-model.family=6
        feature.node.kubernetes.io/cpu-model.id=106
        feature.node.kubernetes.io/cpu-model.vendor_id=Intel
        feature.node.kubernetes.io/custom-rdma.available=true
        feature.node.kubernetes.io/kernel-config.NO_HZ=true
        feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
        feature.node.kubernetes.io/kernel-version.full=5.15.0-91-generic
        feature.node.kubernetes.io/kernel-version.major=5
        feature.node.kubernetes.io/kernel-version.minor=15
        feature.node.kubernetes.io/kernel-version.revision=0
        feature.node.kubernetes.io/memory-numa=true
        feature.node.kubernetes.io/pci-10de.present=true
        feature.node.kubernetes.io/pci-15ad.present=true
        feature.node.kubernetes.io/system-os_release.ID=ubuntu
        feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
        feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
        feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
        kubernetes.io/arch=amd64
        kubernetes.io/os=linux
        node-role.kubernetes.io/gpu-operator=
        nvidia.com/cuda.driver.major=535
        nvidia.com/cuda.driver.minor=154
        nvidia.com/cuda.driver.rev=05
        nvidia.com/cuda.runtime.major=12
        nvidia.com/cuda.runtime.minor=2
        nvidia.com/device-plugin.config=a100d-7-80c-mig-7g-80gb
        nvidia.com/gfd.timestamp=1707994189
        nvidia.com/gpu-driver-upgrade-state=upgrade-done
        nvidia.com/gpu.compute.major=8
        nvidia.com/gpu.compute.minor=0
        nvidia.com/gpu.count=1
        nvidia.com/gpu.deploy.container-toolkit=true
        nvidia.com/gpu.deploy.dcgm=true
        nvidia.com/gpu.deploy.dcgm-exporter=true
        nvidia.com/gpu.deploy.device-plugin=true
        nvidia.com/gpu.deploy.driver=pre-installed
        nvidia.com/gpu.deploy.gpu-feature-discovery=true
        nvidia.com/gpu.deploy.mig-manager=true
        nvidia.com/gpu.deploy.node-status-exporter=true
        nvidia.com/gpu.deploy.nvsm=paused-for-mig-change
        nvidia.com/gpu.deploy.operator-validator=true
        nvidia.com/gpu.engines.copy=7
        nvidia.com/gpu.engines.decoder=5
        nvidia.com/gpu.engines.encoder=0
        nvidia.com/gpu.engines.jpeg=1
        nvidia.com/gpu.engines.ofa=1
        nvidia.com/gpu.family=ampere
        nvidia.com/gpu.memory=81920
        nvidia.com/gpu.multiprocessors=98
        nvidia.com/gpu.present=true
        nvidia.com/gpu.product=GRID-A100D-7-80C-MIG-7g.80gb
        nvidia.com/gpu.replicas=1
        nvidia.com/gpu.slices.ci=7
        nvidia.com/gpu.slices.gi=7
        nvidia.com/mig.capable=true
        nvidia.com/mig.config=all-7g.80gb
        nvidia.com/mig.config.state=success
        nvidia.com/mig.strategy=single
        nvidia.com/vgpu.host-driver-branch=r538_10
        nvidia.com/vgpu.host-driver-version=535.154.02
        nvidia.com/vgpu.present=true
```
We also ran nv-hostengine and obtained debug logs (`nv-hostengine -f host.log --log-level debug`):

```text
2024-02-16 08:59:03.773 ERROR [91:93] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:502] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2024-02-16 08:59:03.773 ERROR [91:93] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-16 08:59:03.773 ERROR [91:93] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1740] [DcgmHostEngineHandler::LoadModule]
2024-02-16 08:59:03.773 ERROR [91:93] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3542] [DcgmHostEngineHandler::WatchFieldGroup]
2024-02-16 08:59:03.773 WARN  [91:93] Skipping loading of module 8 in status 2 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:1664] [DcgmHostEngineHandler::LoadModule]
2024-02-16 08:59:03.773 ERROR [91:93] DCGM_PROFILING_SR_UNWATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3636] [DcgmHostEngineHandler::UnwatchFieldGroup]
```
nvidia-smi output:

```text
[root@nvidia-device-plugin-daemonset-w7q69 /]# nvidia-smi
Fri Feb 16 13:20:33 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  GRID A100D-7-80C               On  | 00000000:02:00.0 Off |                   On |
| N/A   N/A    P0             N/A /  N/A  |      0MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    0   0   0  |              0MiB / 76011MiB   | 98      0 |  7   0    5    1    1 |
|                  |                0MiB /  4096MiB |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

[root@nvidia-device-plugin-daemonset-w7q69 /]# nvidia-smi -q

==============NVSMI LOG==============

Timestamp                                 : Fri Feb 16 13:20:35 2024
Driver Version                            : 535.154.05
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:02:00.0
    Product Name                          : GRID A100D-7-80C
    Product Brand                         : NVIDIA Virtual Compute Server
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : None
    MIG Mode
        Current                           : Enabled
        Pending                           : Enabled
    MIG Device
        Index                             : 0
        GPU Instance ID                   : 0
        Compute Instance ID               : 0
        Device Attributes
            Shared
                Multiprocessor count      : 98
                Copy Engine count         : 7
                Encoder count             : 0
                Decoder count             : 5
                OFA count                 : 1
                JPG count                 : 1
        ECC Errors
            Volatile
                SRAM Uncorrectable        : 0
        FB Memory Usage
            Total                         : 76011 MiB
            Reserved                      : 0 MiB
            Used                          : 0 MiB
            Free                          : 76011 MiB
        BAR1 Memory
            Total                         : 4096 MiB
            Used                          : 0 MiB
            Free                          : 4096 MiB
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    Minor Number                          : 0
    MultiGPU Board                        : No
    FRU Part Number                       : N/A
    Module ID                             : N/A
    Inforom Version
        Image Version                     : N/A
        OEM Object                        : N/A
        ECC Object                        : N/A
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : N/A
    GPU Virtualization Mode
        Virtualization Mode               : VGPU
        Host VGPU Mode                    : N/A
    vGPU Software Licensed Product
        Product Name                      : NVIDIA Virtual Compute Server
    GPU Reset Status
        Reset Required                    : N/A
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        GPU Link Info
            PCIe Generation
                Max                       : N/A
                Current                   : N/A
                Device Current            : N/A
                Device Max                : N/A
                Host Max                  : N/A
            Link Width
                Max                       : N/A
                Current                   : N/A
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : N/A
        Replay Number Rollovers           : N/A
        Tx Throughput                     : N/A
        Rx Throughput                     : N/A
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons                  : N/A
    FB Memory Usage
        Total                             : 81920 MiB
        Reserved                          : 5908 MiB
        Used                              : 0 MiB
        Free                              : 76011 MiB
    BAR1 Memory Usage
        Total                             : 4096 MiB
        Used                              : 0 MiB
        Free                              : 4096 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : N/A
        Memory                            : N/A
        Encoder                           : N/A
        Decoder                           : N/A
        JPEG                              : N/A
        OFA                               : N/A
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : N/A
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : N/A
        GPU Slowdown Temp                 : N/A
        GPU Max Operating Temp            : N/A
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Module Power Readings
        Power Draw                        : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 1410 MHz
        SM                                : 1410 MHz
        Memory                            : 1512 MHz
        Video                             : 1275 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : N/A
        SM                                : N/A
        Memory                            : N/A
        Video                             : N/A
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
    Processes                             : None
```

We checked the health of the GPU using the script available at https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh; it successfully passed all the steps.

```text
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi diag -i 0 -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.1.8                                          |
| Driver Version Detected   | 535.154.05                                     |
| GPU Device IDs Detected   | 20b5                                           |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Skip                                           |
+-----  Integration  -------+------------------------------------------------+
| PCIe                      | Pass - All                                     |
+-----  Hardware  ----------+------------------------------------------------+
| GPU Memory                | Pass - All                                     |
+-----  Stress  ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
```

cc: @Dentrax

nvvfedorov commented 6 months ago

It sounds like the issue was resolved.

jacksonyi0 commented 4 months ago

```text
root@68e97f630ad1:/etc/dcgm-exporter# dcgm-exporter -f dcp-metrics-included.csv
2024/05/16 03:45:44 maxprocs: Leaving GOMAXPROCS=64: CPU quota undefined
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'dcp-metrics-included.csv'
WARN[0000] Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
WARN[0000] Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
WARN[0000] Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
INFO[0000] Initializing system entities of type: GPU
INFO[0000] Not collecting NvSwitch metrics; no fields to watch for device type: 3
INFO[0000] Not collecting NvLink metrics; no fields to watch for device type: 6
INFO[0000] Not collecting CPU metrics; no fields to watch for device type: 7
INFO[0000] Not collecting CPU Core metrics; no fields to watch for device type: 8
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"
```

What should I do? I started dcgm-exporter in container mode and ran `nv-hostengine -f host.log --log-level debug` on the host, which failed with `Err: Failed to start DCGM Server: -7`. `dcgmi modules -l` displays:

```text
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Not loaded                                       |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
```

nikkon-dev commented 4 months ago

@jack161641,

According to this error message:

FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"

You already have another dcgm-exporter instance running, or another process is occupying port 9400. Also note that only one nv-hostengine instance may run per GPU (it does not matter whether it is standalone, embedded, bare-metal, or containerized).
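To confirm the port conflict before restarting the exporter, a quick probe like the following can help. This is a sketch for a Linux host; the pure-bash `/dev/tcp` check avoids depending on `ss` or `lsof` being installed in the container:

```shell
# Probe whether something is already listening on port 9400 before starting
# another dcgm-exporter instance.
if (exec 3<>/dev/tcp/127.0.0.1/9400) 2>/dev/null; then
    echo "port 9400 is in use"
else
    echo "port 9400 is free"
fi

# To identify the owning process, if the usual tools are installed:
#   ss -ltnp 'sport = :9400'
#   lsof -i :9400
```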

jacksonyi0 commented 4 months ago


Hi, thank you for your reply. But I started it through docker-compose and couldn't collect these metrics. Even if I configure DCGM_FI_PROF_DRAM_ACTIVE, DCGM_FI_PROF_PIPE_FP32_ACTIVE, and DCGM_FI_PROF_PIPE_FP16_ACTIVE, it still reports an error:

```text
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'default-counters.csv'
WARN[0000] Skipping line 25 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 26 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
WARN[0000] Skipping line 27 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
WARN[0000] Skipping line 28 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
```

This is the information displayed by nvidia-smi:

```text
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.3   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:08.0 Off |                  N/A |
|  0%   24C    P8    15W / 350W | 22204MiB / 24576MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:00:09.0 Off |                  N/A |
|  0%   25C    P8    17W / 350W | 19814MiB / 24576MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:00:0A.0 Off |                  N/A |
|  0%   24C    P8    16W / 350W |     0MiB / 24576MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:00:0B.0 Off |                  N/A |
|  0%   25C    P8    20W / 350W |     0MiB / 24576MiB  |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

nikkon-dev commented 4 months ago

@jacksonyi0, The profiling module that handles the DCGM_FI_PROF* metrics does not support consumer-grade GPUs (GeForce GTX/RTX). These metrics, known as DCP (Data Center Profiling) metrics, require datacenter-grade GPUs (V10x/A10x/H10x) or workstation-grade GPUs (previously known as Quadro).
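On a consumer GPU, one way to silence the startup warnings is to run the exporter with a counters file that simply omits the `DCGM_FI_PROF_*` rows. A minimal sketch; the CSV contents and temp-file paths below are illustrative stand-ins, not the real shipped file (which in a real deployment would be something like `/etc/dcgm-exporter/default-counters.csv`):

```shell
# Build a trimmed counters file without the unsupported DCGM_FI_PROF_* rows.
SRC=$(mktemp)
DST=$(mktemp)
cat > "$SRC" <<'EOF'
DCGM_FI_DEV_GPU_TEMP, gauge, GPU temperature (in C).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active.
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active.
EOF
grep -v '^DCGM_FI_PROF_' "$SRC" > "$DST"
cat "$DST"    # only the non-PROF rows remain
# Then start the exporter against the trimmed file:
#   dcgm-exporter -f "$DST"
```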

jacksonyi0 commented 4 months ago

Thank you for your reply, then I will have to think of other solutions.

jacksonyi0 commented 4 months ago

Hello, I use

```shell
docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04
```

to start the container, and it fails with:

```text
readlink: missing operand
Try 'readlink --help' for more information.
```

Entering the container via

```shell
docker run -ti --entrypoint=/bin/sh --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04
```

and running `bash /usr/local/dcgm/dcgm-exporter-entrypoint.sh` still reports:

```text
readlink: missing operand
Try 'readlink --help' for more information.
```

Running `/usr/bin/dcgm-exporter` directly reports the error `runtime/cgo: pthread_create failed: Operation not permitted`:

```text
SIGABRT: abort
PC=0x7f33397539fc m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: g 0: unknown pc 0x7f33397539fc
stack: frame={sp:0x7ffdbe6fa820, fp:0x0} stack=[0x7ffdbdefbda0,0x7ffdbe6fadb0)
0x00007ffdbe6fa720:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa730:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa740:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa750:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa760:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa770:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa780:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa790:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa7a0:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa7b0:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa7c0:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa7d0:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa7e0:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa7f0:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa800:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa810:  0x0000000000000000  0x00007f33397539ee
0x00007ffdbe6fa820: <0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa830:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa840:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa850:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa860:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa870:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa880:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa890:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa8a0:  0x0000000000000000  0xa8d8867e7227a900
0x00007ffdbe6fa8b0:  0x00007f33396ba740  0x0000000000000006
0x00007ffdbe6fa8c0:  0x0000000001d0e4f7  0x00007ffdbe6fabf0
0x00007ffdbe6fa8d0:  0x0000000002992bc0  0x00007f33396ff476
0x00007ffdbe6fa8e0:  0x00007f33398d8e90  0x00007f33396e57f3
0x00007ffdbe6fa8f0:  0x0000000000000020  0x0000000000000000
0x00007ffdbe6fa900:  0x0000000000000000  0x0000000000000000
0x00007ffdbe6fa910:  0x0000000000000000  0x0000000000000000
runtime: g 0: unknown pc 0x7f33397539fc
```

What is causing this problem? Please help.