Closed: ppreet closed this issue 6 months ago
Hello,
The DCP metrics (field IDs 1001-1012) are supported only on Volta and newer architectures. Kepler is not supported.
WBR, Nik
Hello,
I have an Ampere A40 GPU, but I get the same error:
dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules |
| Status: Success |
+===========+====================+==================================================+
| Module ID | Name | State |
+-----------+--------------------+--------------------------------------------------+
| 0 | Core | Loaded |
| 1 | NvSwitch | Loaded |
| 2 | VGPU | Not loaded |
| 3 | Introspection | Not loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Not loaded |
| 6 | Config | Not loaded |
| 7 | Diag | Not loaded |
| 8 | Profiling | Failed to load |
+-----------+--------------------+--------------------------------------------------+
What could be the reason for this?
@babinskiy,
There may be several reasons. Could you provide us the debug logs from the nv-hostengine?
nv-hostengine -f host.log --log-level debug
WBR, Nik
Hi @nikkon-dev, thanks for your response.
The only related entries I found in the log are:
2022-04-26 06:20:26.707 DEBUG [22375:22377] Processing request of type 10 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Added GroupId 2 name dcgmi_22409_1 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:273] [DcgmGroupManager::AddNewGroup]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Processing request of type 47 for connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Got 2 entities and 1 fields [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5763] [DcgmHostEngineHandler::WatchFieldGroup]
2022-04-26 06:20:26.707 DEBUG [22375:22377] Adding WatchInfo on entityKey 0x103e900000000 (eg 1, entityId 0, fieldId 1001) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2054] [DcgmCacheManager::GetEntityWatchInfo]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding new watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3021] [DcgmCacheManager::AddOrUpdateWatcher]
2022-04-26 06:20:26.708 DEBUG [22375:22377] UpdateWatchFromWatchers minMonitorFreqUsec 5000, minMaxAgeUsec 1000000, hsw 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3063] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-04-26 06:20:26.708 DEBUG [22375:22377] AddFieldWatch eg 1, eid 0, fieldId 1001, mfu 5000, msa 0.000000, mka 2, sfu 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3156] [DcgmCacheManager::AddEntityFieldWatch]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding WatchInfo on entityKey 0x103e900000001 (eg 1, entityId 1, fieldId 1001) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2054] [DcgmCacheManager::GetEntityWatchInfo]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Adding new watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3021] [DcgmCacheManager::AddOrUpdateWatcher]
2022-04-26 06:20:26.708 DEBUG [22375:22377] UpdateWatchFromWatchers minMonitorFreqUsec 5000, minMaxAgeUsec 1000000, hsw 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3063] [DcgmCacheManager::UpdateWatchFromWatchers]
2022-04-26 06:20:26.708 DEBUG [22375:22377] AddFieldWatch eg 1, eid 1, fieldId 1001, mfu 5000, msa 0.000000, mka 2, sfu 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3156] [DcgmCacheManager::AddEntityFieldWatch]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (8, 0x7f70fb244028) [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/entry_point.h:908] [dcgmModuleIdToName]
2022-04-26 06:20:26.708 DEBUG [22375:22377] Returning 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/entry_point.h:908] [dcgmModuleIdToName]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] Initialized logging for module 8 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:91] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] Logger address 0x7f70f8294740 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:92] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2022-04-26 06:20:26.708 DEBUG [22375:22377] [[Profiling]] __DCGM_PROF_NO_SKU_CHECK was NOT set. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:450] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]
2022-04-26 06:20:26.722 DEBUG [22375:22377] [[Profiling]] NVPW_InitializeTarget() was successful. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1215] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2022-04-26 06:20:26.722 ERROR [22375:22377] [[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1216] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2022-04-26 06:20:26.722 ERROR [22375:22377] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:385] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2022-04-26 06:20:26.723 ERROR [22375:22377] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_3-postmerge/modules/DcgmModule.h:148] [{anonymous}::SafeWrapper]
2022-04-26 06:20:26.723 ERROR [22375:22377] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3617] [DcgmHostEngineHandler::LoadModule]
2022-04-26 06:20:26.723 ERROR [22375:22377] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5828] [DcgmHostEngineHandler::WatchFieldGroup]
2022-04-26 06:20:26.723 DEBUG [22375:22377] Got 2 entities and 1 fields [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5870] [DcgmHostEngineHandler::UnwatchFieldGroup]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveWatcher removing existing watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2966] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveEntityFieldWatch eg 1, eid 0, nvmlFieldId 1001, clearCache 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3212] [DcgmCacheManager::RemoveEntityFieldWatch]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveWatcher removing existing watcher type 0, connectionId 1 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2966] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:26.723 DEBUG [22375:22377] RemoveEntityFieldWatch eg 1, eid 1, nvmlFieldId 1001, clearCache 0 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:3212] [DcgmCacheManager::RemoveEntityFieldWatch]
2022-04-26 06:20:26.723 WARN [22375:22377] Skipping loading of module 8 in status 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:3534] [DcgmHostEngineHandler::LoadModule]
2022-04-26 06:20:26.723 ERROR [22375:22377] DCGM_PROFILING_SR_UNWATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:5914] [DcgmHostEngineHandler::UnwatchFieldGroup]
2022-04-26 06:20:44.586 DEBUG [22375:22377] Processing request of type 3 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:2436] [DcgmHostEngineHandler::ProcessRequest]
2022-04-26 06:20:44.586 DEBUG [22375:22377] persistAfterDisconnect 0 for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmHostEngineHandler.cpp:231] [DcgmHostEngineHandler::ProcessClientLogin]
2022-04-26 06:20:44.587 DEBUG [22375:22377] Removed 0 groups for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmGroupManager.cpp:364] [DcgmGroupManager::RemoveAllGroupsForConnection]
2022-04-26 06:20:44.587 DEBUG [22375:22377] No field groups found for connectionId 2 [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmFieldGroup.cpp:392] [DcgmFieldGroupManager::OnConnectionRemove]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
2022-04-26 06:20:44.587 DEBUG [22375:22377] RemoveWatcher() type 0, connectionId 2 was not a watcher [/workspaces/dcgm-rel_dcgm_2_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:2992] [DcgmCacheManager::RemoveWatcher]
...
Full version of log I uploaded here: https://fex.net/s/2p0p1bm
I will be grateful for any help!
@babinskiy,
Could you confirm that persistence mode is enabled on the GPU? The output of
nvidia-smi
will tell you. Run
nvidia-smi -pm 1
to enable it.
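For reference, the check and the fix can be combined into a short sketch. This is only a sketch of the suggestion above; the query field names come from nvidia-smi's standard query options, and enabling persistence mode requires root:

```shell
# Show the current persistence mode for every GPU
nvidia-smi --query-gpu=index,name,persistence_mode --format=csv

# Enable persistence mode on all GPUs (needs root)
sudo nvidia-smi -pm 1
```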
Hi @nikkon-dev,
I'm currently running dcgm-exporter 2.3.5-2.6.5 without any problems except for DCP metrics on MIG. To resolve some DCGM issues with DCP metrics on MIG, I tried updating dcgm-exporter to 3.0.4-3.0.0, but the same problem occurs as above.
Any help would be appreciated.
time="2022-11-21T05:43:47Z" level=info msg="Starting dcgm-exporter"
time="2022-11-21T05:43:47Z" level=info msg="DCGM successfully initialized!"
time="2022-11-21T05:43:47Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-11-21T05:43:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 19 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2022-11-21T05:43:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
time="2022-11-21T05:43:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-11-21T05:43:49Z" level=info msg="Starting webserver"
time="2022-11-21T05:43:49Z" level=info msg="Pipeline starting"
@yh0413,
Running the nv-hostengine inside a Docker container when MIG is enabled can be tricky. The nv-hostengine uses the MIG management API to get MIG profile information (this is privileged functionality). By default, a container does not have the capability needed to access MIG profile information. For example, this is how you could run a Docker container to allow it to access the MIG API:
$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
--gpus all \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_MIG_CONFIG_DEVICES=all \
-e NVIDIA_MIG_MONITOR_DEVICES=all \
...
Usually, when MIG is enabled, we recommend running nv-hostengine on bare metal and letting dcgm-exporter connect to it instead of running an embedded hostengine.
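The recommended split can be sketched as follows. The -r flag appears elsewhere in this thread; the host address is a placeholder and port 5555 is nv-hostengine's default, so adapt both to your setup:

```shell
# On the bare-metal host: start a standalone hostengine
# (it listens on port 5555 by default)
sudo nv-hostengine

# In the dcgm-exporter container: connect to the remote hostengine
# on the host instead of starting an embedded one
dcgm-exporter -r <host-ip>:5555
```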
I hope that would help.
WBR, Nik
It works well. I solved the issue by connecting dcgm-exporter to the nv-hostengine running on the host.
Thank you!
Hi @yh0413, my VM with MIG has the same problem: the dcgmi profiling module fails to load. The CUDA version is 11.4 and the NVIDIA driver is 470.141.03. Do you have any suggestions?
@wpso,
Could you provide more information about your setup? Do you use passthrough or vgpu?
@nikkon-dev We use MIG vGPU for the VM. We tried three DCGM versions (2.0.13, 2.0.15, and 2.1.5); both the host and the guest have the problem. The card is an A100 80G (20b5).
@wpso,
I'm a bit confused. vGPUs do not allow MIG configurations unless you are using the passthrough approach (i.e., granting exclusive access to the whole GPU to the VM). What hypervisor are you using? In general, DCGM needs full access to the hardware, and the driver needs to be able to reach the MIG management API, which is usually not virtualized.
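One way to see what the guest actually gets is to query MIG mode and enumerate visible devices. This is a sketch; the query field names are assumed from nvidia-smi's --help-query-gpu list:

```shell
# Is MIG mode enabled on the GPU as seen from this environment?
nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv

# Enumerate GPUs (and MIG devices, if any) visible here
nvidia-smi -L
```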
@nikkon-dev
Hi, I get the same error, even if I start dcgm-exporter with nv-hostengine:
root@release-name-dcgm-exporter-b2xrs:/# dcgm-exporter -r localhost:5555 -f /etc/dcgm-exporter/custom-collectors.csv -a :9401
INFO[0000] Starting dcgm-exporter
INFO[0000] Attemping to connect to remote hostengine at localhost:5555
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/custom-collectors.csv
WARN[0000] Skipping line 6 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
WARN[0000] Skipping line 7 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
WARN[0000] Skipping line 21 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
WARN[0000] Skipping line 22 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 23 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
ENV:
DCGM Exporter version: 3.1.6-3.1.3
Driver Version: 460.91.03
CUDA Version: 11.2
Persistence-M: ON
GPU: Tesla V100-SXM2-32GB
@jack161641,
To determine the cause of the profiling module load failure, we must analyze the nv-hostengine debug logs. The reasons could be varied, ranging from unsupported GPUs to insufficient privileges.
To obtain the debug logs, you can restart the nv-hostengine with the following arguments: nv-hostengine -f /tmp/host.debug.log --log-level debug
I have the same problem.
Environment
# dcgmi -v
Version        : 2.4.6
Build ID       : 11
Build Date     : 2022-07-06
Build Type     : Release
Commit ID      : b21fb88d38b2d70a5b3330e5806962ad6f207e69
Branch Name    : rel_dcgm_2_4
CPU Arch       : x86_64
Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64
nvidia-smi cmd result:
Wed Nov 15 07:27:20 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03             Driver Version: 535.129.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA L40S                    On  | 00000000:27:00.0 Off |                    0 |
| N/A   39C    P8              34W / 350W |      3MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
I can't seem to get profiling metrics to show up, though other metrics show up fine.
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] No configmap data specified, falling back to metric file /etc/dcgm-exporter/default-counters.csv
WARN[0000] Skipping line 13 ('DCGM_FI_PROF_SM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 14 ('DCGM_FI_PROF_SM_OCCUPANCY'): metric not enabled
WARN[0000] Skipping line 15 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 16 ('DCGM_FI_PROF_PIPE_FP64_ACTIVE'): metric not enabled
WARN[0000] Skipping line 17 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
WARN[0000] Skipping line 18 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
nv-hostengine.log
/var/log/nv-hostengine.log
2023-11-15 07:18:06.426 ERROR [104:104] [[NvSwitch]] AttachToNscq() returned -25 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/nvswitch/DcgmNvSwitchManager.cpp:317] [DcgmNs::DcgmNvSwitchManager::Init]
2023-11-15 07:18:06.426 ERROR [104:104] [[NvSwitch]] Could not initialize switch manager. Ret: DCGM library could not be found [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/nvswitch/DcgmModuleNvSwitch.cpp:34] [DcgmNs::DcgmModuleNvSwitch::DcgmModuleNvSwitch]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] NVPW_DCGM_LoadDriver returned1 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1353] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:481] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2023-11-15 07:18:06.453 ERROR [104:104] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/modules/DcgmModule.h:148] [{anonymous}::SafeWrapper]
2023-11-15 07:18:06.453 ERROR [104:104] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3671] [DcgmHostEngineHandler::LoadModule]
2023-11-15 07:18:06.542 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2023-11-15 07:18:06.542 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2023-11-15 07:18:06.544 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
2023-11-15 07:18:06.544 ERROR [104:118] Got nvmlSt 3,3 from nvmlDeviceGetFieldValues fieldValues [/workspaces/dcgm-rel_dcgm_2_4-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:10670] [DcgmCacheManager::ReadAndCacheNvLinkBandwidthTotal]
Any help would be appreciated.
@chenaidong1,
In your case, you need to update the dcgm-exporter to a newer version.
You are using DCGM 2.4.6, which is quite outdated and does not support L40S GPUs. Try using dcgm-exporter based on the 3.2.x or 3.3.x releases.
@nikkon-dev
Thanks for your reply.
I tried running a dcgm-exporter based on the 3.2.x or 3.3.x releases on the host, and profiling metrics are collected now.
In another environment, I use a dcgm-exporter based on the 3.2.5 version, but it fails to collect profiling metrics.
The environment information is as follows:
nvidia-smi
Tue Nov 14 03:38:24 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:09.0 Off |                  Off |
| N/A   53C    P8     18W / 70W |     86MiB / 16384MiB |      0%      Default |

dcgmi profile -l
Error: Unable to Get supported metric groups: This request is serviced by a module of DCGM that is not currently loaded.
What is the reason for the failure? CUDA is not installed; do profiling metrics depend on CUDA?
Encountered this problem on a GKE NVIDIA L4 machine; fixed it by upgrading the Docker images of dcgm-exporter and dcgm to 3.3.0.
Hi @nikkon-dev,
Env
- Kubernetes v1.19.9
- A30
- NVIDIA Driver 460.73.01 (persistence mode is enabled)
Apps related NVIDIA
- nvidia-device-plugin v0.11.0
- nvidia-dcgm-exporter 3.0.4-3.0.0 (starts nv-hostengine as an embedded process)
I've discovered some driver/CUDA compatibility issues when collecting DCP metrics. NVIDIA Driver 460.73.01, which was shipped with CUDA 11.2, is not compatible with nvidia-dcgm-exporter 3.0.4-3.0.0 as it was built on CUDA 11.7. In my case, I resolved this issue by using an older image that was built on CUDA 11.2.
Hi, @nikkon-dev
We're using dcgm-exporter:3.1.8-3.1.5-ubuntu20.04 on Kubernetes (v1.26.6). Additionally, we are utilizing GRID-A100D-7-80C-MIG-7g.80gb. We have observed some errors in the dcgm-exporter pod logs.
time="2024-02-16T11:25:47Z" level=info msg="Starting dcgm-exporter"
time="2024-02-16T11:25:47Z" level=info msg="DCGM successfully initialized!"
time="2024-02-16T11:25:47Z" level=info msg="Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2024-02-16T11:25:47Z" level=info msg="No configmap data specified, falling back to metric file /etc/dcgm-exporter/dcp-metrics-included.csv"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled"
time="2024-02-16T11:25:47Z" level=warning msg="Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled"
It looks like the profiling module fails to load:
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules |
| Status: Success |
+===========+====================+==================================================+
| Module ID | Name | State |
+-----------+--------------------+--------------------------------------------------+
| 0 | Core | Loaded |
| 1 | NvSwitch | Loaded |
| 2 | VGPU | Not loaded |
| 3 | Introspection | Not loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Not loaded |
| 6 | Config | Not loaded |
| 7 | Diag | Loaded |
| 8 | Profiling | Failed to load |
+-----------+--------------------+--------------------------------------------------+
We encountered an error with code -33 while running the `dcgmi dmon -e 1010` command.
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi dmon -e 1010
#Entity PCIRX
ID
Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded
$ docker run --cap-add SYS_ADMIN --runtime=nvidia \
--gpus all \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_MIG_CONFIG_DEVICES=all \
-e NVIDIA_MIG_MONITOR_DEVICES=all \
...
We checked the health of the GPU using the script available at https://github.com/aws/aws-parallelcluster-cookbook/blob/v3.8.0/cookbooks/aws-parallelcluster-slurm/files/default/config_slurm/scripts/health_checks/gpu_health_check.sh. It successfully passed all the steps.
root@nvidia-dcgm-exporter-8kvn6:/# dcgmi diag -i 0 -r 2
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic | Result |
+===========================+================================================+
|----- Metadata ----------+------------------------------------------------|
| DCGM Version | 3.1.8 |
| Driver Version Detected | 535.154.05 |
| GPU Device IDs Detected | 20b5 |
|----- Deployment --------+------------------------------------------------|
| Denylist | Pass |
| NVML Library | Pass |
| CUDA Main Library | Pass |
| Permissions and OS Blocks | Pass |
| Persistence Mode | Pass |
| Environment Variables | Pass |
| Page Retirement/Row Remap | Pass |
| Graphics Processes | Pass |
| Inforom | Skip |
+----- Integration -------+------------------------------------------------+
| PCIe | Pass - All |
+----- Hardware ----------+------------------------------------------------+
| GPU Memory | Pass - All |
+----- Stress ------------+------------------------------------------------+
+---------------------------+------------------------------------------------+
cc: @Dentrax
It sounds like the issue was resolved.
root@68e97f630ad1:/etc/dcgm-exporter# dcgm-exporter -f dcp-metrics-included.csv
2024/05/16 03:45:44 maxprocs: Leaving GOMAXPROCS=64: CPU quota undefined
INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'dcp-metrics-included.csv'
WARN[0000] Skipping line 20 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): metric not enabled
WARN[0000] Skipping line 21 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
WARN[0000] Skipping line 22 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 23 ('DCGM_FI_PROF_PCIE_TX_BYTES'): metric not enabled
WARN[0000] Skipping line 24 ('DCGM_FI_PROF_PCIE_RX_BYTES'): metric not enabled
INFO[0000] Initializing system entities of type: GPU
INFO[0000] Not collecting NvSwitch metrics; no fields to watch for device type: 3
INFO[0000] Not collecting NvLink metrics; no fields to watch for device type: 6
INFO[0000] Not collecting CPU metrics; no fields to watch for device type: 7
INFO[0000] Not collecting CPU Core metrics; no fields to watch for device type: 8
INFO[0000] Pipeline starting
INFO[0000] Starting webserver
FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"
What should I do?
I started dcgm-exporter in container mode and ran nv-hostengine -f host.log --log-level debug on the host. I got the error: Err: Failed to start DCGM Server: -7
dcgmi modules -l displays:
+-----------+--------------------+--------------------------------------------------+
| List Modules |
| Status: Success |
+===========+====================+==================================================+
| Module ID | Name | State |
+-----------+--------------------+--------------------------------------------------+
| 0 | Core | Loaded |
| 1 | NvSwitch | Loaded |
| 2 | VGPU | Not loaded |
| 3 | Introspection | Not loaded |
| 4 | Health | Not loaded |
| 5 | Policy | Not loaded |
| 6 | Config | Not loaded |
| 7 | Diag | Not loaded |
| 8 | Profiling | Not loaded |
| 9 | SysMon | Not loaded |
+-----------+--------------------+--------------------------------------------------+
@jack161641,
According to this error message:
FATA[0000] Failed to Listen and Server HTTP server. error="listen tcp :9400: bind: address already in use"
You already have another dcgm-exporter instance running, or another process is occupying port 9400. There may be only one nv-hostengine instance per GPU (it does not matter whether it's standalone, embedded, bare-metal, or containerized).
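A quick way to find the conflicting process is with standard Linux tools (nothing DCGM-specific; lsof may need to be installed separately):

```shell
# Which process is bound to the exporter's default port 9400?
sudo ss -ltnp 'sport = :9400'

# Alternative, if lsof is installed
sudo lsof -i :9400

# Any hostengine or exporter instances already running?
pgrep -a nv-hostengine
pgrep -a dcgm-exporter
```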
Hi, thank you for your reply. But I started it through docker-compose and couldn't collect these metrics. Even if I configure DCGM_FI_PROF_DRAM_ACTIVE, DCGM_FI_PROF_PIPE_FP32_ACTIVE, and DCGM_FI_PROF_PIPE_FP16_ACTIVE, it still reports an error:
INFO[0000] Not collecting DCP metrics: This request is serviced by a module of DCGM that is not currently loaded
INFO[0000] Falling back to metric file 'default-counters.csv'
WARN[0000] Skipping line 25 ('DCGM_FI_PROF_DRAM_ACTIVE'): metric not enabled
WARN[0000] Skipping line 26 ('DCGM_FI_PROF_PIPE_FP32_ACTIVE'): metric not enabled
WARN[0000] Skipping line 27 ('DCGM_FI_PROF_PIPE_FP16_ACTIVE'): metric not enabled
WARN[0000] Skipping line 28 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): metric not enabled
This is the information displayed by nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.147.05    Driver Version: 525.147.05    CUDA Version: 12.3   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:00:08.0 Off |                  N/A |
|  0%   24C    P8    15W / 350W |  22204MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:00:09.0 Off |                  N/A |
|  0%   25C    P8    17W / 350W |  19814MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce ...  Off  | 00000000:00:0A.0 Off |                  N/A |
|  0%   24C    P8    16W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce ...  Off  | 00000000:00:0B.0 Off |                  N/A |
|  0%   25C    P8    20W / 350W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
@jacksonyi0, The profiling module that handles the DCGM_FI_PROF_* metrics does not support consumer-grade GPUs (GeForce GTX/RTX). These metrics, known as DCP (Data Center Profiling) metrics, require datacenter-grade GPUs (V10x/A10x/H10x) or workstation-grade GPUs (previously known as Quadro).
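As a rough illustration of that rule, one could screen GPU product names (e.g. from `nvidia-smi --query-gpu=name --format=csv,noheader`) before enabling DCP counters. Note the prefix list below is an illustrative assumption based on this thread, not NVIDIA's official support matrix:

```python
# Heuristic sketch only: GeForce-branded boards lack the DCGM profiling
# module, while datacenter (V100/A100/H100 class) and workstation parts
# generally have it. This prefix list is an assumption for illustration.
CONSUMER_PREFIXES = ("NVIDIA GeForce", "GeForce")

def likely_supports_dcp(gpu_name: str) -> bool:
    """Return False for product names that look like consumer GeForce cards."""
    return not gpu_name.strip().startswith(CONSUMER_PREFIXES)

if __name__ == "__main__":
    for name in ("NVIDIA GeForce RTX 3090", "NVIDIA A100-SXM4-40GB"):
        print(name, "->", likely_supports_dcp(name))
```

This only flags the obvious consumer case; architecture limits (e.g. pre-Volta datacenter parts like the K80 mentioned below) still apply on top of it.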
Thank you for your reply, then I will have to think of other solutions.
Hello, I use

docker run -d --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.6-3.4.2-ubuntu22.04

to start the container, and it fails with an error message:

readlink: missing operand
Try 'readlink --help' for more information.

Entering the container through

docker run -ti --entrypoint=/bin/sh --gpus all --rm -p 9400:9400 nvcr.io/nvidia/k8s/dcgm-exporter:3.3.5-3.4.1-ubuntu22.04

and running

bash /usr/local/dcgm/dcgm-exporter-entrypoint.sh

still reports the same error:

readlink: missing operand
Try 'readlink --help' for more information.

Running the /usr/bin/dcgm-exporter command directly reports:

runtime/cgo: pthread_create failed: Operation not permitted
SIGABRT: abort
PC=0x7f33397539fc m=0 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: g 0: unknown pc 0x7f33397539fc
stack: frame={sp:0x7ffdbe6fa820, fp:0x0} stack=[0x7ffdbdefbda0,0x7ffdbe6fadb0)
[raw stack memory dump omitted]
runtime: g 0: unknown pc 0x7f33397539fc

What is causing this problem? Please help.
Hello,
dcgmi version: 2.2.9
I built dcgm-exporter from source and am running it on a single GPU (Tesla K80). I can't seem to get profiling metrics to show up, though other metrics show up fine.
It looks like the profiling module fails to load:
Though I'm not sure whether this is attributable to dcgm-exporter or to DCGM itself, because I can't get the metrics to load even when using dcgmi directly:
I've directly followed the instructions to build dcgm-exporter from source, and the service runs inside a sidecar container that is responsible for collecting metrics.
How can I enable the collection of profiling metrics?