NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

How to get the module profile loaded? #132

Open jxh314 opened 11 months ago

jxh314 commented 11 months ago

Hello! When I run the command dcgmi dmon -e 1001, the result is:

Error setting watches. Result: -33: This request is serviced by a module of DCGM that is not currently loaded

And when I check the module list with dcgmi modules -l, it shows that the Profiling module failed to load. May I ask why, and how I can get the Profiling module loaded? Thanks a lot.

+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Not loaded                                       |
+-----------+--------------------+--------------------------------------------------+
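For anyone who wants to check the module state from a script rather than by eye, here is a minimal sketch in Python that picks the "Failed to load" rows out of the table printed by dcgmi modules -l. The function names and the regular expression are my own and assume the table layout shown in this thread; they are not part of DCGM.

```python
import re

# Matches a data row such as "| 8 | Profiling | Failed to load |".
# The first cell must be numeric, so header and border rows are ignored.
ROW = re.compile(r"\|\s*(\d+)\s*\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|")


def parse_modules(table_text):
    """Return (module_id, name, state) tuples found in the table text."""
    return [(int(m.group(1)), m.group(2), m.group(3))
            for m in ROW.finditer(table_text)]


def failed_modules(table_text):
    """Return the names of modules whose state is 'Failed to load'."""
    return [name for _, name, state in parse_modules(table_text)
            if state == "Failed to load"]


# A few rows copied from the output above, used here as sample input.
SAMPLE = """\
| Module ID | Name               | State                                            |
| 0         | Core               | Loaded                                           |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Not loaded                                       |
"""

if __name__ == "__main__":
    print(failed_modules(SAMPLE))  # prints ['Profiling']
```

On a real host the text could come from subprocess.run(["dcgmi", "modules", "-l"], capture_output=True, text=True).stdout instead of SAMPLE.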

nikkon-dev commented 11 months ago

@jxh314,

There are a few limitations with the profiling module for GPUs. Firstly, it is not open-sourced. If you want to use it, you must obtain the module from the official DCGM packages, as it cannot be built from sources. Secondly, the profiling module only supports Tesla/Quadro grade GPUs and not RTX/GTX GPUs.

If you are using the module from the official datacenter-gpu-manager packages and still facing issues with loading it, please provide the debug logs from the nv-hostengine by running the command nv-hostengine -f host.debug.log --log-level debug. This is particularly relevant if you have V100/A100 GPUs.

jxh314 commented 11 months ago

@nikkon-dev Thanks a lot!

Xaraxia commented 7 months ago

I'm having this issue on an A100 with MIG. The profiling module loads on an identical system without MIG. I'm on DCGM 3.1.7 and saw nothing in the release notes to suggest this has been fixed since. If there's a fix, I will seek approval for an out-of-downtime upgrade. But since you asked for a debug log, here's a debug log!

2024-02-22 13:03:43.501 DEBUG [4042437:4042440] DoOneUpdateAllFields returned 1708571043208144 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:2409] [DcgmCacheManager::UpdateAllFields]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (8, 0x7fc255921070) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/entry_point.h:912] [dcgmModuleIdToName]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] Returning 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/entry_point.h:912] [dcgmModuleIdToName]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] [[Profiling]] Initialized logging for module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/modules/DcgmModule.h:90] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] [[Profiling]] __DCGM_PROF_NO_SKU_CHECK was NOT set. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:582] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]
2024-02-22 13:03:43.514 DEBUG [4042437:4042440] [[Profiling]] NVPW_InitializeTarget() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1365] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_DCGM_LoadDriver() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1366] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_InitializeHost() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1367] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_GetDeviceCount() was successful [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1380] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetPciBusIds() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1397] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 7 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 9 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 9 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 10 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 10 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 11 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 11 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 12 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 12 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 13 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 13 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 7 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 9 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 9 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 10 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 10 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 11 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 11 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 12 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 12 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 13 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 13 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 7 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 9 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 9 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 10 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 10 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 11 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 11 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 12 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 12 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 13 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 13 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping a GPU unknown to LOP [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1457] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping a GPU unknown to LOP [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1457] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping a GPU unknown to LOP [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1457] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] No GPU with LOP support were found. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1551] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:502] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:1734] [DcgmHostEngineHandler::LoadModule]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3536] [DcgmHostEngineHandler::WatchFieldGroup]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] Got 3 entities and 1 fields [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3578] [DcgmHostEngineHandler::UnwatchFieldGroup]

nikkon-dev commented 7 months ago

@Xaraxia,

Could you provide the nvidia-smi and nvidia-smi -q output?

Please kindly provide more information about your setup. Specifically, let us know if you are running nv-hostengine on a bare-metal host or inside a container. From the logs you provided, it appears that dcgm can detect some MIG instances but not all devices. Please note that for MIG configurations to work properly, dcgm needs to run on the host and access the entire device rather than just individual MIG instances. Also, it's important to ensure that CUDA_VISIBLE_DEVICES is not set for dcgm.
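To make a long nv-hostengine debug log easier to read, a small Python sketch like the one below can count the "Skipping MIG CI id" messages per PCI bus and collect the ERROR lines. The regular expressions assume the line format seen in the log posted in this thread; the function name summarize is hypothetical, and other DCGM versions may format their logs differently.

```python
import re
import sys
from collections import Counter

# Date, time, level, [pid:tid], then the message up to the " [/..." source-path suffix.
LINE = re.compile(r"^\S+ \S+ ([A-Z]+)\s+\[\d+:\d+\] (.*?)(?: \[/.*)?$")
SKIP = re.compile(r"Skipping MIG CI id: (\d+) \(device: \d+, bus: (\d+), domain: \d+\)")


def summarize(lines):
    """Return (skipped MIG compute instances per bus, list of ERROR messages)."""
    skipped_per_bus = Counter()
    errors = []
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # not a log line in the expected format
        level, message = match.groups()
        skip = SKIP.search(message)
        if skip:
            skipped_per_bus[int(skip.group(2))] += 1
        if level == "ERROR":
            errors.append(message)
    return skipped_per_bus, errors


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python summarize_log.py host.debug.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        skipped, errors = summarize(handle)
    for bus, count in sorted(skipped.items()):
        print(f"bus {bus}: {count} MIG compute instances skipped")
    for message in errors:
        print("ERROR:", message)
```

Run against the log above, this would group the skipped compute instances under buses 33, 129 and 226 and list the ERROR messages, starting with "No GPU with LOP support were found."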

Xaraxia commented 7 months ago

Hi nikkon-dev,

This is running on a bare-metal host as root using the nvidia systemd service. HPC jobs can see and use the MIG instances.

root        5380       1 13 Feb06 ?        2-14:13:45 /usr/local/sbin/dcgm-exporter
root     4059539       1  0 Feb22 ?        00:00:37 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
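One way to confirm that CUDA_VISIBLE_DEVICES is not set for the running nv-hostengine is to read the environment the process was started with from /proc. The Python sketch below assumes a Linux host and enough privileges to read /proc/<pid>/environ (typically root or the process owner); the helper names are hypothetical.

```python
from pathlib import Path


def parse_environ(raw):
    """Decode the NUL-separated KEY=VALUE pairs found in /proc/<pid>/environ."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def cuda_visible_devices(pid):
    """Return the CUDA_VISIBLE_DEVICES value of a process, or None if it is unset."""
    raw = Path(f"/proc/{pid}/environ").read_bytes()
    return parse_environ(raw).get("CUDA_VISIBLE_DEVICES")
```

With the PID from the ps output above, cuda_visible_devices(4059539) would return None if the variable was not set when nv-hostengine started.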

bun005-smi.txt

Saigut commented 6 months ago

Hello, here is the output on an NVIDIA GeForce RTX 4090. The RTX 4090 can't load the Profiling module, right?

# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+

jxh314 commented 6 months ago

It appears to be the case. I get the same result as you when running the dcgmi modules -l command. @Saigut