Open jxh314 opened 11 months ago
@jxh314,
There are a few limitations with the profiling module for GPUs. Firstly, it is not open-sourced. If you want to use it, you must obtain the module from the official DCGM packages, as it cannot be built from sources. Secondly, the profiling module only supports Tesla/Quadro grade GPUs and not RTX/GTX GPUs.
If you are using the module from the official datacenter-gpu-manager packages and are still having trouble loading it, please provide the debug logs from nv-hostengine by running the command nv-hostengine -f host.debug.log --log-level debug. This is particularly relevant if you have V100/A100 GPUs.
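Once host.debug.log exists, it can help to pull out only the Profiling module's error lines before posting. The following is a minimal sketch, not part of DCGM: the helper name `profiling_errors` and the line layout assumed by the regular expression (`<date> <time> <LEVEL> [pid:tid] <message>`) are assumptions inferred from the log lines posted later in this thread.

```python
import re

# Assumed layout (inferred from the log in this thread, not from DCGM docs):
# "<date> <time> <LEVEL> [pid:tid] <message ...>"
LINE_RE = re.compile(r"^(\S+ \S+) (\w+) \[\d+:\d+\] (.*)$")


def profiling_errors(log_text):
    """Return (timestamp, message) pairs for ERROR lines tagged [[Profiling]]."""
    hits = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        timestamp, level, message = match.groups()
        if level == "ERROR" and "[[Profiling]]" in message:
            hits.append((timestamp, message))
    return hits


# Three lines copied from the debug log posted in this thread.
SAMPLE_LOG = "\n".join([
    "2024-02-22 13:03:43.501 DEBUG [4042437:4042440] [[Profiling]] "
    "__DCGM_PROF_NO_SKU_CHECK was NOT set. "
    "[/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/"
    "DcgmModuleProfiling.cpp:582] "
    "[DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]",
    "2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] "
    "No GPU with LOP support were found. "
    "[/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/"
    "DcgmModuleProfiling.cpp:1551] "
    "[DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]",
    "2024-02-22 13:03:43.997 ERROR [4042437:4042440] Failed to load module 8 "
    "[/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/"
    "DcgmHostEngineHandler.cpp:1734] [DcgmHostEngineHandler::LoadModule]",
])

if __name__ == "__main__":
    for timestamp, message in profiling_errors(SAMPLE_LOG):
        print(timestamp, message)
```

To run it against a real file, pass `open("host.debug.log").read()` instead of `SAMPLE_LOG`. Errors logged without the `[[Profiling]]` tag, such as the "Failed to load module 8" line, are skipped on purpose, so widen the filter if you want those too.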
@nikkon-dev Thanks a lot!
I'm having this issue on A100 with MIG. Profiler will load on an identical system but without MIG. I'm on dcgm 3.1.7, saw nothing in the release notes to suggest this was fixed since. If there's a fix, I will seek approval for an out-of-downtime upgrade. But since you asked for a debug log, here's a debug log!
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] DoOneUpdateAllFields returned 1708571043208144 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmCacheManager.cpp:2409] [DcgmCacheManager::UpdateAllFields]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] Entering dcgmModuleIdToName(dcgmModuleId_t id, char const **name) (8, 0x7fc255921070) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/entry_point.h:912] [dcgmModuleIdToName]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] Returning 0 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/entry_point.h:912] [dcgmModuleIdToName]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] [[Profiling]] Initialized logging for module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/modules/DcgmModule.h:90] [DcgmModuleWithCoreProxy<moduleId>::DcgmModuleWithCoreProxy]
2024-02-22 13:03:43.501 DEBUG [4042437:4042440] [[Profiling]] __DCGM_PROF_NO_SKU_CHECK was NOT set. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:582] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::ReadEnvironmentalVariables]
2024-02-22 13:03:43.514 DEBUG [4042437:4042440] [[Profiling]] NVPW_InitializeTarget() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1365] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_DCGM_LoadDriver() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1366] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_InitializeHost() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1367] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_GetDeviceCount() was successful [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1380] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetPciBusIds() was successful. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1397] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 7 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 9 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 9 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 10 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 10 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 11 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 11 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 12 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 12 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 13 (device: 0, bus: 33, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 13 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 7 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 9 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 9 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 10 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 10 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 11 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 11 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 12 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 12 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 13 (device: 0, bus: 129, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 13 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 7 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 9 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 9 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 10 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 10 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 11 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 11 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 12 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 12 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] NVPW_Device_GetMigAttributes was successful (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1419] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping MIG CI id: 13 (device: 0, bus: 226, domain: 0) [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1425] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Reason: gpuInstanceId = 13 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1426] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping a GPU unknown to LOP [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1457] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping a GPU unknown to LOP [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1457] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] [[Profiling]] Skipping a GPU unknown to LOP [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1457] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] No GPU with LOP support were found. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:1551] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::InitLop]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgm_private/modules/profiling/DcgmModuleProfiling.cpp:502] [DcgmNs::Modules::Profiling::DcgmModuleProfiling::DcgmModuleProfiling]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] [[Profiling]] A runtime exception occured when creating module. Ex: DcgmModuleProfiling failed to initialize. See the logs. [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/modules/DcgmModule.h:146] [{anonymous}::SafeWrapper]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] Failed to load module 8 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:1734] [DcgmHostEngineHandler::LoadModule]
2024-02-22 13:03:43.997 ERROR [4042437:4042440] DCGM_PROFILING_SR_WATCH_FIELDS failed with -33 [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3536] [DcgmHostEngineHandler::WatchFieldGroup]
2024-02-22 13:03:43.997 DEBUG [4042437:4042440] Got 3 entities and 1 fields [/workspaces/dcgm-rel_dcgm_3_1-postmerge@2/dcgmlib/src/DcgmHostEngineHandler.cpp:3578] [DcgmHostEngineHandler::UnwatchFieldGroup]
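In the log above, MIG CI ids 7 through 13 are skipped on each of buses 33, 129 and 226, and three GPUs are then reported as unknown to LOP. A rough sketch for tallying that from a longer log follows. The function name `summarize_skips` is hypothetical, the patterns are copied from the message text above, and the sample lines are trimmed to the message part only.

```python
import re
from collections import defaultdict

# Pattern taken from the "Skipping MIG CI id" messages in the log above.
SKIP_RE = re.compile(
    r"Skipping MIG CI id: (\d+) \(device: (\d+), bus: (\d+), domain: (\d+)\)"
)
UNKNOWN_GPU_TEXT = "Skipping a GPU unknown to LOP"


def summarize_skips(log_text):
    """Return ({bus: [skipped CI ids]}, count of 'unknown to LOP' lines)."""
    skipped = defaultdict(list)
    unknown = 0
    for line in log_text.splitlines():
        match = SKIP_RE.search(line)
        if match:
            ci_id, _device, bus, _domain = (int(g) for g in match.groups())
            skipped[bus].append(ci_id)
        elif UNKNOWN_GPU_TEXT in line:
            unknown += 1
    return dict(skipped), unknown


# Message fragments from the log above (timestamps and source paths trimmed).
SAMPLE_MESSAGES = "\n".join([
    "[[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 33, domain: 0)",
    "[[Profiling]] Reason: gpuInstanceId = 7",
    "[[Profiling]] Skipping MIG CI id: 8 (device: 0, bus: 33, domain: 0)",
    "[[Profiling]] Skipping MIG CI id: 7 (device: 0, bus: 129, domain: 0)",
    "[[Profiling]] Skipping a GPU unknown to LOP",
])

if __name__ == "__main__":
    per_bus, unknown_gpus = summarize_skips(SAMPLE_MESSAGES)
    for bus, ci_ids in sorted(per_bus.items()):
        print(f"bus {bus}: skipped CI ids {ci_ids}")
    print(f"GPUs unknown to LOP: {unknown_gpus}")
```

Because the patterns use `search` rather than `match`, the same function should also work on full, untrimmed log lines.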
@Xaraxia,
Could you provide the nvidia-smi and nvidia-smi -q output?
Please kindly provide more information about your setup. Specifically, let us know if you are running nv-hostengine on a bare-metal host or inside a container. From the logs you provided, it appears that dcgm can detect some MIG instances but not all devices. Please note that for MIG configurations to work properly, dcgm needs to run on the host and access the entire device rather than just individual MIG instances. Also, it's important to ensure that CUDA_VISIBLE_DEVICES is not set for dcgm.
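One way to check the CUDA_VISIBLE_DEVICES point for an already running nv-hostengine is to read the process environment from Linux procfs. This is a sketch under assumptions: the helper names `parse_environ` and `cuda_visible_devices_of` are made up for illustration, `/proc/<pid>/environ` is NUL-separated and only shows the environment the process was started with, and reading it for another user's process normally requires root.

```python
def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" not in entry:
            continue  # skips the empty trailing entry and malformed items
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def cuda_visible_devices_of(pid):
    """Return CUDA_VISIBLE_DEVICES for a running process, or None if unset."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read()).get("CUDA_VISIBLE_DEVICES")


# Hypothetical environment blob, for illustration only.
SAMPLE_ENVIRON = b"PATH=/usr/bin\0CUDA_VISIBLE_DEVICES=0\0LANG=C\0"

if __name__ == "__main__":
    print(parse_environ(SAMPLE_ENVIRON).get("CUDA_VISIBLE_DEVICES"))
```

On the host, call `cuda_visible_devices_of(pid)` with the PID of nv-hostengine. A return value of `None` means the variable was not set when the process started.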
Hi nikkon-dev,
This is running on a bare-metal host as root using the nvidia systemd service. HPC jobs can see and use the MIG instances.
root 5380 1 13 Feb06 ? 2-14:13:45 /usr/local/sbin/dcgm-exporter
root 4059539 1 0 Feb22 ? 00:00:37 /usr/bin/nv-hostengine -n --service-account nvidia-dcgm
Hello, here is the output on an NVIDIA GeForce RTX 4090. The RTX 4090 can't load the Profiling module, right?
# dcgmi modules -l
+-----------+--------------------+--------------------------------------------------+
| List Modules                                                                      |
| Status: Success                                                                   |
+===========+====================+==================================================+
| Module ID | Name               | State                                            |
+-----------+--------------------+--------------------------------------------------+
| 0         | Core               | Loaded                                           |
| 1         | NvSwitch           | Loaded                                           |
| 2         | VGPU               | Not loaded                                       |
| 3         | Introspection      | Not loaded                                       |
| 4         | Health             | Not loaded                                       |
| 5         | Policy             | Not loaded                                       |
| 6         | Config             | Not loaded                                       |
| 7         | Diag               | Not loaded                                       |
| 8         | Profiling          | Failed to load                                   |
| 9         | SysMon             | Failed to load                                   |
+-----------+--------------------+--------------------------------------------------+
It appears to be the case. I get the same result as you when running the dcgmi modules -l command.
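For scripted checks, the table printed by dcgmi modules -l can be scanned for modules in the "Failed to load" state. The sketch below assumes the pipe-delimited layout shown in this thread. The helpers `parse_modules` and `failed_modules` are hypothetical names, not part of DCGM, and the sample rows are copied from the output above.

```python
def parse_modules(table_text):
    """Map module name -> state from the text table of `dcgmi modules -l`."""
    modules = {}
    for line in table_text.splitlines():
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        # Data rows have three cells and a numeric module ID in the first one;
        # borders, the title rows, and the column header are ignored.
        if len(cells) == 3 and cells[0].isdigit():
            modules[cells[1]] = cells[2]
    return modules


def failed_modules(table_text):
    """Return the names of modules whose state is 'Failed to load'."""
    return [
        name
        for name, state in parse_modules(table_text).items()
        if state == "Failed to load"
    ]


# Rows copied from the output posted in this thread.
SAMPLE_TABLE = """\
| List Modules |
| Status: Success |
| Module ID | Name | State |
| 0 | Core | Loaded |
| 7 | Diag | Not loaded |
| 8 | Profiling | Failed to load |
| 9 | SysMon | Failed to load |
"""

if __name__ == "__main__":
    print(failed_modules(SAMPLE_TABLE))
```

The parser strips whitespace around each cell, so it should accept both the aligned table and the collapsed form that appears when the output is pasted as plain text.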
@Saigut
Hello! When I run the command below
dcgmi dmon -e 1001
the result is [output not included]. And when I check the module list with
dcgmi modules -l
it shows that the Profiling module failed to load. May I ask why, and how I can get the Profiling module loaded? Thanks a lot.