ligeweiwu opened this issue 1 year ago
@ligeweiwu,
You need to start the nv-hostengine process. For your layout that would be:
sudo ./DCGM/_out/Linux-amd64-debug/bin/nv-hostengine -f host.log --log-level debug
The nv-hostengine should be run as root; otherwise, some functionality will not be available.
The provided command starts the nv-hostengine daemon process, which writes debug logs into the host.log file.
If you want to debug the nv-hostengine itself, it's better to add the -n argument, which prevents the nv-hostengine from daemonizing.
Alternatively, you could generate .deb/.rpm packages and install them using the package manager, but that is a very inconvenient way to debug.
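Roughly, the foreground debugging workflow looks like this (paths assume the layout above; the dcgmi call is only an illustrative sanity check):
# terminal 1: run the host engine in the foreground with debug logging
sudo ./DCGM/_out/Linux-amd64-debug/bin/nv-hostengine -n -f host.log --log-level debug
# terminal 2: verify the engine answers on the default local port
./DCGM/_out/Linux-amd64-debug/bin/dcgmi discovery -l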
@nikkon-dev It works. Thanks for your help!
@nikkon-dev Hi, I have another question. Now I have added the GPU to a group (Group 2) successfully. When I enter "./dcgmi diag -r 1 -g 2", it gives me the message "Couldn't parse json: 'WARNING: You must also provide env __NVVS_DBG_FILE='". How can I fix it? Below is my working environment. Thanks.
+-------------------+----------------------------------------------------------+
| GROUPS                                                                       |
| 3 groups found.                                                              |
+===================+==========================================================+
| Groups            |                                                          |
| -> 0              |                                                          |
|    -> Group ID    | 0                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
|    -> Entities    | GPU 0, GPU 1, GPU 2                                      |
| -> 1              |                                                          |
|    -> Group ID    | 1                                                        |
|    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
|    -> Entities    | None                                                     |
| -> 2              |                                                          |
|    -> Group ID    | 2                                                        |
|    -> Group Name  | GPU_GROUP                                                |
|    -> Entities    | GPU 2                                                    |
+-------------------+----------------------------------------------------------+
@nikkon-dev By the way, I built the debug version of DCGM with the command "./build.sh -d -c". After that, it generated the _out/Linux-amd64-debug folder. Thanks.
@ligeweiwu,
Most likely, the nv-hostengine cannot find where the nvvs binary is located. It tries to find it in the default location where the package manager installs it.
To override that logic, you need to set the environment variable NVVS_BIN_PATH to the full path of the nvvs location in _out/Linux-amd64-debug/share/nvidia-validation-suite before running the nv-hostengine process.
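A minimal sketch (the path is only illustrative; substitute your actual checkout location), keeping in mind that the variable has to be visible to the user that actually launches nv-hostengine:
# illustrative path -- adjust to your _out location
export NVVS_BIN_PATH=/path/to/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite
./nv-hostengine -n -f host.log --log-level debug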
@nikkon-dev Hi, thanks for your help. Now I have another question: when I enter "sudo ./nv-hostengine -n -f host.log --log-level debug" again, it gives me an error: "Err: Failed to start DCGM Server: -7 User defined signal 1". How can I fix it? Thanks
@ligeweiwu,
-7 stands for DCGM_ST_INIT_ERROR = -7, //!< DCGM Init error
and it's hard to tell what is wrong without the debug logs (the host.log in your command).
@nikkon-dev Hi, below is a snapshot of host.log. By the way, this error occurs after I enter "export NVVS_BIN_PATH=<full path>"; when I then enter "sudo ./nv-hostengine -n -f host.log --log-level debug", this error happens. Thanks.
Hi, this is the content of debug log. 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 1, fieldId 513, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101f500000002 (eg 1, entityId 2, fieldId 501) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher] 2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers] 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 501, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101fd00000002 (eg 1, entityId 2, fieldId 509) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher] 2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers] 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 509, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101fe00000002 (eg 1, entityId 2, fieldId 510) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher] 2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers] 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 510, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x101ff00000002 (eg 1, entityId 2, fieldId 511) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher] 2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers] 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 
2, fieldId 511, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x1020000000002 (eg 1, entityId 2, fieldId 512) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher] 2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers] 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 512, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding WatchInfo on entityKey 0x1020100000002 (eg 1, entityId 2, fieldId 513) [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2155] [DcgmCacheManager::GetEntityWatchInfo] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Adding new watcher type 1, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2901] [DcgmCacheManager::AddOrUpdateWatcher] 2022-12-02 10:41:26.277 DEBUG [83727:83727] UpdateWatchFromWatchers minMonitorFreqUsec 3600000000, minMaxAgeUsec 14400000000, hsw 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2948] [DcgmCacheManager::UpdateWatchFromWatchers] 2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 2, fieldId 513, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Skipping waitForUpdate since the cache manager thread is not running yet. [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2394] [DcgmCacheManager::UpdateAllFields] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Added field group id 3, name DCGM_INTERNAL_JOB, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmFieldGroup.cpp:172] [DcgmFieldGroupManager::AddFieldGroup] 2022-12-02 10:41:26.277 INFO [83727:83727] Created thread named "cache_mgr_main" ID 1854199552 DcgmThread ptr 0x0xe95a10 [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:110] [DcgmThread::Start] 2022-12-02 10:41:26.277 DEBUG [83727:83727] Skipping waitForUpdate since the cache manager thread is not running yet. 
[/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2394] [DcgmCacheManager::UpdateAllFields] 2022-12-02 10:41:26.278 DEBUG [83727:83727] dcgmStartEmbedded(): Embedded host engine started [/workspaces/HGDCGM/dcgmlib/src/DcgmApi.cpp:4826] [{anonymous}::StartEmbeddedV2] 2022-12-02 10:41:26.278 DEBUG [83727:83732] Thread handle 1854199552 running [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:299] [DcgmThread::RunInternal] 2022-12-02 10:41:26.278 DEBUG [83727:83727] Entering dcgmEngineRun(unsigned short portNumber, char const *socketPath, unsigned int isConnectionTCP) (5555 127.0.0.1 1) [/workspaces/HGDCGM/dcgmlib/entry_point.h:73] [dcgmEngineRun] 2022-12-02 10:41:26.278 INFO [83727:83732] Cache manager update thread starting [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:6053] [DcgmCacheManager::run] 2022-12-02 10:41:26.278 DEBUG [83727:83732] Preparing to update watchInfo 0xeaf500, eg 1, eid 1, fieldId 512 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:5278] [DcgmCacheManager::ActuallyUpdateAllFields] 2022-12-02 10:41:26.278 INFO [83727:83727] Created thread named "dcgm_ipc" ID 1845806848 DcgmThread ptr 0x0xe94610 [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:110] [DcgmThread::Start] 2022-12-02 10:41:26.278 DEBUG [83727:83732] Checking status for gpu 1 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2275] [DcgmCacheManager::GetGpuStatus] 2022-12-02 10:41:26.278 DEBUG [83727:83733] Thread handle 1845806848 running [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:299] [DcgmThread::RunInternal] 2022-12-02 10:41:26.278 DEBUG [83727:83732] Appended entity blob eg 1, eid 1, fieldId 512, ts 1669948886278497, valueSize 2048, cached 1, buffered 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:6452] [DcgmCacheManager::AppendEntityBlob] 2022-12-02 10:41:26.278 ERROR [83727:83733] bind failed. port 5555, address 127.0.0.1, errno 98 [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:326] [DcgmIpc::InitTCPListenerSocket] 2022-12-02 10:41:26.278 ERROR [83727:83733] InitTCPListenerSocket() returned Generic unspecified error [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:195] [DcgmIpc::run] 2022-12-02 10:41:26.278 DEBUG [83727:83732] Preparing to update watchInfo 0xeaf360, eg 1, eid 1, fieldId 510 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:5278] [DcgmCacheManager::ActuallyUpdateAllFields] 2022-12-02 10:41:26.278 ERROR [83727:83727] initFuture returned Generic unspecified error [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:131] [DcgmIpc::Init] 2022-12-02 10:41:26.278 DEBUG [83727:83732] Checking status for gpu 1 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2275] [DcgmCacheManager::GetGpuStatus] 2022-12-02 10:41:26.278 ERROR [83727:83727] Got error Generic unspecified error from m_dcgmIpc.Init [/workspaces/HGDCGM/dcgmlib/src/DcgmHostEngineHandler.cpp:3767] [DcgmHostEngineHandler::RunServer] 2022-12-02 10:41:26.278 DEBUG [83727:83733] Thread id 1845806848 stopped [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:308] [DcgmThread::RunInternal] 2022-12-02 10:41:26.278 DEBUG [83727:83727] Returning -7 [/workspaces/HGDCGM/dcgmlib/entry_point.h:73] [dcgmEngineRun]
Thanks.
@ligeweiwu,
That's an easy one: you either already have an nv-hostengine process running, or something else is listening on TCP port 5555.
2022-12-02 10:41:26.278 ERROR [83727:83733] bind failed. port 5555, address 127.0.0.1, errno 98
Normally, when nv-hostengine runs, it creates a PID file that lets it detect later whether another instance is already running. That may not be true for debug builds.
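A quick way to check, assuming standard Linux tooling is available on the host:
# errno 98 is EADDRINUSE: something already owns 127.0.0.1:5555
sudo ss -ltnp | grep 5555      # or: sudo lsof -i :5555
pgrep -a nv-hostengine         # is another host engine instance still running?
sudo ./nv-hostengine --term    # if available, --term asks an already-running daemon to stop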
@nikkon-dev Thanks for your help. That error has been solved. But when I set the environment variable ("export NVVS_BIN_PATH=XXX/GDCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/"), start the nv-hostengine service, and then run "./dcgmi diag -r 1 -g 2" (the GPU group has already been created), it still gives me the message: Couldn't parse json: 'WARNING: You must also provide env __NVVS_DBG_FILE='. By the way, when I install the deb package, this error does not happen. So is there some other procedure that I missed? Thanks
@ligeweiwu,
In the debug nv-hostengine logs, look for the NVVS-related lines. Also, there may be nvvs.log file next to the nvvs binary.
@nikkon-dev Hi, these are the NVVS-related lines.
2022-12-01 22:01:31.184 DEBUG [1909:1911] [[Diag]] Unknown subcommand: 1 [/workspaces/HGDCGM/modules/diag/DcgmModuleDiag.cpp:144] [DcgmModuleDiag::ProcessCoreMessage]
2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] External command stdout: WARNING: You must also provide env __NVVS_DBG_FILE= [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:676] [DcgmDiagManager::PerformExternalCommand]
2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] External command stderr: { "DCGM GPU Diagnostic" : { "runtime_error" : "Unable to get the driver version: Host engine connection invalid/disconnected. Couldn't succeed despite 0 retries.", "version" : "440.37" } } [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:677] [DcgmDiagManager::PerformExternalCommand]
2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] The external command '/usr/share/nvidia-validation-suite/nvvs' returned a non-zero exit code: 1 [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:705] [DcgmDiagManager::PerformExternalCommand]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] Failed to parse NVVS output: WARNING: You must also provide env __NVVS_DBG_FILE= [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:1145] [DcgmDiagManager::ValidateNvvsOutput]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] Error happened during JSON parsing of NVVS output: The GPU Diagnostic returned Json that cannot be parsed. [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:885] [DcgmDiagManager::RunDiag]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] NVVS stderr: { "DCGM GPU Diagnostic" : { "runtime_error" : "Unable to get the driver version: Host engine connection invalid/disconnected. Couldn't succeed despite 0 retries.", "version" : "440.37" } } [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:886] [DcgmDiagManager::RunDiag]
2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] RunDiagAndAction returned -40 [/workspaces/HGDCGM/modules/diag/DcgmModuleDiag.cpp:120] [DcgmModuleDiag::ProcessRun_v6]
2022-12-01 22:01:31.195 DEBUG [1909:2000] Sending message to 17 [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:1198] [DcgmIpc::SendMessageImpl]
I see there is a message "The external command '/usr/share/nvidia-validation-suite/nvvs' returned a non-zero exit code: 1 [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:705] [DcgmDiagManager::PerformExternalCommand]". That means it uses the location /usr/share/nvidia-validation-suite/nvvs, but my workspace is not /usr/share/nvidia-validation-suite/nvvs, and I also set the env variable NVVS_BIN_PATH to my workspace (export NVVS_BIN_PATH=XXX/GDCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/). Is there another env variable that I should set (e.g. __NVVS_DBG_FILE)? Thanks.
Do you set the NVVS_BIN_PATH for the root user or for your current user?
If you set export NVVS_BIN_PATH= and then run sudo ./nv-hostengine..., the env variable is not set for the root user that runs the nv-hostengine.
The log file should also mention the NVVS_BIN_PATH variable and its value.
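For example, either of these (sketches only; the path is illustrative, and sudo's environment handling depends on your sudo version and sudoers policy) keeps the variable visible to the root process:
# pass the variable explicitly through sudo
sudo NVVS_BIN_PATH=/path/to/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite ./nv-hostengine -n -f host.log --log-level debug
# or preserve an already-exported variable
export NVVS_BIN_PATH=/path/to/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite
sudo --preserve-env=NVVS_BIN_PATH ./nv-hostengine -n -f host.log --log-level debug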
@nikkon-dev Hi nikkon, (1) I do not find NVVS_BIN_PATH in the log file. (2) I had set export NVVS_BIN_PATH= and then run sudo ./nv-hostengine...; when I run ./nv-hostengine... without sudo, it works fine.
Thanks for your help.
@nikkon-dev Hi nikkon, sorry to bother you again, but I still have some problems with the debug version of DCGM. Now I can run the diag test on the corresponding GPU group, but when I use gdb to step into the source code, it cannot find the source code correctly. For example:
@nikkon-dev Another example: if in GDB I input "b CommandLineParser.cpp:99" (the entry method into this class for a given command line provided by main()), GDB gives the feedback: "Breakpoint at 0x44b61e: file /opt/cross/x86_64-linux-gnu/include/tclap/CommandLineParser.cpp, line 165."
The breakpoint GDB reports is not the one that I set, so I think the debug symbols of dcgmi may be corrupted? Is there anything else that I should add to the command "./build.sh -d -c"?
Thanks
@ligeweiwu,
There are several steps that you need to take to be able to debug:
If you were using the buildcontainer to build DCGM, the source location differs from your host. You will need to tell gdb the right place to find the sources and how to convert the buildcontainer path to your host path. Something like this should be added to a .gdbinit file:
directory /home/your_host_user/src/DCGM
set substitute-path /workspaces/dcgm /home/your_host_user/src/DCGM
The buildcontainer mounts the sources to /workspaces/dcgm inside the build container.
Running dcgmi diag would not allow you to debug the diagnostic. In a nutshell, when you run dcgmi diag, it connects to nv-hostengine to send the commands, and nv-hostengine runs the nvvs binary to perform the actual diagnostic. So the path is dcgmi -> nv-hostengine -> nvvs. You will need to debug the nv-hostengine and tell gdb to follow children on forks.
@nikkon-dev Hi nikkon, I tried it but there is still a problem. (1) My workspace is /mnt/ssd/liuzhenli.lzl/DCGM. (2) The executable path is /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin. (3) I added a .gdbinit file in /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin, and its content is:
directory /mnt/ssd/liuzhenli.lzl/DCGM
set substitute-path /workspaces/DCGM /mnt/ssd/liuzhenli.lzl/DCGM
(4) Now I just want to test the CommandLineParser function, e.g. in CommandLineParser.cpp:
1307: dcgmReturn_t CommandLineParser::ProcessDiagCommandLine(int argc, char const *const *argv)
1308: {
1309:     // Check for stop diag request
1310:     const char *value = std::getenv(STOP_DIAG_ENV_VARIABLE_NAME);
1311:     if (value != nullptr)
So I do: (4.1) gdb --args ./dcgmi diag -r 1 -g 2 (I think CommandLineParser is handled by dcgmi, is that right?) (4.2) b CommandLineParser.cpp:1310, and then run in gdb.
(5): The gdb feedback is still: Breakpoint 1 at 0x408399: file /workspaces/DCGM/dcgmi/main_dcgmi.cpp, line 3419
(6) I also looked at build.sh in the DCGM repo, and the docker start command is:
docker run --rm -u "$(id -u)":"$(id -g)" \
    ${DOCKER_ARGS:-} \
    -v "${DIR}":"${REMOTE_DIR}" \
When I print ${DIR}, it is actually /mnt/ssd/liuzhenli.lzl/DCGM, and ${REMOTE_DIR} is /workspaces/DCGM. So I think I added set substitute-path correctly in the .gdbinit file. Is there anything that I am doing wrong?
By the way, when I add extra print statements in CommandLineParser.cpp and then run ./dcgmi diag -r 1 -g 2, it always prints the new info.
Sorry to bother you. Thanks
It's /workspaces/dcgm, not with capital letters. The path is case sensitive.
@nikkon-dev Hi nikkon, now this is the content of my .gdbinit file in the path /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin:
directory /mnt/ssd/liuzhenli.lzl/DCGM
set substitute-path /workspaces/dcgm /mnt/ssd/liuzhenli.lzl/DCGM
(I have changed DCGM to dcgm for /workspaces.)
However, when I run gdb (gdb --args ./dcgmi diag -r 1 -g 2, then b CommandLineParser.cpp:1310), it still gives me the feedback: "Breakpoint 1 at 0x408389: file /workspaces/DCGM/dcgmi/main_dcgmi.cpp, line 3419."
Maybe there is some problem with gdb stepping into the source code of the DCGM debug version?
By the way, my gdb version is GNU gdb (Ubuntu 8.2-0ubuntu1~16.04.1) 8.2, and when I start gdb on ./dcgmi it also gives warnings: "BFD: warning: /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin/../lib/libdcgm.so.3: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001 BFD: warning: /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin/../lib/libdcgm.so.3: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002"
Thanks for your help.
You need to update gdb to a version that understands the DWARF5 debug info format. DCGM is built with GCC 11, which uses DWARF5 by default. You need GDB 10 or newer to be able to read that debug info.
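One way to double-check on your side (readelf is part of binutils; the DWARF version is printed in the compilation unit header):
gdb --version                                      # should report 10.x or newer
readelf --debug-dump=info ./dcgmi | head -n 20     # look for "Version: 5"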
@nikkon-dev Hi nikkon, thanks for your help. I have another question. As you said, the provided command starts the nv-hostengine daemon process, which writes debug logs into the host.log file. For instance, if I run ./dcgmi diag -r 3 -g 2 and look at the content of the host.log file, it contains the debug logs of many functions (e.g. functions defined in DcgmCacheManager.cpp). In other words, is it fair to say that the host.log file covers almost all of the functions that are exercised by "./dcgmi diag -r 3 -g 2"? Thanks.
The host.log contains only the nv-hostengine part. The real work is done by the nvvs binary, which has its own log. For debugging needs, you could run nvvs directly.
@nikkon-dev Hi nikkon, the path is dcgmi -> nv-hostengine -> nvvs. In other words, the nvvs binary executes the diagnostic test on the corresponding GPU, and nv-hostengine can use NVML functions to monitor the performance metrics of the GPU (e.g. power, temperature, ECC). Is my understanding right? Thanks.
That's correct.
@nikkon-dev Hi nikkon, I have another question. Does DCGM support CUDA 12 now? When I update the CUDA driver and run the previous workflow again, it gives me an error:
_out/Linux-amd64-debug/share/nvidia-validation-suite/nvvs.log:78:2022-12-12 15:56:40.014 ERROR [29094:29094] Detected unsupported Cuda version: 12.0 [/workspaces/DCGM/nvvs/src/TestFramework.cpp:243] [TestFramework::GetPluginDirExtension]
Thanks
CUDA 12 will be supported in versions >= 3.1.5. We are going to release it at the beginning of this week.
@nikkon-dev Hello nikkon, now I am debugging "dcgmi diag -r 1". As you said before, I can run the nvvs binary directly for debugging purposes. However, when I run ./nvvs -h, I could not find the same test info as "dcgmi diag -r 1". Could you tell me how I can run nvvs directly to get the same effect as running "dcgmi diag -r 1"? Thank you very much.
That's --specifiedtest short.
-r 1 -> short
-r 2 -> long
-r 3 -> xlong
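So, as a sketch (run from the directory that contains the nvvs binary in your debug build; the path is illustrative), the rough equivalent of dcgmi diag -r 1 would be:
cd /path/to/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite
sudo ./nvvs --specifiedtest short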
@nikkon-dev Thanks for your help. It works.
@nikkon-dev Hello nikkon
I have another debugging issue. Now I am running the command "./nvvs --specifiedtest pcie". I found that GDB can follow the source code located in the nvvs folder, but for the source code in the pcie folder (e.g. PcieWrapper.cpp), gdb cannot step into it.
Let me give a more specific example. PluginLib.cpp has a callback invocation "m_runTestCB(timeout, numParameters, parms, m_userData);". GDB can step through the source code of PluginLib.cpp, but when I enter "s" on "m_runTestCB", I cannot follow the source code of RunTest, which is defined in PcieWrapper.cpp. I also tried to add other breakpoints in the pcie source code, but they do not work either.
Could you tell me how to fix it?
Thanks
In gdb, you need to set solib-search-path to the plugins/cudaXX directory, where XX is your driver's CUDA version: 10, 11, or 12.
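For example, something like this inside gdb (the path is illustrative; sharedlibrary just re-reads symbols if the plugin is already mapped):
set solib-search-path /path/to/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11
sharedlibrary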
@nikkon-dev Hi nikkon
I am building the DCGM source code and using nvvs/dcgmi to perform the diagnostic tests. I see all the plugin tests, and they are all .so files. But when I want to perform the "memory bandwidth" diagnostic, it gives me an error:
./dcgmi diag -r "memory bandwidth" -g 2 Error: requested test "memory bandwidth" was not found among possible test choices.
In my case, all the plugin .so files are in /username/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11, and none of them is named "memory bandwidth". I also looked at the source code, and I think it doesn't have an option named "memory bandwidth"; it only has "memtest".
So please tell me, how can I run "memory bandwidth" using the DCGM source code?
By the way, memtest is OK ("./dcgmi diag -r memtest -g 2" works fine, I see the corresponding libMemtest.so in plugins/cuda11, and the source code has the option "memtest").
Thanks
Hi, I am building the DCGM debug version from source code, and it has already been built successfully on a server. For instance, after building, it generates the executable file ./DCGM/_out/Linux-amd64-debug/bin/dcgmi. The dcgmi executable works for some simple operations, such as ./dcgmi --help or ./dcgmi -v. But for some more complex operations, such as ./dcgmi group --list, it generates: "Error: unable to establish a connection to the specified host: localhost" "Error: Unable to connect to host engine. Host engine connection invalid/disconnected."
Previously, I also installed the prebuilt version of DCGM on another server, and this problem also happened. But when I entered "sudo systemctl --now enable nvidia-dcgm", the DCGM service started and the problem was solved.
But now, when I compile DCGM from source code and run "sudo systemctl --now enable nvidia-dcgm", it generates "Failed to enable unit: Unit file nvidia-dcgm.service does not exist." or "Failed to execute operation: No such file or directory".
So please help me: how can I start the DCGM service when I build the DCGM debug version from source code?
The GPU environment on the server is shown below.