NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

How to enable the DCGM service when building the DCGM debug version from source #57

Open ligeweiwu opened 1 year ago

ligeweiwu commented 1 year ago

Hi, I am building the DCGM debug version from source, and it builds successfully on a server. For instance, the build generates an executable file ./DCGM/_out/Linux-amd64-debug/bin/dcgmi. The dcgmi executable works for simple invocations such as ./dcgmi --help or ./dcgmi -v. But for anything more involved, such as ./dcgmi group --list, it reports: "Error: unable to establish a connection to the specified host: localhost" "Error: Unable to connect to host engine. Host engine connection invalid/disconnected."

Previously, I also installed the prebuilt version of DCGM on another server, and the same problem happened there. But once I ran "sudo systemctl --now enable nvidia-dcgm", the DCGM service started and the problem was solved.

But now, when I compile DCGM from source and run "sudo systemctl --now enable nvidia-dcgm", it reports: "Failed to enable unit: Unit file nvidia-dcgm.service does not exist." or "Failed to execute operation: No such file or directory".

So please help me: how can I start the DCGM service when I build the DCGM debug version from source?

The GPU environment on the server is shown in the screenshot below. [screenshot]

nikkon-dev commented 1 year ago

@ligeweiwu,

You need to start the nv-hostengine process. For your layout that would be sudo ./DCGM/_out/Linux-amd64-debug/bin/nv-hostengine -f host.log --log-level debug. nv-hostengine should be run as root; otherwise some functionality will not be available. The command above starts the nv-hostengine daemon process, which writes debug logs into the host.log file. If you want to debug nv-hostengine itself, it's better to add the -n argument, which prevents nv-hostengine from daemonizing.
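
For reference, a minimal sequence based on the commands above (paths assume the same Linux-amd64-debug build layout; adjust to your checkout):

    # Start the debug-build host engine as root; it daemonizes and writes
    # debug-level logs to host.log:
    sudo ./DCGM/_out/Linux-amd64-debug/bin/nv-hostengine -f host.log --log-level debug

    # Or keep it in the foreground (no daemonization) while debugging nv-hostengine itself:
    # sudo ./DCGM/_out/Linux-amd64-debug/bin/nv-hostengine -n -f host.log --log-level debug

    # With the host engine running, connection-dependent dcgmi commands should work:
    ./DCGM/_out/Linux-amd64-debug/bin/dcgmi group --list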

Alternatively, you could generate .deb/.rpm packages and install them with the package manager, but that is very inconvenient for debugging.

ligeweiwu commented 1 year ago

@nikkon-dev It works. Thanks for your help!

ligeweiwu commented 1 year ago

@nikkon-dev Hi, I have another question. I have now added the GPU to the group (Group 2) successfully. But when I run "./dcgmi diag -r 1 -g 2", it gives me the message: "Couldn't parse json: 'WARNING: You must also provide env __NVVS_DBG_FILE='". How can I fix it? Below is my working environment. Thanks.

    +-------------------+----------------------------------------------------------+
    | GROUPS                                                                       |
    | 3 groups found.                                                              |
    +===================+==========================================================+
    | Groups            |                                                          |
    | -> 0              |                                                          |
    |    -> Group ID    | 0                                                        |
    |    -> Group Name  | DCGM_ALL_SUPPORTED_GPUS                                  |
    |    -> Entities    | GPU 0, GPU 1, GPU 2                                      |
    | -> 1              |                                                          |
    |    -> Group ID    | 1                                                        |
    |    -> Group Name  | DCGM_ALL_SUPPORTED_NVSWITCHES                            |
    |    -> Entities    | None                                                     |
    | -> 2              |                                                          |
    |    -> Group ID    | 2                                                        |
    |    -> Group Name  | GPU_GROUP                                                |
    |    -> Entities    | GPU 2                                                    |
    +-------------------+----------------------------------------------------------+

ligeweiwu commented 1 year ago

@nikkon-dev By the way, I build the debug version of DCGM with the command "./build.sh -d -c". After that, it generates the _out/Linux-amd64-debug folder. Thanks.

nikkon-dev commented 1 year ago

@ligeweiwu,

Most likely, nv-hostengine cannot find where the nvvs binary is located. It tries to find it in the default location where the package manager installs it. To override that logic, you need to set the environment variable NVVS_BIN_PATH to the full path of the nvvs location (_out/Linux-amd64-debug/share/nvidia-validation-suite) before running the nv-hostengine process.
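
For example (illustrative path; adjust to your checkout), a minimal sketch:

    # Point the host engine at the debug-build nvvs instead of the package default:
    export NVVS_BIN_PATH=/path/to/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite
    # ...then start nv-hostengine in the same environment (note the sudo caveat
    # discussed later in this thread)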

ligeweiwu commented 1 year ago

@nikkon-dev Hi, thanks for your help. Now I have another question. When I run "sudo ./nv-hostengine -n -f host.log --log-level debug" again, it gives me an error: "Err: Failed to start DCGM Server: -7 User defined signal 1". How can I fix it? Thanks

nikkon-dev commented 1 year ago

@ligeweiwu,

-7 stands for DCGM_ST_INIT_ERROR = -7, //!< DCGM Init error, and it's hard to tell what is wrong without the debug logs (the host.log in your command).

ligeweiwu commented 1 year ago

@nikkon-dev Hi, below is a snapshot of host.log. By the way, this error occurs after I run "export NVVS_BIN_PATH=<full path>"; when I then run "sudo ./nv-hostengine -n -f host.log --log-level debug", this error happens. [screenshot] Thanks.

ligeweiwu commented 1 year ago

Hi, this is the content of the debug log (the long run of repeated watch-setup entries is summarized below):

    2022-12-02 10:41:26.277 DEBUG [83727:83727] AddFieldWatch eg 1, eid 1, fieldId 513, mfu 3600000000, msa 14400, mka 4, sfu 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:3079] [DcgmCacheManager::AddEntityFieldWatch]
    [... repeated AddFieldWatch / GetEntityWatchInfo / AddOrUpdateWatcher / UpdateWatchFromWatchers entries for fieldIds 501-513 ...]
    2022-12-02 10:41:26.277 DEBUG [83727:83727] Skipping waitForUpdate since the cache manager thread is not running yet. [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2394] [DcgmCacheManager::UpdateAllFields]
    2022-12-02 10:41:26.277 DEBUG [83727:83727] Added field group id 3, name DCGM_INTERNAL_JOB, connectionId 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmFieldGroup.cpp:172] [DcgmFieldGroupManager::AddFieldGroup]
    2022-12-02 10:41:26.277 INFO [83727:83727] Created thread named "cache_mgr_main" ID 1854199552 DcgmThread ptr 0x0xe95a10 [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:110] [DcgmThread::Start]
    2022-12-02 10:41:26.277 DEBUG [83727:83727] Skipping waitForUpdate since the cache manager thread is not running yet. [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2394] [DcgmCacheManager::UpdateAllFields]
    2022-12-02 10:41:26.278 DEBUG [83727:83727] dcgmStartEmbedded(): Embedded host engine started [/workspaces/HGDCGM/dcgmlib/src/DcgmApi.cpp:4826] [{anonymous}::StartEmbeddedV2]
    2022-12-02 10:41:26.278 DEBUG [83727:83732] Thread handle 1854199552 running [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:299] [DcgmThread::RunInternal]
    2022-12-02 10:41:26.278 DEBUG [83727:83727] Entering dcgmEngineRun(unsigned short portNumber, char const *socketPath, unsigned int isConnectionTCP) (5555 127.0.0.1 1) [/workspaces/HGDCGM/dcgmlib/entry_point.h:73] [dcgmEngineRun]
    2022-12-02 10:41:26.278 INFO [83727:83732] Cache manager update thread starting [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:6053] [DcgmCacheManager::run]
    2022-12-02 10:41:26.278 DEBUG [83727:83732] Preparing to update watchInfo 0xeaf500, eg 1, eid 1, fieldId 512 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:5278] [DcgmCacheManager::ActuallyUpdateAllFields]
    2022-12-02 10:41:26.278 INFO [83727:83727] Created thread named "dcgm_ipc" ID 1845806848 DcgmThread ptr 0x0xe94610 [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:110] [DcgmThread::Start]
    2022-12-02 10:41:26.278 DEBUG [83727:83732] Checking status for gpu 1 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2275] [DcgmCacheManager::GetGpuStatus]
    2022-12-02 10:41:26.278 DEBUG [83727:83733] Thread handle 1845806848 running [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:299] [DcgmThread::RunInternal]
    2022-12-02 10:41:26.278 DEBUG [83727:83732] Appended entity blob eg 1, eid 1, fieldId 512, ts 1669948886278497, valueSize 2048, cached 1, buffered 0 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:6452] [DcgmCacheManager::AppendEntityBlob]
    2022-12-02 10:41:26.278 ERROR [83727:83733] bind failed. port 5555, address 127.0.0.1, errno 98 [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:326] [DcgmIpc::InitTCPListenerSocket]
    2022-12-02 10:41:26.278 ERROR [83727:83733] InitTCPListenerSocket() returned Generic unspecified error [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:195] [DcgmIpc::run]
    2022-12-02 10:41:26.278 DEBUG [83727:83732] Preparing to update watchInfo 0xeaf360, eg 1, eid 1, fieldId 510 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:5278] [DcgmCacheManager::ActuallyUpdateAllFields]
    2022-12-02 10:41:26.278 ERROR [83727:83727] initFuture returned Generic unspecified error [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:131] [DcgmIpc::Init]
    2022-12-02 10:41:26.278 DEBUG [83727:83732] Checking status for gpu 1 [/workspaces/HGDCGM/dcgmlib/src/DcgmCacheManager.cpp:2275] [DcgmCacheManager::GetGpuStatus]
    2022-12-02 10:41:26.278 ERROR [83727:83727] Got error Generic unspecified error from m_dcgmIpc.Init [/workspaces/HGDCGM/dcgmlib/src/DcgmHostEngineHandler.cpp:3767] [DcgmHostEngineHandler::RunServer]
    2022-12-02 10:41:26.278 DEBUG [83727:83733] Thread id 1845806848 stopped [/workspaces/HGDCGM/common/DcgmThread/DcgmThread.cpp:308] [DcgmThread::RunInternal]
    2022-12-02 10:41:26.278 DEBUG [83727:83727] Returning -7 [/workspaces/HGDCGM/dcgmlib/entry_point.h:73] [dcgmEngineRun]

Thanks.

nikkon-dev commented 1 year ago

@ligeweiwu,

That's an easy one: you either already have an nv-hostengine process running, or something else is listening on TCP port 5555.

    2022-12-02 10:41:26.278 ERROR [83727:83733] bind failed. port 5555, address 127.0.0.1, errno 98

Normally, when nv-hostengine runs, it creates a PID file that lets it detect later whether another instance is already running. That may not be true for debug builds.
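
A quick way to check is sketched below (standard tools; nv-hostengine's -t flag asks a running instance to terminate):

    # See what is listening on TCP 5555:
    sudo ss -ltnp | grep ':5555'

    # If it is a stale nv-hostengine, stop it before restarting:
    sudo ./nv-hostengine -t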

ligeweiwu commented 1 year ago

@nikkon-dev Thanks for your help. That error has been solved. But when I set the environment variable "export NVVS_BIN_PATH=XXX/GDCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/", then start the nv-hostengine service, and after that run "./dcgmi diag -r 1 -g 2" (the GPU group has already been created), it still gives me the message: Couldn't parse json: 'WARNING: You must also provide env __NVVS_DBG_FILE='. By the way, when I install the deb package, this error does not happen. So is there another step that I missed? Thanks

nikkon-dev commented 1 year ago

@ligeweiwu,

In the debug nv-hostengine logs, look for the NVVS-related lines. Also, there may be an nvvs.log file next to the nvvs binary.

ligeweiwu commented 1 year ago

@nikkon-dev Hi, these are the NVVS-related lines:

    2022-12-01 22:01:31.184 DEBUG [1909:1911] [[Diag]] Unknown subcommand: 1 [/workspaces/HGDCGM/modules/diag/DcgmModuleDiag.cpp:144] [DcgmModuleDiag::ProcessCoreMessage]
    2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] External command stdout: WARNING: You must also provide env __NVVS_DBG_FILE= [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:676] [DcgmDiagManager::PerformExternalCommand]
    2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] External command stderr: { "DCGM GPU Diagnostic" : { "runtime_error" : "Unable to get the driver version: Host engine connection invalid/disconnected. Couldn't succeed despite 0 retries.", "version" : "440.37" } } [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:677] [DcgmDiagManager::PerformExternalCommand]
    2022-12-01 22:01:31.195 DEBUG [1909:1910] [[Diag]] The external command '/usr/share/nvidia-validation-suite/nvvs' returned a non-zero exit code: 1 [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:705] [DcgmDiagManager::PerformExternalCommand]
    2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] Failed to parse NVVS output: WARNING: You must also provide env __NVVS_DBG_FILE= [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:1145] [DcgmDiagManager::ValidateNvvsOutput]
    2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] Error happened during JSON parsing of NVVS output: The GPU Diagnostic returned Json that cannot be parsed. [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:885] [DcgmDiagManager::RunDiag]
    2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] NVVS stderr: { "DCGM GPU Diagnostic" : { "runtime_error" : "Unable to get the driver version: Host engine connection invalid/disconnected. Couldn't succeed despite 0 retries.", "version" : "440.37" } } [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:886] [DcgmDiagManager::RunDiag]
    2022-12-01 22:01:31.195 ERROR [1909:1910] [[Diag]] RunDiagAndAction returned -40 [/workspaces/HGDCGM/modules/diag/DcgmModuleDiag.cpp:120] [DcgmModuleDiag::ProcessRun_v6]
    2022-12-01 22:01:31.195 DEBUG [1909:2000] Sending message to 17 [/workspaces/HGDCGM/common/transport/DcgmIpc.cpp:1198] [DcgmIpc::SendMessageImpl]

I see there is the message "The external command '/usr/share/nvidia-validation-suite/nvvs' returned a non-zero exit code: 1 [/workspaces/HGDCGM/modules/diag/DcgmDiagManager.cpp:705] [DcgmDiagManager::PerformExternalCommand]". That means it uses the location /usr/share/nvidia-validation-suite/nvvs, but my workspace is not there, and I did set the env variable NVVS_BIN_PATH to my workspace (export NVVS_BIN_PATH=XXX/GDCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/). Is there another env variable that I should set (e.g. __NVVS_DBG_FILE)? Thanks.

nikkon-dev commented 1 year ago

Do you set NVVS_BIN_PATH for the root user or for your current user? If you run export NVVS_BIN_PATH= and then sudo ./nv-hostengine..., the env variable is not set for the root user that runs nv-hostengine. The logfile should also mention the NVVS_BIN_PATH variable and its value.
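
A sketch of the difference (illustrative path):

    # Exported for your user only; root's environment under plain sudo will not have it:
    export NVVS_BIN_PATH=/path/to/_out/Linux-amd64-debug/share/nvidia-validation-suite
    sudo ./nv-hostengine -n -f host.log --log-level debug    # NVVS_BIN_PATH is NOT set here

    # Pass the variable through sudo explicitly so the root process sees it:
    sudo NVVS_BIN_PATH="$NVVS_BIN_PATH" ./nv-hostengine -n -f host.log --log-level debug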

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, (1) I do not find NVVS_BIN_PATH in the logfile. (2) I had set export NVVS_BIN_PATH= and then run sudo ./nv-hostengine...; when I instead run ./nv-hostengine... without sudo, it works fine.

Thanks for your help.

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, sorry to bother you again, but I still have some problems with the debug version of DCGM. Now I can run the Diag test on the corresponding GPU group. But when I use gdb to step into the source code, it cannot find the source code correctly. For example:

  1. Build the debug version: ./build.sh -d -c. After that, it generates _out/Linux-amd64-debug/bin, and the corresponding binaries such as dcgmi are in it.
  2. Run the test: ./dcgmi diag -r 1 -g 2. It works fine; everything is fine.
  3. Use gdb to step into the source code: gdb --args ./dcgmi diag -r 1 -g 2. I want to add a breakpoint at Diag::RunDiagOnce, so I do b Diag::RunDiagOnce and then run, but the program never hits the breakpoint. When I add some print statements and rebuild, the new output shows up, but it still never reaches the breakpoint (Diag::RunDiagOnce). Which step am I missing? Or am I running gdb on the wrong executable (maybe not dcgmi)? Thanks

ligeweiwu commented 1 year ago

@nikkon-dev Another example: if in GDB I input b CommandLineParser.cpp:99 (line 99 is "// Entry method into this class for a given command line provided by main()"), GDB gives the feedback: Breakpoint at 0x44b61e: file /opt/cross/x86_64-linux-gnu/include/tclap/CommandLineParser.cpp, line 165.

The breakpoint GDB reports is not the one I set. So I think the debug symbols of dcgmi may be corrupted? Is there anything else I should add to the command "./build.sh -d -c"?

Thanks

nikkon-dev commented 1 year ago

@ligeweiwu,

There are several steps that you need to take to be able to debug:

  1. If you were using the buildcontainer to build DCGM, the source location differs from your host. You will need to tell gdb the right place to find the sources and how to convert the buildcontainer path to your host path. Something like this should be added to a .gdbinit file:

    directory /home/your_host_user/src/DCGM
    set substitute-path /workspaces/dcgm /home/your_host_user/src/DCGM

    The buildcontainer mounts sources to /workspaces/dcgm inside the build container.

  2. dcgmi diag will not let you debug the diagnostic itself. In a nutshell, when you run dcgmi diag, it connects to nv-hostengine to send the commands, and nv-hostengine runs the nvvs binary to perform the actual diagnostic. So the path is dcgmi -> nv-hostengine -> nvvs. You will need to debug nv-hostengine and tell gdb to follow children on forks, as sketched below.
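
A sketch of that setup with gdb's standard fork-following settings:

    # Debug the host engine in the foreground and follow the fork into nvvs:
    sudo gdb --args ./nv-hostengine -n -f host.log --log-level debug
    (gdb) set follow-fork-mode child    # follow the forked child process
    (gdb) set follow-exec-mode new      # keep debugging across the exec of nvvs
    (gdb) run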

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, I tried it, but there is still a problem. (1) My workspace is /mnt/ssd/liuzhenli.lzl/DCGM. (2) The executable path is /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin. (3) I added a .gdbinit file in /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin, and its content is:

    directory /mnt/ssd/liuzhenli.lzl/DCGM
    set substitute-path /workspaces/DCGM /mnt/ssd/liuzhenli.lzl/DCGM

(4) Now I just want to test the CommandLineParser function, e.g. in CommandLineParser.cpp (the asterisks in the signature were eaten by formatting; reconstructed here):

    1307: dcgmReturn_t CommandLineParser::ProcessDiagCommandLine(int argc, char const *const *argv)
    1308: {
    1309:     // Check for stop diag request
    1310:     const char *value = std::getenv(STOP_DIAG_ENV_VARIABLE_NAME);
    1311:     if (value != nullptr)

so I do: (4.1) gdb --args ./dcgmi diag -r 1 -g 2 (I think CommandLineParser is handled by dcgmi; is that right?); (4.2) b CommandLineParser.cpp:1310 and then run in gdb.

(5) The gdb feedback is still: Breakpoint 1 at 0x408399: file /workspaces/DCGM/dcgmi/main_dcgmi.cpp, line 3419.

(6) I also looked at build.sh in the DCGM repo, and the docker start command is:

    docker run --rm -u "$(id -u)":"$(id -g)" \
        ${DOCKER_ARGS:-} \
        -v "${DIR}":"${REMOTE_DIR}" \

I printed ${DIR}, and it is actually /mnt/ssd/liuzhenli.lzl/DCGM, and ${REMOTE_DIR} is /workspaces/DCGM. So I think I added set substitute-path correctly in the .gdbinit file. Is there anything I am doing wrong?

By the way, when I add other print statements in CommandLineParser.cpp and then run ./dcgmi diag -r 1 -g 2, it always prints the new output.

Sorry to bother you. Thanks

nikkon-dev commented 1 year ago

It's /workspaces/dcgm, not capital letters; the path is case-sensitive.

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, now this is the content of my .gdbinit file in the path /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin (I have changed DCGM to dcgm for /workspaces):

    directory /mnt/ssd/liuzhenli.lzl/DCGM
    set substitute-path /workspaces/dcgm /mnt/ssd/liuzhenli.lzl/DCGM

However, when I run gdb (gdb --args ./dcgmi diag -r 1 -g 2, then b CommandLineParser.cpp:1310), it still gives me the feedback: "Breakpoint 1 at 0x408389: file /workspaces/DCGM/dcgmi/main_dcgmi.cpp, line 3419."

Maybe there is some problem with gdb stepping into the source code of the DCGM debug version?

By the way, my gdb version is GNU gdb (Ubuntu 8.2-0ubuntu1~16.04.1) 8.2, and when I start gdb on ./dcgmi, it also gives the warnings:

    BFD: warning: /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin/../lib/libdcgm.so.3: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010001
    BFD: warning: /mnt/ssd/liuzhenli.lzl/DCGM/_out/Linux-amd64-debug/bin/../lib/libdcgm.so.3: unsupported GNU_PROPERTY_TYPE (5) type: 0xc0010002

Thanks for your help.

nikkon-dev commented 1 year ago

You need to update gdb to a version that understands the DWARF5 debug info format. DCGM is built with GCC 11 and uses DWARF5 by default. You need GDB 10 or newer to be able to read that debug info.
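
You can check your version quickly:

    gdb --version | head -1    # needs to report GDB 10 or newer for DWARF5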

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, thanks for your help. I have another question. As you said, "The provided command would start nv-hostengine daemon process that writes debug logs into the host.log file." For instance, if I run ./dcgmi diag -r 3 -g 2 and look at the content of the host.log file, it contains the debug log of many functions (e.g. functions defined in DcgmCacheManager.cpp). In other words, is it fair to say that host.log covers almost all of the functions used by "./dcgmi diag -r 3 -g 2"? Thanks.

nikkon-dev commented 1 year ago

The host.log would contain only the nv-hostengine part. The real work is done by the nvvs binary, which has its own log. For debugging needs, you could run nvvs directly.

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, the path is dcgmi -> nv-hostengine -> nvvs. In other words, the nvvs binary executes the diagnostic test on the corresponding GPU, and nv-hostengine can use NVML functions to monitor the performance metrics of the GPU (e.g. power, temperature, ECC...). Is my understanding right? Thanks.

nikkon-dev commented 1 year ago

That's correct.

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon, I have another question. Does DCGM support CUDA 12 now? When I update the CUDA driver and run the previous workflow again, it gives me an error:

_out/Linux-amd64-debug/share/nvidia-validation-suite/nvvs.log:78:2022-12-12 15:56:40.014 ERROR [29094:29094] Detected unsupported Cuda version: 12.0 [/workspaces/DCGM/nvvs/src/TestFramework.cpp:243] [TestFramework::GetPluginDirExtension]

Thanks

nikkon-dev commented 1 year ago

CUDA 12 will be supported in version 3.1.5 and later. We are going to release it at the beginning of this week.

ligeweiwu commented 1 year ago

@nikkon-dev Hello nikkon, now I am debugging "dcgmi diag -r 1". As you said before, I can run the nvvs binary directly for debugging purposes. However, when I run ./nvvs -h, I could not find the same test info as "dcgmi diag -r 1". Could you tell me how to run nvvs directly to get the same effect as running "dcgmi diag -r 1"? Thank you very much.

nikkon-dev commented 1 year ago

That's --specifiedtest short.

The -r levels map to:

    -r 1 - short
    -r 2 - long
    -r 3 - xlong
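
So, assuming the debug-build layout from earlier in the thread, something like the following should mirror dcgmi diag -r 1 (a sketch, not exact parity):

    # Run nvvs from its own directory so it can find its plugins and config:
    cd _out/Linux-amd64-debug/share/nvidia-validation-suite
    sudo ./nvvs --specifiedtest short    # roughly what `dcgmi diag -r 1` requests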

ligeweiwu commented 1 year ago

@nikkon-dev Thanks for your help. It works.

ligeweiwu commented 1 year ago

@nikkon-dev Hello nikkon

I have another issue with debugging. Now I am running the command "./nvvs --specifiedtest pcie". I found that GDB can follow the source code located in the nvvs folder, but it cannot step into the source code in the pcie folder (e.g. PcieWrapper.cpp).

Let me give a more specific example. In PluginLib.cpp there is a callback invocation, "m_runTestCB(timeout, numParameters, parms, m_userData);". GDB can step through the source code of PluginLib.cpp, but when I enter "s" at "m_runTestCB", I cannot follow the source code of RunTest, which is defined in PcieWrapper.cpp. I also tried adding other breakpoints in the pcie source code, but they do not work either.

Could you tell me how to fix it?

Thanks

nikkon-dev commented 1 year ago

In gdb, you need to set solib-search-path to the plugins/cudaXX directory, where XX is your driver's CUDA version: 10, 11 or 12.
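
For example (illustrative path; pick the cudaXX directory matching your driver):

    (gdb) set solib-search-path /path/to/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11
    (gdb) info sharedlibrary    # verify the plugin .so files now have their symbols read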

ligeweiwu commented 1 year ago

@nikkon-dev Hi nikkon

I am building the DCGM source code and using nvvs/dcgmi to perform the diagnostic tests. I can see all the plugin tests, and they are all .so files. But when I try to perform the "memory bandwidth" diagnostic, it gives me an error:

    ./dcgmi diag -r "memory bandwidth" -g 2
    Error: requested test "memory bandwidth" was not found among possible test choices.

In my case, all the plugin .so files are in /username/DCGM/_out/Linux-amd64-debug/share/nvidia-validation-suite/plugins/cuda11, and none of them is named "memory bandwidth". I also looked at the source code, and I don't think it has a test named "memory bandwidth"; it only has "memtest".

So please tell me, how can I run "memory bandwidth" using the DCGM source code?

By the way, memtest is OK ("./dcgmi diag -r memtest -g 2" works fine; I also see the corresponding libMemtest.so in plugins/cuda11, and the source code has the "memtest" option).

Thanks