NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs.
Apache License 2.0

dcgm-exporter crashes hostengine. #155

Open krono opened 4 months ago

krono commented 4 months ago

Running the 3.3.5-3.4.0 dcgm-exporter against a 3.3.5 host engine, as shipped via the NVIDIA Ubuntu repos, SEGFAULTs the host engine.

Is there something I can do? Should this be reported to the exporter project instead?

Logs:

dmesg crash info ``` Feb28 16:22] nvidia-nvswitch5: open (major=510) [ +0,042810] nvidia-nvswitch4: open (major=510) [ +0,042606] nvidia-nvswitch0: open (major=510) [ +0,042409] nvidia-nvswitch2: open (major=510) [ +0,042448] nvidia-nvswitch1: open (major=510) [ +0,042372] nvidia-nvswitch3: open (major=510) [Feb28 16:29] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000] [ +0,000008] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41 [ +0,155916] nvidia-nvswitch3: release (major=510) [ +0,000005] nvidia-nvswitch1: release (major=510) [ +0,000002] nvidia-nvswitch2: release (major=510) [ +0,000003] nvidia-nvswitch0: release (major=510) [ +0,000002] nvidia-nvswitch4: release (major=510) [ +0,000002] nvidia-nvswitch5: release (major=510) ```
journal for exporter and hostengine ``` Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service. Feb 28 16:21:58 gx01 nv-hostengine[1280055]: DCGM initialized Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555 Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter. Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter" Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555" Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!" Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics" Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'" Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded" Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}" Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected" Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected" Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'. Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'. ```
Versions ``` # dcgm-exporter -v --debug DCGM Exporter version 3.3.5-3.4.0 # dcgmi -v Version : 3.3.5 Build ID : 14 Build Date : 2024-02-24 Build Type : Release Commit ID : 93088b0e1286c6e7723af1930251298870e26c19 Branch Name : rel_dcgm_3_3 CPU Arch : x86_64 Build Platform : Linux 4.15.0-180-generic #189-Ubuntu SMP Wed May 18 14:13:57 UTC 2022 x86_64 CRC : 08a0d9624b562a1342bf5f8828939294 ```
apt-cache policy datacenter-gpu-manager ``` # apt-cache policy datacenter-gpu-manager datacenter-gpu-manager: Installed: 1:3.3.5 Candidate: 1:3.3.5 Version table: *** 1:3.3.5 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 100 /var/lib/dpkg/status 1:3.3.3 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.3.1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.3.0 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.2.6 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.2.5 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.2.3 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.1.8 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.1.7 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.1.6 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.1.3 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:3.0.4 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.4.8 600 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.4.7 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.4.6 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.4.5 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.3.6 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.3.5 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.3.4 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.3.2 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.3.1 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.2.9 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.2.8 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.2.3 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.1.8 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.1.7 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.1.4 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.0.15 580 580 
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 1:2.0.14 600 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal-updates/common amd64 Packages 1:2.0.13 600 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 600 https://repo.download.nvidia.com/baseos/ubuntu/focal/x86_64 focal/common amd64 Packages ```
OS info ``` # cat /etc/dgx-release DGX_NAME="DGX Server" DGX_PRETTY_NAME="NVIDIA DGX Server" DGX_SWBUILD_DATE="2020-10-26-11-53-11" DGX_SWBUILD_VERSION="5.0.0" DGX_COMMIT_ID="7501dff" DGX_PLATFORM="DGX Server for DGX A100" DGX_SERIAL_NUMBER="XXXXXXXXXXXX" DGX_OTA_VERSION="5.0.5" DGX_OTA_DATE="XXXXXXXXXXXXXXXXX" DGX_OTA_VERSION="5.1.1" DGX_OTA_DATE="XXXXXXXXXXXXXXXXX" DGX_OTA_VERSION="5.2.0" DGX_OTA_DATE="XXXXXXXXXXXXXXXXX" DGX_OTA_VERSION="5.3.1" DGX_OTA_DATE="XXXXXXXXXXXXXXXXX" DGX_OTA_VERSION="5.5.1" DGX_OTA_DATE="XXXXXXXXXXXXXXXXX" ```
superg commented 4 months ago

Hi @krono, thank you for the report. Is the issue easily reproducible? Would it be possible to get an nv-hostengine core dump?

EDIT: follow-up question: do you get any syslog kernel error messages for NVLink in the 16:21 - 16:29 timeframe?

krono commented 4 months ago

Hi @superg (somehow I don't get GitHub mails anymore, sorry)

kernel syslog messages in timeframe ``` root@gx01:/var/log# grep '^Feb 28 16:[23]' syslog.1 Feb 28 16:20:50 gx01 kernel: [103278.185814] audit: type=1400 audit(1709133650.187:1080): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1278914/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:21:20 gx01 kernel: [103308.432689] audit: type=1400 audit(1709133680.436:1081): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279373/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:21:30 gx01 kernel: [103318.719235] audit: type=1400 audit(1709133690.720:1082): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279512/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:21:31 gx01 kernel: [103319.282115] audit: type=1400 audit(1709133691.284:1083): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279555/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:21:31 gx01 kernel: [103319.491742] audit: type=1400 audit(1709133691.492:1084): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1279656/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:21:57 gx01 systemd[1]: Started NVIDIA DCGM service. Feb 28 16:21:57 gx01 kernel: [103345.680201] nvidia-nvswitch5: open (major=510) Feb 28 16:21:57 gx01 kernel: [103345.723011] nvidia-nvswitch4: open (major=510) Feb 28 16:21:57 gx01 kernel: [103345.765617] nvidia-nvswitch0: open (major=510) Feb 28 16:21:57 gx01 kernel: [103345.808026] nvidia-nvswitch2: open (major=510) Feb 28 16:21:57 gx01 kernel: [103345.850474] nvidia-nvswitch1: open (major=510) Feb 28 16:21:57 gx01 kernel: [103345.892846] nvidia-nvswitch3: open (major=510) Feb 28 16:21:58 gx01 nv-hostengine: DCGM initialized Feb 28 16:21:58 gx01 nv-hostengine[1280055]: Started host engine version 3.3.5 using port number: 5555 Feb 28 16:22:03 gx01 kernel: [103351.124025] audit: type=1400 audit(1709133723.125:1085): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1280110/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:22:32 gx01 systemd[1]: Started DCGM Exporter. Feb 28 16:22:32 gx01 kernel: [103380.216261] audit: type=1400 audit(1709133752.217:1086): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/28026/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:22:32 gx01 dcgm-exporter[1280577]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter) Feb 28 16:22:32 gx01 dcgm-exporter[1280577]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter) Feb 28 16:22:32 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE Feb 28 16:22:32 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'. Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Scheduled restart job, restart counter is at 3. Feb 28 16:23:02 gx01 systemd[1]: Stopped DCGM Exporter. Feb 28 16:23:02 gx01 systemd[1]: Started DCGM Exporter. 
Feb 28 16:23:02 gx01 dcgm-exporter[1280963]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter) Feb 28 16:23:02 gx01 dcgm-exporter[1280963]: /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /net/mgmtdelab/pool/html/dcgm/3.3.5/x86_64/bin/dcgm-exporter) Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE Feb 28 16:23:02 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'. Feb 28 16:23:07 gx01 systemd[1]: Stopped DCGM Exporter. Feb 28 16:24:11 gx01 kernel: [103479.881565] audit: type=1400 audit(1709133851.887:1087): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281838/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:24:12 gx01 kernel: [103480.388663] audit: type=1400 audit(1709133852.395:1088): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281885/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:24:12 gx01 kernel: [103480.539563] audit: type=1400 audit(1709133852.543:1089): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281908/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:24:13 gx01 kernel: [103481.137739] audit: type=1400 audit(1709133853.143:1090): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281946/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:24:13 gx01 kernel: [103481.651807] audit: type=1400 audit(1709133853.655:1091): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1281992/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:24:13 gx01 kernel: [103481.804767] audit: type=1400 audit(1709133853.811:1092): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282016/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:24:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded. Feb 28 16:24:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded. 
Feb 28 16:25:01 gx01 kernel: [103529.974717] audit: type=1400 audit(1709133901.980:1093): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282647/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:25:01 gx01 CRON[1282648]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) Feb 28 16:25:01 gx01 kernel: [103529.976017] audit: type=1400 audit(1709133901.984:1094): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1282648/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:25:52 gx01 kernel: [103580.650881] audit: type=1400 audit(1709133952.657:1095): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1284754/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:26:22 gx01 kernel: [103610.898676] audit: type=1400 audit(1709133982.906:1096): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1285172/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:28:11 gx01 kernel: [103719.026823] audit: type=1400 audit(1709134091.036:1097): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297267/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:28:11 gx01 kernel: [103719.579399] audit: type=1400 audit(1709134091.588:1098): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297309/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:28:11 gx01 kernel: [103719.755666] audit: type=1400 audit(1709134091.764:1099): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1297333/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:29:08 gx01 systemd[1]: Started DCGM Exporter. Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Starting dcgm-exporter" Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="Attemping to connect to remote hostengine at localhost:5555" Feb 28 16:29:08 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:08+01:00" level=info msg="DCGM successfully initialized!" 
Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Collecting DCP Metrics" Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Falling back to metric file '/net/mgmtdelab/pool/html/dcgm/current/counters.csv'" Feb 28 16:29:09 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:09+01:00" level=info msg="Initializing system entities of type: GPU" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvSwitch" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: NvLink" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Initializing system entities of type: CPU Core" Feb 28 16:29:11 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:11+01:00" level=info msg="Not collecting CPU Core metrics; Error retrieving DCGM MIG hierarchy: This request is serviced by a module of DCGM that is not currently loaded" Feb 28 16:29:13 gx01 kernel: [103781.951727] nv-hostengine[1280071]: segfault at 28 ip 00007f09f65c74b2 sp 00007f09f61e2ba0 error 6 in libdcgmmodulenvswitch.so.3.3.5[7f09f658c000+f8000] Feb 28 16:29:13 gx01 kernel: [103781.951735] Code: 7d b8 44 88 6d b0 e8 7d 0a ff ff 48 8b 45 a8 48 8b 73 18 48 89 45 c0 48 3b 73 20 0f 84 df 00 00 00 66 0f 6f 45 b0 48 83 c6 18 <0f> 11 46 e8 48 8b 45 c0 48 89 46 f8 48 89 73 18 48 8d 65 d8 5b 41 Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="can not destroy group" error="Error destroying group: Host engine connection invalid/disconnected" groupID="{21}" Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=warning msg="Cannot destroy field group." error="Host engine connection invalid/disconnected" Feb 28 16:29:14 gx01 dcgm-exporter[1298060]: time="2024-02-28T16:29:14+01:00" level=fatal msg="Failed to watch metrics: Error watching fields: Host engine connection invalid/disconnected" Feb 28 16:29:14 gx01 kernel: [103782.107651] nvidia-nvswitch3: release (major=510) Feb 28 16:29:14 gx01 kernel: [103782.107656] nvidia-nvswitch1: release (major=510) Feb 28 16:29:14 gx01 kernel: [103782.107658] nvidia-nvswitch2: release (major=510) Feb 28 16:29:14 gx01 kernel: [103782.107661] nvidia-nvswitch0: release (major=510) Feb 28 16:29:14 gx01 kernel: [103782.107663] nvidia-nvswitch4: release (major=510) Feb 28 16:29:14 gx01 kernel: [103782.107665] nvidia-nvswitch5: release (major=510) Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Main process exited, code=exited, status=1/FAILURE Feb 28 16:29:14 gx01 systemd[1]: dcgm-exporter.service: Failed with result 'exit-code'. Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Main process exited, code=killed, status=11/SEGV Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Failed with result 'signal'. Feb 28 16:29:14 gx01 systemd[1]: nvidia-dcgm.service: Scheduled restart job, restart counter is at 1. Feb 28 16:29:14 gx01 systemd[1]: Stopped NVIDIA DCGM service. 
Feb 28 16:29:14 gx01 systemd[1]: Started NVIDIA DCGM service. Feb 28 16:29:14 gx01 kernel: [103782.832440] nvidia-nvswitch5: open (major=510) Feb 28 16:29:14 gx01 kernel: [103782.875110] nvidia-nvswitch4: open (major=510) Feb 28 16:29:14 gx01 kernel: [103782.918412] nvidia-nvswitch0: open (major=510) Feb 28 16:29:14 gx01 kernel: [103782.961611] nvidia-nvswitch2: open (major=510) Feb 28 16:29:15 gx01 kernel: [103783.004035] nvidia-nvswitch1: open (major=510) Feb 28 16:29:15 gx01 kernel: [103783.046633] nvidia-nvswitch3: open (major=510) Feb 28 16:29:15 gx01 nv-hostengine: DCGM initialized Feb 28 16:29:15 gx01 nv-hostengine[1298239]: Started host engine version 3.3.5 using port number: 5555 Feb 28 16:29:24 gx01 systemd[1]: Stopped DCGM Exporter. Feb 28 16:29:24 gx01 systemd[1]: Stopping NVIDIA DCGM service... Feb 28 16:29:24 gx01 kernel: [103792.219790] nvidia-nvswitch3: release (major=510) Feb 28 16:29:24 gx01 kernel: [103792.219989] nvidia-nvswitch1: release (major=510) Feb 28 16:29:24 gx01 kernel: [103792.220182] nvidia-nvswitch2: release (major=510) Feb 28 16:29:24 gx01 kernel: [103792.220374] nvidia-nvswitch0: release (major=510) Feb 28 16:29:24 gx01 kernel: [103792.220575] nvidia-nvswitch4: release (major=510) Feb 28 16:29:24 gx01 kernel: [103792.220761] nvidia-nvswitch5: release (major=510) Feb 28 16:29:24 gx01 systemd[1]: nvidia-dcgm.service: Succeeded. Feb 28 16:29:24 gx01 systemd[1]: Stopped NVIDIA DCGM service. Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: launch task StepId=838896.12 request from UID:12211 GID:5101 HOST:172.20.26.64 PORT:39002 Feb 28 16:29:32 gx01 kernel: [103800.729913] audit: type=1400 audit(1709134172.738:1100): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/27067/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: lllp_distribution: JobId=838896 implicit auto binding: cores, dist 1 Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: _task_layout_lllp_cyclic: _task_layout_lllp_cyclic Feb 28 16:29:32 gx01 slurmd[27067]: slurmd: task/affinity: _lllp_generate_cpu_bind: _lllp_generate_cpu_bind jobid [838896]: mask_cpu, 0x0000000000000001000000000000000000000000000000010000000000000000 Feb 28 16:29:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded. Feb 28 16:29:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded. Feb 28 16:30:55 gx01 kernel: [103883.096874] audit: type=1400 audit(1709134255.111:1101): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1299484/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:31:25 gx01 kernel: [103913.358790] audit: type=1400 audit(1709134285.372:1102): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1299958/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:32:28 gx01 kernel: [103976.699653] audit: type=1400 audit(1709134348.713:1103): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/28026/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:34:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded. Feb 28 16:34:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded. 
Feb 28 16:35:01 gx01 CRON[1302542]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1) Feb 28 16:35:01 gx01 kernel: [104129.968910] audit: type=1400 audit(1709134501.985:1104): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1302541/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:35:01 gx01 kernel: [104129.970106] audit: type=1400 audit(1709134501.985:1105): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1302542/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:35:57 gx01 kernel: [104185.572325] audit: type=1400 audit(1709134557.590:1106): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1303272/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:36:27 gx01 kernel: [104215.823679] audit: type=1400 audit(1709134587.842:1107): apparmor="ALLOWED" operation="open" profile="/usr/sbin/sssd" name="/proc/1303523/cmdline" pid=27038 comm="sssd_nss" requested_mask="r" denied_mask="r" fsuid=0 ouid=0 Feb 28 16:39:54 gx01 systemd[1]: tmp-.esp_tmp-nvme1n1p1.mount: Succeeded. Feb 28 16:39:54 gx01 systemd[1]: tmp-.esp_tmp-nvme2n1p1.mount: Succeeded. root@gx01:/var/log# ```

Reproducible? For me, every time. Core dump:

coredumpctl ``` # coredumpctl dump nv-hostengine >nv-hostengine.core PID: 3124092 (nv-hostengine) UID: 0 (root) GID: 0 (root) Signal: 11 (SEGV) Timestamp: Fri 2024-03-01 10:07:47 CET (59s ago) Command Line: /usr/bin/nv-hostengine -n --service-account nvidia-dcgm Executable: /usr/bin/nv-hostengine Control Group: /system.slice/nvidia-dcgm.service Unit: nvidia-dcgm.service Slice: system.slice Boot ID: 35a3b73c95c04716880c91638ed46a93 Machine ID: 5b05d12040d24d8e9c8d38117ab12eba Hostname: gx01 Storage: /var/lib/systemd/coredump/core.nv-hostengine.0.35a3b73c95c04716880c91638ed46a93.3124092.1709284067000000000000.lz4 Message: Process 3124092 (nv-hostengine) of user 0 dumped core. Stack trace of thread 3124104: #0 0x00007fc97150a4b2 n/a (libdcgmmodulenvswitch.so.3 + 0x594b2) #1 0x00007fc9711c5456 n/a (libnvidia-nscq.so.2 + 0x9d456) #2 0x00007fc9711760a3 n/a (libnvidia-nscq.so.2 + 0x4e0a3) #3 0x00007fc9711756bf n/a (libnvidia-nscq.so.2 + 0x4d6bf) #4 0x00007fc97117eb80 nscq_session_path_observe (libnvidia-nscq.so.2 + 0x56b80) #5 0x00007fc9715530e7 n/a (libdcgmmodulenvswitch.so.3 + 0xa20e7) #6 0x00007fc97152038f n/a (libdcgmmodulenvswitch.so.3 + 0x6f38f) #7 0x00007fc9714eb19d n/a (libdcgmmodulenvswitch.so.3 + 0x3a19d) #8 0x00007fc9714d6e9f n/a (libdcgmmodulenvswitch.so.3 + 0x25e9f) #9 0x00007fc9714d7834 n/a (libdcgmmodulenvswitch.so.3 + 0x26834) #10 0x00007fc9714dafd8 n/a (libdcgmmodulenvswitch.so.3 + 0x29fd8) #11 0x00007fc9714dc4a4 n/a (libdcgmmodulenvswitch.so.3 + 0x2b4a4) #12 0x00007fc9714e43b6 n/a (libdcgmmodulenvswitch.so.3 + 0x333b6) #13 0x00007fc9714dabb1 n/a (libdcgmmodulenvswitch.so.3 + 0x29bb1) #14 0x00007fc971561e5b n/a (libdcgmmodulenvswitch.so.3 + 0xb0e5b) #15 0x00007fc9715623a9 n/a (libdcgmmodulenvswitch.so.3 + 0xb13a9) #16 0x00007fc97417a609 start_thread (libpthread.so.0 + 0x8609) #17 0x00007fc973f2f353 __clone (libc.so.6 + 0x11f353) Stack trace of thread 3124101: #0 0x00007fc973f2895d syscall (libc.so.6 + 0x11895d) #1 0x00007fc971598791 n/a (libdcgmmodulenvswitch.so.3 + 0xe7791) #2 0x00007fc9714d8a74 n/a (libdcgmmodulenvswitch.so.3 + 0x27a74) #3 0x00007fc97152a408 n/a (libdcgmmodulenvswitch.so.3 + 0x79408) #4 0x00007fc974201343 n/a (libdcgm.so.3 + 0x6c343) #5 0x00007fc9742c742e n/a (libdcgm.so.3 + 0x13242e) #6 0x00007fc974326025 n/a (libdcgm.so.3 + 0x191025) #7 0x00007fc974334e9c n/a (libdcgm.so.3 + 0x19fe9c) #8 0x00007fc974321438 n/a (libdcgm.so.3 + 0x18c438) #9 0x00007fc9742c9aa7 n/a (libdcgm.so.3 + 0x134aa7) #10 0x00007fc9742c9d2d n/a (libdcgm.so.3 + 0x134d2d) #11 0x00007fc9742c9f1f n/a (libdcgm.so.3 + 0x134f1f) #12 0x00007fc9742a71fe n/a (libdcgm.so.3 + 0x1121fe) #13 0x00007fc974348871 n/a (libdcgm.so.3 + 0x1b3871) #14 0x00007fc974352ad8 n/a (libdcgm.so.3 + 0x1bdad8) #15 0x00007fc9743530a4 n/a (libdcgm.so.3 + 0x1be0a4) #16 0x00007fc974231de6 n/a (libdcgm.so.3 + 0x9cde6) #17 0x00007fc974356192 n/a (libdcgm.so.3 + 0x1c1192) #18 0x00007fc9744111c8 n/a (libdcgm.so.3 + 0x27c1c8) #19 0x00007fc97417a609 start_thread (libpthread.so.0 + 0x8609) #20 0x00007fc973f2f353 __clone (libc.so.6 + 0x11f353) Stack trace of thread 3124092: #0 0x00007fc973eed23f clock_nanosleep (libc.so.6 + 0xdd23f) #1 0x00007fc973ef2ec7 __nanosleep (libc.so.6 + 0xe2ec7) #2 0x000000000040736b n/a (nv-hostengine + 0x736b) #3 0x00007fc973e34083 __libc_start_main (libc.so.6 + 0x24083) #4 0x00000000004079bc n/a (nv-hostengine + 0x79bc) ```

core.nv-hostengine.0.35a3b73c95c04716880c91638ed46a93.3124092.1709284067000000000000.lz4.zip

superg commented 4 months ago

Thank you for the dumps; I am currently looking into it and will share my findings here.

superg commented 4 months ago

I narrowed the search down to one of the NSCQ observe callbacks in: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/DcgmNvSwitchManager.cpp#L851C45-L851C53

To understand more, I would like to request debug-level logs from when the crash happens. Here's how to do it: make sure the nvidia-dcgm service (nv-hostengine) is running, then execute dcgmi set --logging-severity DEBUG; that will set the nv-hostengine logging level to DEBUG. Next, reproduce the crash and share /var/nv-hostengine.log (feel free to clear it beforehand if needed). The steps are sketched below.
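
A minimal command sketch of those steps (assuming the systemd unit names and the /var/nv-hostengine.log path mentioned above; adjust to your setup):

```
# 1. make sure the host engine is running
systemctl status nvidia-dcgm

# 2. switch nv-hostengine logging to DEBUG
dcgmi set --logging-severity DEBUG

# 3. optionally start from an empty log, then reproduce the crash,
#    e.g. by starting the exporter again
truncate -s 0 /var/nv-hostengine.log
systemctl restart dcgm-exporter

# 4. after the segfault, collect the log
cp /var/nv-hostengine.log nv-hostengine-debug.log
```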

krono commented 4 months ago

Hi, here's the log nv-hostengine.log

krono commented 4 months ago

I do not see the log point log_debug("Attaching to NvSwitches"); being hit…

superg commented 4 months ago

Thank you for the logs. Indeed, it crashed in another NSCQ observe callback: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/FieldDefinitions.cpp#L164

I am currently looking into the chain of events that led to this and will reply once I have more information.

superg commented 3 months ago

Unfortunately, we aren't able to reproduce this issue internally. However, we've added better debugging to help diagnose such issues in the future, and at some point it will be merged to GitHub.

krono commented 3 months ago

Is there any way I can debug that, like stepping through with a debugger?

superg commented 3 months ago

Yes. Basically, you will have to build a debug DCGM with symbols. Then you will be able to use GDB to step through the code, inspect variables, etc. Put a breakpoint here: https://github.com/NVIDIA/DCGM/blob/master/modules/nvswitch/FieldDefinitions.cpp#L164. Just want to mention that this is pretty advanced: it involves using the ./build.sh script and the docker dcgmbuild container (our build is containerized) and running nv-hostengine locally. A rough outline is sketched below.
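
A rough sketch of that workflow, for orientation only (the exact build.sh invocation and the path of the resulting nv-hostengine binary are assumptions; check the repository's build instructions for the details):

```
# build DCGM with debug symbols inside the dcgmbuild container
git clone https://github.com/NVIDIA/DCGM.git
cd DCGM
./build.sh

# run the freshly built host engine under GDB (binary path may differ)
gdb --args ./nv-hostengine -n

# inside GDB: set the breakpoint from the link above, then run
(gdb) break FieldDefinitions.cpp:164
(gdb) run

# trigger the crash (e.g. start dcgm-exporter in another shell),
# then inspect the state at the breakpoint
(gdb) bt
(gdb) print *dest
```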

krono commented 3 months ago

So here we are.

Debugging around this:

    auto cb = [](const indexTypes... indicies,
                 nscq_rc_t rc,
                 TempData<nscqFieldType, storageType, is_vector, indexTypes...>::cbType in,
                 NscqDataCollector<TempData<nscqFieldType, storageType, is_vector, indexTypes...>> *dest) {
        if (dest == nullptr)
        {
            log_error("NSCQ passed dest = nullptr");

            return;
        }

        dest->callCounter++;

        if (NSCQ_ERROR(rc))
        {
            log_error("NSCQ {} passed error {}", dest->nscqPath, (int)rc);

            TempData<nscqFieldType, storageType, is_vector, indexTypes...> item;

            item.CollectFunc(dest, indicies...);

            return;
        }

        TempData<nscqFieldType, storageType, is_vector, indexTypes...> item; /* BREAKPOINT HERE */

        item.CollectFunc(dest, in, indicies...);
    };

shows:

Normal behavior for stuff like temperatures or throughput:

gdb debug output for `*dest`: normal stuff ``` Thread 5 "nv-hostengine" hit Breakpoint 2, DcgmNs::DcgmNvSwitchManager::UpdateFields, false, nscq_uuid_t*>(unsigned short, DcgmFvBuffer&, std::vector > const&, long)::{lambda(nscq_uuid_t*, signed char, nscq_link_throughput_t, DcgmNs::NscqDataCollector, false, nscq_uuid_t*> >*)#1}::operator()(nscq_uuid_t*, signed char, nscq_link_throughput_t, DcgmNs::NscqDataCollector, false, nscq_uuid_t*> >*) const (__closure=0x0, indicies#0=0x564fc0, rc=0 '\000', in=..., dest=0x7ffff4afb290) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162 162 TempData item; $64 = { callCounter = 6, fieldId = 862, nscqPath = 0x7ffff4fedcc0 "/{nvswitch}/nvlink/throughput_counters", data = std::vector of length 5, capacity 8 = {{ index = std::tuple containing = { [1] = 0x564ef0 }, data = { = { value = 0 }, members of DcgmNs::FieldIdStorageType<862>: static fieldId = 862 } }, { index = std::tuple containing = { [1] = 0x564d50 }, data = { = { value = 0 }, members of DcgmNs::FieldIdStorageType<862>: static fieldId = 862 } }, { index = std::tuple containing = { [1] = 0x564e20 }, data = { = { value = 0 }, members of DcgmNs::FieldIdStorageType<862>: static fieldId = 862 } }, { index = std::tuple containing = { [1] = 0x53df00 }, data = { = { value = 0 }, members of DcgmNs::FieldIdStorageType<862>: static fieldId = 862 } }, { index = std::tuple containing = { [1] = 0x565090 }, data = { = { value = 0 }, members of DcgmNs::FieldIdStorageType<862>: static fieldId = 862 } }} } ```

This is more or less expected.

It seems something breaks for "physical id":

  1. We see the backtrace requests "/{nvswitch}/id/phys_id"
gdb bt at that point for `phys id ``` (gdb) bt #0 DcgmNs::DcgmNvSwitchManager::UpdateFields, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*)#1}::operator()(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*) const (__closure=0x0, indicies#0=0x564ef0, indicies#1=0 '\000', rc=11 '\v', in=140737298543232, dest=0x716a20) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162 #1 0x00007ffff4eeb858 in DcgmNs::DcgmNvSwitchManager::UpdateFields, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*)#1}::_FUN(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*) () at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:138 #2 0x00007ffff4b9b456 in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2 #3 0x00007ffff4b4c0a3 in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2 #4 0x00007ffff4b4b6bf in ?? () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2 #5 0x00007ffff4b54b80 in nscq_session_path_observe () from /lib/x86_64-linux-gnu/libnvidia-nscq.so.2 #6 0x00007ffff4f636ca in nscq_session_path_observe (session=0x7681b0, path=0x7ffff4fed8d0 "/{nvswitch}/id/phys_id", callback=0x7ffff4eeb813 , false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*)#1}::_FUN(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*)>, data=0x7ffff4afb280, flags=0) at /srv/DCGM/sdk/nvidia/nscq/dlwrap/dlwrap.c:131 #7 0x00007ffff4eeb98d in DcgmNs::DcgmNvSwitchManager::UpdateFields, false, nscq_uuid_t*, unsigned char> (this=0x53cb10, fieldId=863, buf=..., entities=std::vector of length 1, capacity 1 = {...}, now=1712069504430580) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:167 #8 0x00007ffff4ec28a5 in DcgmNs::DcgmNvSwitchManager::UpdateFields (this=0x53cb10, nextUpdateTime=@0x7ffff4afb528: 1712069518487312) at /srv/DCGM/modules/nvswitch/DcgmNvSwitchManager.cpp:592 #9 0x00007ffff4ea9b6a in DcgmNs::DcgmModuleNvSwitch::RunOnce (this=0x53c970) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:400 #10 0x00007ffff4ea9d6d in DcgmNs::DcgmModuleNvSwitch::TryRunOnce (this=0x53c970, forceRun=true) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:419 #11 0x00007ffff4ea8428 in operator() (__closure=0x7fffd4036cf0) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:273 #12 0x00007ffff4eaadae in std::__invoke_impl&>(std::__invoke_other, struct {...} &) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61 #13 0x00007ffff4eaabdb in std::__invoke_r&>(struct {...} &) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:111 #14 0x00007ffff4eaa937 in std::_Function_handler >::_M_invoke(const std::_Any_data &) (__functor=...) 
at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:291 #15 0x00007ffff4ebb3f4 in std::function::operator()() const (this=0x7fffd4036cf0) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:560 #16 0x00007ffff4eb92ad in std::__invoke_impl const&>(std::__invoke_other, std::function const&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61 #17 0x00007ffff4eb646b in std::__invoke const&>(std::function const&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:96 #18 0x00007ffff4eb1267 in std::invoke const&>(std::function const&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/functional:97 #19 0x00007ffff4eada36 in DcgmNs::Task::Task(std::__cxx11::basic_string, std::allocator >, std::function)::{lambda()#1}::operator()() const (__closure=0x7fffd4036cf0) at /srv/DCGM/common/Task.hpp:215 #20 0x00007ffff4ebb46a in std::__invoke_impl::Task(std::__cxx11::basic_string, std::allocator >, std::function)::{lambda()#1}&>(std::__invoke_other, DcgmNs::Task::Task(std::__cxx11::basic_string, std::allocator >, std::function)::{lambda()#1}&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61 #21 0x00007ffff4eb9474 in std::__invoke_r, DcgmNs::Task::Task(std::__cxx11::basic_string, std::allocator >, std::function)::{lambda()#1}&>(std::optional&&, (DcgmNs::Task::Task(std::__cxx11::basic_string, std::allocator >, std::function)::{lambda()#1}&)...) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:114 #22 0x00007ffff4eb6538 in std::_Function_handler (), DcgmNs::Task::Task(std::__cxx11::basic_string, std::allocator >, std::function)::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:291 #23 0x00007ffff4ec08da in std::function ()>::operator()() const (this=0x76bdf0) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/std_function.h:560 #24 0x00007ffff4ec0522 in std::__invoke_impl, std::function ()>&>(std::__invoke_other, std::function ()>&) (__f=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:61 #25 0x00007ffff4ec028e in std::__invoke ()>&>(std::function ()>&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/bits/invoke.h:96 #26 0x00007ffff4ebffd0 in std::invoke ()>&>(std::function ()>&) (__fn=...) at /opt/cross/x86_64-linux-gnu/include/c++/11.2.0/functional:97 #27 0x00007ffff4ebfc4f in DcgmNs::NamedBasicTask::Run (this=0x76bde0) at /srv/DCGM/common/Task.hpp:155 #28 0x00007ffff4eaeca9 in DcgmNs::TaskRunner::Run (this=0x53ca58, oneIteration=true) at /srv/DCGM/common/TaskRunner.hpp:432 #29 0x00007ffff4ea9e2c in DcgmNs::DcgmModuleNvSwitch::run (this=0x53c970) at /srv/DCGM/modules/nvswitch/DcgmModuleNvSwitch.cpp:433 #30 0x00007ffff4f6bba4 in DcgmThread::RunInternal (this=0x53c9b8) at /srv/DCGM/common/DcgmThread/DcgmThread.cpp:308 #31 0x00007ffff4f6a7c5 in dcgmthread_starter (parm=0x53c9b8) at /srv/DCGM/common/DcgmThread/DcgmThread.cpp:34 #32 0x00007ffff7bfa609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0 #33 0x00007ffff79af353 in clone () from /lib/x86_64-linux-gnu/libc.so.6 ```
  2. From that we would expect fieldId to be 863, but it is probably garbage: "32767"
gdb debug output for `*dest`: strange stuff ``` Thread 5 "nv-hostengine" hit Breakpoint 2, DcgmNs::DcgmNvSwitchManager::UpdateFields, false, nscq_uuid_t*, unsigned char>(unsigned short, DcgmFvBuffer&, std::vector > const&, long)::{lambda(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*)#1}::operator()(nscq_uuid_t*, unsigned char, signed char, unsigned long, DcgmNs::NscqDataCollector, false, nscq_uuid_t*, unsigned char> >*) const (__closure=0x0, indicies#0=0x564ef0, indicies#1=0 '\000', rc=11 '\v', in=140737298543232, dest=0x716a20) at /srv/DCGM/modules/nvswitch/FieldDefinitions.cpp:162 162 TempData item; $65 = { callCounter = 4108821041, fieldId = 32767, nscqPath = 0x712960 "SWX-F8F7054E-5993-EB8D-786D-B59D5303DB16", data = std::vector of length 0, capacity -1 } ```

The callCounter looks goofy, too. Most importantly, the nscqPath is not the expected "/{nvswitch}/id/phys_id" but rather what looks like a value (the switch UUID)?

-=-=-=-

It seems that there's something wrong with my /usr/lib/x86_64-linux-gnu/libnvidia-nscq.so.2, because it looks like the library is just calling this callback with broken info.

Lib info:

apt policy libnvidia-nscq-535 ``` libnvidia-nscq-535: Installed: 535.154.05-0ubuntu0.20.04.1 Candidate: 535.161.07-0ubuntu0.20.04.1 Version table: 535.161.08-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 535.161.07-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 535.161.07-0ubuntu0.20.04.1 600 500 http://de.archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages 500 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages 535.154.05-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages *** 535.154.05-0ubuntu0.20.04.1 100 100 /var/lib/dpkg/status 535.129.03-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 535.104.12-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 535.104.05-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 535.86.10-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages 535.54.03-1 580 580 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64 Packages ```

So the problem is probably not DCGM but rather libnvidia-nscq?

superg commented 3 months ago

Hi, thank you for the details! Let me process this information and get back to you.

superg commented 3 months ago

@krono, I apologize for the long wait. We've managed to reproduce the issue on our side. While our call stack is different, the source of the problem is very likely the same, and your observation of a std::vector<> full of garbage supports that. I believe the fix we're working on will resolve it.

krono commented 3 months ago

thanks :)

krono commented 2 months ago

Hi @superg, any news, or a place where I can read up on this issue?

superg commented 2 months ago

Hi @krono, we have an internal tracking ticket for this issue and an assigned developer; this is still a work in progress.

The issue is with the callback signature (after all the template instantiations). We use:

    void callback(const nscq_uuid_t* device, nscq_rc_t rc, std::vector<nscq_error_t>, void* data)

whereas, for the given path type, NSCQ expects:

    void callback(const nscq_uuid_t* device, nscq_rc_t rc, const nscq_error_t error, void* data)

The callback code has to be rewritten to match the second signature.

krono commented 2 months ago

oh my.

Which component will need updating? DCGM or NSCQ?

superg commented 2 months ago

That's in DCGM.

krono commented 2 months ago

Thanks! I'll keep watching this space

krono commented 1 month ago

I now have a second machine that has fallen victim to this problem: an HGX-based system, similarly configured.

krono commented 2 weeks ago

Hey, any news?

superg commented 2 weeks ago

@krono, the issue is identified and we are working on a fix. The current ETA is August.