NVIDIA / DCGM

NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs
Apache License 2.0

New segmentation fault from version v3.3.0 #134

Open hanwen-pcluste opened 10 months ago

hanwen-pcluste commented 10 months ago

Hello!

We are from AWS ParallelCluster team.

After the 3.3.0 release, we noticed the following errors in /var/log/messages when running dcgmi diag -i $0 -r $2:

Nov 20 17:42:15 ip-192-168-94-71 systemd: Started NVIDIA DCGM service.
Nov 20 17:42:15 ip-192-168-94-71 nv-hostengine: DCGM initialized
Nov 20 17:42:15 ip-192-168-94-71 nv-hostengine: Started host engine version 3.3.0 using port number: 5555
Nov 20 17:43:55 ip-192-168-94-71 kernel: dcgmi[5133]: segfault at 14f98a5179b8 ip 000014f98be1d529 sp 000014f98a5179c0 error 6 in libdcgm.so.3.3.0[14f98bd54000+27c000]
Nov 20 17:43:55 ip-192-168-94-71 kernel: Code: 90 4c 89 ef 41 be fd ff ff ff e8 52 a5 f5 ff e9 3e fe ff ff 0f 1f 44 00 00 4c 8d b5 00 b6 ea ff ba d0 49 15 00 31 f6 4c 89 f7 <e8> 22 6c f3 ff 89 9d 18 b6 ea ff 48 8d bd 20 b6 ea ff c7 85 00 b6
Nov 20 17:43:55 ip-192-168-94-71 systemd: Stopping NVIDIA DCGM service...
Nov 20 17:43:55 ip-192-168-94-71 kernel: NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d0000f
Nov 20 17:43:55 ip-192-168-94-71 systemd: Stopped NVIDIA DCGM service.
Nov 20 17:43:55 ip-192-168-94-71 slurmd: slurmd: error: prolog failed: rc:139 output:
Nov 20 17:43:55 ip-192-168-94-71 slurmd: slurmd: prolog for job 1 ran for 100 seconds
Nov 20 17:43:55 ip-192-168-94-71 slurmd: slurmd: error: [job 1] prolog failed status=139:0
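For readers triaging the log above, the kernel's segfault line can be decoded mechanically. The following is a minimal sketch, not part of the original report: the regular expression and the helper name `parse_segfault` are illustrative assumptions, and the bit meanings are those of the x86 page-fault error code. It computes where the faulting instruction pointer sits inside the libdcgm.so mapping, which is the kind of value a symbolizer would need once a build with debug symbols is available.

```python
import re

# Matches the x86 user-space segfault report format seen in the log above.
# The pattern and helper name are illustrative, not part of DCGM.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<module>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Extract the module-relative offset and fault flags from one log line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    addr, ip, sp, base = (int(m[k], 16) for k in ("addr", "ip", "sp", "base"))
    err = int(m["err"], 16)  # the kernel prints the error code in hex
    return {
        "process": m["comm"],
        "module": m["module"],
        "offset": ip - base,          # ip relative to the start of the mapping
        "sp_minus_addr": sp - addr,   # distance of the fault below the stack pointer
        "not_present": not (err & 1),  # bit 0 clear: page was not present
        "write": bool(err & 2),        # bit 1 set: the access was a write
        "user_mode": bool(err & 4),    # bit 2 set: fault happened in user mode
    }


line = (
    "Nov 20 17:43:55 ip-192-168-94-71 kernel: dcgmi[5133]: segfault at "
    "14f98a5179b8 ip 000014f98be1d529 sp 000014f98a5179c0 error 6 in "
    "libdcgm.so.3.3.0[14f98bd54000+27c000]"
)
info = parse_segfault(line)
print(info["process"], info["module"], hex(info["offset"]), info["sp_minus_addr"])
```

For this line the offset works out to 0xc9529, and the faulting address is 8 bytes below the stack pointer on a user-mode write to a non-present page. That is consistent with a failed write to the stack, though the log alone does not establish the cause. The offset is relative to the mapping printed by the kernel and may need adjusting by that mapping's file offset before it is passed to a symbolizer.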

Version 3.2.6 does not have this problem.

Thank you!

nikkon-dev commented 10 months ago

@hanwen-pcluste,

Hello, could you try the 3.3.1 version? There was a crash in one of the diag modules, and I wonder if that's exactly what you see here.

hanwen-pcluste commented 10 months ago

@nikkon-dev,

The error persists with 3.3.1:

Nov 21 14:02:32 ip-192-168-96-92 systemd: Started NVIDIA DCGM service.
Nov 21 14:02:33 ip-192-168-96-92 nv-hostengine: DCGM initialized
Nov 21 14:02:33 ip-192-168-96-92 nv-hostengine: Started host engine version 3.3.1 using port number: 5555
Nov 21 14:02:33 ip-192-168-96-92 kernel: dcgmi[4902]: segfault at 14849f9ec9b8 ip 00001484a12f25d9 sp 000014849f9ec9c0 error 6 in libdcgm.so.3.3.1[1484a1229000+27d000]
Nov 21 14:02:33 ip-192-168-96-92 kernel: Code: 90 4c 89 ef 41 be fd ff ff ff e8 52 a5 f5 ff e9 3e fe ff ff 0f 1f 44 00 00 4c 8d b5 00 b6 ea ff ba d0 49 15 00 31 f6 4c 89 f7 <e8> 72 6b f3 ff 89 9d 18 b6 ea ff 48 8d bd 20 b6 ea ff c7 85 00 b6
Nov 21 14:02:33 ip-192-168-96-92 systemd: Stopping NVIDIA DCGM service...
Nov 21 14:02:33 ip-192-168-96-92 kernel: NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d0000f
Nov 21 14:02:33 ip-192-168-96-92 systemd: Stopped NVIDIA DCGM service.
Nov 21 14:02:33 ip-192-168-96-92 slurmd: slurmd: error: prolog failed: rc:139 output:
Nov 21 14:02:33 ip-192-168-96-92 slurmd: slurmd: error: [job 2] prolog failed status=139:0
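The `rc:139` reported by slurmd in both logs lines up with the kernel segfault message if the prolog's exit status follows the common shell convention of 128 plus the signal number. That convention is an assumption about how the prolog script reports status, not something the logs state. A minimal sketch of the decoding, with a hypothetical helper name:

```python
import signal


def describe_exit_status(rc):
    """Return the signal name if rc follows the shell's 128+N convention, else None."""
    if rc > 128:
        try:
            return signal.Signals(rc - 128).name
        except ValueError:
            # rc - 128 is not a valid signal number on this platform
            return None
    return None


print(describe_exit_status(139))
```

On Linux, signal 11 is SIGSEGV, so a status of 139 matches a process that was killed by a segmentation fault.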

nikkon-dev commented 9 months ago

@hanwen-pcluste,

Unfortunately, I'm not able to reproduce this segfault in our environment. Can this command be run with a debug build under gdb to collect call stacks?
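As a rough illustration of the request above, one way to collect call stacks is to run the crashing process non-interactively under gdb and dump every thread's backtrace when it stops. The sketch below only builds and prints such a command line. The helper name is hypothetical, and the GPU id `0` and run level `2` are placeholder values chosen for illustration, since the original report passes them as the variables `$0` and `$2`. The target is dcgmi because the kernel log attributes the segfault to the dcgmi process.

```python
import shlex


def gdb_backtrace_command(target_argv):
    """Build a gdb invocation that runs the target and prints all thread backtraces."""
    return [
        "gdb",
        "-batch",                      # exit after the listed commands finish
        "-ex", "run",                  # start the program under the debugger
        "-ex", "thread apply all bt",  # after it stops, dump every thread's stack
        "--args", *target_argv,        # everything after --args is the target command
    ]


# Placeholder arguments: the real GPU id and run level come from the caller's script.
cmd = gdb_backtrace_command(["dcgmi", "diag", "-i", "0", "-r", "2"])
print(shlex.join(cmd))
```

The printed command can be pasted into a shell on an affected node. Backtraces will only show useful function names if debug symbols for libdcgm.so are installed.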

jdeamicis commented 9 months ago

@nikkon-dev,

We can try. We are currently using the NVIDIA repositories to install the DCGM tool. Would you be able to provide an RPM or DEB (we can use either RHEL derivatives or Ubuntu) built with debug flags?