Open hanwen-pcluste opened 10 months ago
@hanwen-pcluste,
Hello, could you try the 3.3.1 version? There was a crash in one of the diag modules, and I wonder if that's exactly what you see here.
@nikkon-dev,
The error persists with 3.3.1:
Nov 21 14:02:32 ip-192-168-96-92 systemd: Started NVIDIA DCGM service.
Nov 21 14:02:33 ip-192-168-96-92 nv-hostengine: DCGM initialized
Nov 21 14:02:33 ip-192-168-96-92 nv-hostengine: Started host engine version 3.3.1 using port number: 5555
Nov 21 14:02:33 ip-192-168-96-92 kernel: dcgmi[4902]: segfault at 14849f9ec9b8 ip 00001484a12f25d9 sp 000014849f9ec9c0 error 6 in libdcgm.so.3.3.1[1484a1229000+27d000]
Nov 21 14:02:33 ip-192-168-96-92 kernel: Code: 90 4c 89 ef 41 be fd ff ff ff e8 52 a5 f5 ff e9 3e fe ff ff 0f 1f 44 00 00 4c 8d b5 00 b6 ea ff ba d0 49 15 00 31 f6 4c 89 f7 <e8> 72 6b f3 ff 89 9d 18 b6 ea ff 48 8d bd 20 b6 ea ff c7 85 00 b6
Nov 21 14:02:33 ip-192-168-96-92 systemd: Stopping NVIDIA DCGM service...
Nov 21 14:02:33 ip-192-168-96-92 kernel: NVRM serverFreeResourceTree: hObject 0x0 not found for client 0xc1d0000f
Nov 21 14:02:33 ip-192-168-96-92 systemd: Stopped NVIDIA DCGM service.
Nov 21 14:02:33 ip-192-168-96-92 slurmd: slurmd: error: prolog failed: rc:139 output:
Nov 21 14:02:33 ip-192-168-96-92 slurmd: slurmd: error: [job 2] prolog failed status=139:0
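For anyone triaging the log above, here is a minimal sketch of how the kernel segfault line and the prolog exit status can be decoded. It assumes the standard x86 page-fault error-code bits (bit 0: page present, bit 1: write, bit 2: user mode) and the usual shell convention that an exit status above 128 means "terminated by signal (status - 128)". The function names decode_segfault and exit_signal are hypothetical helpers, not part of DCGM or Slurm.

```python
import re
import signal

# Kernel segfault line copied from the log above (timestamp and host removed).
LINE = (
    "dcgmi[4902]: segfault at 14849f9ec9b8 ip 00001484a12f25d9 "
    "sp 000014849f9ec9c0 error 6 in libdcgm.so.3.3.1[1484a1229000+27d000]"
)

PATTERN = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>\d+) "
    r"in (?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Hypothetical helper: pull the useful numbers out of a kernel segfault line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a kernel segfault line")
    addr, ip, sp, base = (int(m[k], 16) for k in ("addr", "ip", "sp", "base"))
    err = int(m["err"])
    return {
        "lib": m["lib"],
        # Offset of the faulting instruction inside the mapped library;
        # this is the value to look up with symbol tools such as addr2line.
        "offset": ip - base,
        # Distance of the faulting address from the stack pointer.
        "addr_minus_sp": addr - sp,
        # x86 page-fault error-code bits (assumption: x86-64 host).
        "page_present": bool(err & 1),
        "write": bool(err & 2),
        "user_mode": bool(err & 4),
    }


def exit_signal(status):
    """Hypothetical helper: map a shell-style exit status above 128 to a signal name."""
    return signal.Signals(status - 128).name


info = decode_segfault(LINE)
print(info["lib"], hex(info["offset"]), info["addr_minus_sp"])
print("write:", info["write"], "user:", info["user_mode"], "present:", info["page_present"])
print("rc 139 ->", exit_signal(139))
```

The offset into libdcgm.so.3.3.1 only maps to a function name if matching symbols are available, so it complements a debug build rather than replacing one.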
@hanwen-pcluste,
Unfortunately, I'm not able to reproduce this segfault in our environment. Can this command be run with a debug build under gdb to collect call stacks?
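A rough sketch of what such a gdb run could look like, written as a small Python helper that only builds and prints the command line. The gdb options used (--batch, -ex, --args) are standard; the GPU index 0 and run level 2 passed to dcgmi diag are placeholder values chosen for illustration, and gdb_backtrace_command is a hypothetical helper name.

```python
import shlex


def gdb_backtrace_command(program_args):
    """Hypothetical helper: wrap a command so gdb runs it non-interactively
    and prints a backtrace for every thread once the program stops."""
    return [
        "gdb",
        "--batch",                          # exit after the -ex commands finish
        "-ex", "run",                       # start the program under gdb
        "-ex", "thread apply all bt full",  # dump all call stacks after the crash
        "--args", *program_args,            # everything after --args is the program
    ]


# Placeholder arguments: GPU index 0 and run level 2 are assumptions.
cmd = gdb_backtrace_command(["dcgmi", "diag", "-i", "0", "-r", "2"])
print(shlex.join(cmd))
```

Without debug symbols for libdcgm.so the backtrace will mostly show raw addresses, which is why a debug build matters here.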
@nikkon-dev,
We can try. We currently install DCGM from the NVIDIA repositories. Would you be able to provide an RPM or DEB package (we can use either RHEL derivatives or Ubuntu) built with debug flags?
Hello!
We are from the AWS ParallelCluster team.
After the 3.3.0 release, we see the following errors in /var/log/messages when running dcgmi diag -i $0 -r $2:
Version 3.2.6 does not have the problem.
Thank you!