charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

NAMD PAMI build on Summit does not print stack trace on segfault or abort #2339

Open jcphill opened 5 years ago

jcphill commented 5 years ago

mpi-linux-ppc64le-smp provides a useful stack trace, e.g.:

[a23n04:66475] *** Process received signal ***
[a23n04:66475] Signal: Segmentation fault (11)
[a23n04:66475] Signal code: Address not mapped (1)
[a23n04:66475] Failing at address: 0xa0
[a23n04:66475] [ 0] [0x2000000504d8]
[a23n04:66475] [ 1] namd2/Linux-POWER-xlC.cudamemoptmpi/namd2(CmiSetCPUAffinity+0x60)[0x1122c440]
[a23n04:66475] [ 2] namd2/Linux-POWER-xlC.cudamemoptmpi/namd2(CmiInitCPUAffinity+0x47c)[0x1122d57c]
[a23n04:66475] [ 3] namd2/Linux-POWER-xlC.cudamemoptmpi/namd2(_Z10_initCharmiPPc+0x2390)[0x110374b0]
[a23n04:66475] [ 4] namd2/Linux-POWER-xlC.cudamemoptmpi/namd2(_Z10slave_initiPPc+0x1e0)[0x10369320]
[a23n04:66475] [ 5] namd2/Linux-POWER-xlC.cudamemoptmpi/namd2[0x11202f8c]
[a23n04:66475] [ 6] namd2/Linux-POWER-xlC.cudamemoptmpi/namd2[0x11205b10]
[a23n04:66475] [ 7] /lib64/libpthread.so.0(+0x8b94)[0x2000001f8b94]
[a23n04:66475] [ 8] /lib64/libc.so.6(clone+0xe4)[0x2000088a85f4]
[a23n04:66475] *** End of error message ***
ERROR: One or more process (first noticed rank 1) terminated with signal 11

pami-linux-ppc64le-smp provides only:

ERROR: One or more process (first noticed rank 8) terminated with signal 11

It looks like the MPI library installs its own signal handler, and that handler (rather than Charm++) generates the stack trace.
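For illustration, a handler of the kind described above could look roughly like the following. This is a minimal sketch, assuming glibc's execinfo.h is available; it is not the MPI library's or Charm++'s actual code, and dump_stack_trace, fatal_signal_handler and install_fatal_signal_handlers are hypothetical names.

```cpp
#include <cassert>
#include <csignal>
#include <execinfo.h>  // glibc-specific: backtrace(), backtrace_symbols_fd()
#include <unistd.h>

// Hypothetical helper: write up to 64 frames of the calling thread's stack
// to the file descriptor fd and return the number of frames captured.
int dump_stack_trace(int fd) {
    assert(fd >= 0);
    void* frames[64];
    int n = backtrace(frames, 64);
    // Writes symbolized frames straight to fd without allocating a string array.
    backtrace_symbols_fd(frames, n, fd);
    return n;
}

// Hypothetical handler: print a banner and the stack, then re-raise the signal.
// Note: backtrace() is not formally async-signal-safe, so this is best-effort.
extern "C" void fatal_signal_handler(int sig) {
    static const char banner[] = "*** Process received fatal signal ***\n";
    ssize_t ignored = write(STDERR_FILENO, banner, sizeof(banner) - 1);
    (void)ignored;
    dump_stack_trace(STDERR_FILENO);
    // SA_RESETHAND (set below) has already restored the default action, so
    // re-raising terminates the process and the launcher still sees the signal.
    raise(sig);
}

// Install the handler for SIGSEGV and SIGABRT; returns true on success.
bool install_fatal_signal_handlers() {
    struct sigaction sa = {};
    sa.sa_handler = fatal_signal_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;  // one-shot: revert to default on first delivery
    return sigaction(SIGSEGV, &sa, nullptr) == 0 &&
           sigaction(SIGABRT, &sa, nullptr) == 0;
}
```

Calling install_fatal_signal_handlers() once early in startup would be enough for this sketch; frames only show function names if the binary exports its symbols (e.g. linked with -rdynamic).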

jcphill commented 5 years ago

pamilrts does print a stack trace, at least post-startup. However, if you specify a replica count that doesn't divide the node count, you get a segfault with no error message or stack trace.

jcphill commented 5 years ago

In src/arch/util/machine-common-core.C, CmiAbortHelper calls CmiPrintStackTrace().

jcphill commented 5 years ago

But in src/arch/pami/machine.C we have:

void CmiAbort(const char * message) {
    CmiError("------------- Processor %d Exiting: Called CmiAbort ------------\n"
             "{snd:%d,rcv:%d} Reason: %s\n",CmiMyPe(),
             MSGQLEN(), ORECVS(), message);

    //CmiPrintStackTrace(0);
    //while (msgQueueLen > 0 || outstanding_recvs > 0) {
    //  AdvanceCommunications();
    //}
    //CmiBarrier();
    assert (0);
    CMI_NORETURN_FUNCTION_END
}
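One possible direction, sketched standalone: print the trace before terminating, and terminate with abort() rather than assert(0), since assert is compiled out when NDEBUG is defined. This is an assumption-laden illustration, not a patch to machine.C: abort_with_trace and signal_from_aborting_child are hypothetical names, and glibc's backtrace() stands in for CmiPrintStackTrace() so the example builds without Charm++.

```cpp
#include <cassert>
#include <csignal>
#include <cstdio>
#include <cstdlib>
#include <execinfo.h>  // glibc-specific: backtrace(), backtrace_symbols_fd()
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for an abort routine: print the reason, dump the
// stack, then terminate unconditionally.
[[noreturn]] void abort_with_trace(int pe, const char* message) {
    std::fprintf(stderr,
                 "------------- Processor %d Exiting: Called abort ------------\n"
                 "Reason: %s\n", pe, message);
    std::fflush(stderr);

    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);

    // Unlike assert(0), std::abort() is not removed in NDEBUG builds.
    std::abort();
}

// Run abort_with_trace in a forked child and report which signal killed it
// (-1 if the fork or wait failed, or the child was not killed by a signal).
int signal_from_aborting_child() {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) abort_with_trace(0, "example failure");
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : -1;
}
```

The fork in signal_from_aborting_child is only there so the behavior can be checked without killing the calling process; a real abort path would call the routine directly.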