charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

NAMD occasionally crashes on Frontera with mpi-smp builds #2849

Open nitbhat opened 4 years ago

nitbhat commented 4 years ago

v6.10.1 runs successfully.

However, on the master branch, it looks like commit dd8b5a2df5c0822e3c398fc069e852fba1385189 introduced this bug. The crash is caused by a message that has an incorrect zcMsgType: because of the wrong type, the message is run through CkUnpackRdmaPtrs and crashes inside the pup function.
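
To illustrate the suspected failure mode, here is a toy model, not Charm++ code. Every name in it (ToyZcType, ToyMessage, unpackZcHeader, processToy) is hypothetical, and the payload layout (a 32-bit buffer count followed by 64-bit sizes) is an assumption made only for the sketch. It shows how a receiver that trusts a wrong type tag ends up interpreting ordinary payload bytes as zero-copy metadata. The sketch bounds-checks the read and reports the problem instead of crashing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a zero-copy type tag; not the real Charm++ enum.
enum class ToyZcType : std::uint8_t { Regular = 0, ZeroCopy = 1 };

struct ToyMessage {
    ToyZcType type;                      // tag the receiver trusts
    std::vector<unsigned char> payload;  // raw message bytes
};

enum class Outcome { HandledRegular, HandledZeroCopy, RejectedCorrupt };

// Build a well-formed zero-copy payload: [uint32 count][count x uint64 size].
ToyMessage makeZeroCopy(const std::vector<std::uint64_t>& sizes) {
    assert(sizes.size() <= UINT32_MAX);
    const std::uint32_t count = static_cast<std::uint32_t>(sizes.size());
    ToyMessage msg{ToyZcType::ZeroCopy, {}};
    msg.payload.resize(sizeof(count) + sizes.size() * sizeof(std::uint64_t));
    std::memcpy(msg.payload.data(), &count, sizeof(count));
    if (!sizes.empty()) {
        std::memcpy(msg.payload.data() + sizeof(count), sizes.data(),
                    sizes.size() * sizeof(std::uint64_t));
    }
    return msg;
}

// Build an ordinary message whose payload is just application bytes.
ToyMessage makeRegular(const std::string& text) {
    return ToyMessage{ToyZcType::Regular,
                      std::vector<unsigned char>(text.begin(), text.end())};
}

// Parse the zero-copy header. Returns std::nullopt when the claimed buffer
// count does not fit in the payload, which is what happens when ordinary
// bytes are misread as a count.
std::optional<std::vector<std::uint64_t>> unpackZcHeader(const ToyMessage& msg) {
    const std::vector<unsigned char>& p = msg.payload;
    std::uint32_t count = 0;
    if (p.size() < sizeof(count)) return std::nullopt;
    std::memcpy(&count, p.data(), sizeof(count));
    const std::size_t available = (p.size() - sizeof(count)) / sizeof(std::uint64_t);
    if (count > available) return std::nullopt;  // an unchecked read would overrun here
    std::vector<std::uint64_t> sizes(count);
    if (count > 0) {
        std::memcpy(sizes.data(), p.data() + sizeof(count),
                    count * sizeof(std::uint64_t));
    }
    return sizes;
}

// Dispatch on the tag, as a receiver would before invoking the entry method.
Outcome processToy(const ToyMessage& msg) {
    if (msg.type != ToyZcType::ZeroCopy) return Outcome::HandledRegular;
    return unpackZcHeader(msg) ? Outcome::HandledZeroCopy : Outcome::RejectedCorrupt;
}
```

In this model, a message built with makeRegular and then re-tagged as ToyZcType::ZeroCopy is rejected by processToy, because its first four bytes decode to a buffer count far larger than the payload. Without the bounds check, the same read would run past the end of the buffer, which is consistent with a segfault inside an unpacking routine.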

------------- Processor 821 Exiting: Caught Signal ------------
Reason: Segmentation fault
[821] Stack Traceback:
  [821:0] namd2 0xf26627
  [821:1] libpthread.so.0 0x2b48d570e5d0
  [821:2] libc.so.6 0x2b48d7b58070
  [821:3] namd2 0xf48eec PUP::fromMem::bytes(void*, unsigned long, unsigned long, PUP::dataType)
  [821:4] namd2 0xea361a CkUnpackRdmaPtrs(char*)
  [821:5] namd2 0xe333a1 CkUnpackMessage(envelope**)
  [821:6] namd2 0xe3670a _processHandler(void*, CkCoreState*)
  [821:7] namd2 0xf27ab1 CsdScheduleForever
  [821:8] namd2 0xf27d45 CsdScheduler
  [821:9] namd2 0xf25b52
  [821:10] namd2 0xf25c38
  [821:11] libpthread.so.0 0x2b48d5706dd5
  [821:12] libc.so.6 0x2b48d7b0002d clone
Abort(1) on node 63 (rank 63 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 63
register.h> CkRegisteredInfo<40,> called with invalid index 170 (should be less than 0)

ericjbohm commented 3 years ago

Is this still an issue?