charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

MSA Examples Failing #3561

Open jszaday opened 2 years ago

jszaday commented 2 years ago

Two of the MSA examples are broken.

examples/multiphaseSharedArrays/matmul does not compile. After superficial fixes to get it building, it crashes with a segmentation fault:

Running as 1 OS processes: t2d 2 1048576 100 500 100 1
charmrun> /usr/bin/setarch x86_64 -R mpirun -np 1 t2d 2 1048576 100 500 100 1
Charm++> Running in non-SMP mode: 1 processes (PEs)
Converse/Charm++ Commit ID: v7.1.0-devel-132-g2d58c2fb7
Charm++ built with internal error checking enabled.
Do not use for performance benchmarking (build without --enable-error-checking to do so).
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (1 sockets x 4 cores x 2 PUs = 8-way SMP)
Charm++> cpu topology info is gathered in 0.102 seconds.
[cordelia:160910:0:160910] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x2e0)
1   100 500 500 100 2   1048576 U   0.047026    5000    1   cordelia.local
==== backtrace (tid: 160910) ====
 0  /home/szaday2/workspace/ucx/build/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7ffff7dae534]
 1  /home/szaday2/workspace/ucx/build/lib/libucs.so.0(+0x2d76f) [0x7ffff7dae76f]
 2  /home/szaday2/workspace/ucx/build/lib/libucs.so.0(+0x2da56) [0x7ffff7daea56]
 3  /lib/x86_64-linux-gnu/libc.so.6(+0x46520) [0x7ffff784e520]
 4  t2d(_ZN14MSA_CacheGroupId12DefaultEntryIdLb0EELj5000EE10accessPageEj16MSA_Page_Fault_t+0x1a) [0x4b638a]
 5  t2d(_ZN17CkIndex_TestArray22_callthr_Kontinue_voidEP12CkThrCallArg+0x3f8) [0x4ac5f8]
 6  t2d(CthStartThread+0x12) [0x5e68e2]
 7  t2d(make_fcontext+0x2f) [0x5e6d5f]
=================================
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node cordelia exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

real    0m2.385s
user    0m0.077s
sys     0m0.043s
make: *** [Makefile:52: test] Error 139

At the time of the failure, the state of the cache group (MSA_CacheGroup::pageTable in particular) seems to be invalid.

Likewise, examples/multiphaseSharedArrays/moldyn does not compile. After superficial fixes to get it building, it hangs at run time.

BJWiley233 commented 2 years ago

How do you even build the msa library along with LIBS? Do you put them in quotes with the build script: ./build "target1 target2 ..." as in ./build "LIBS msa"?

jszaday commented 2 years ago

I am not sure how to build MSA through the build script.

I typically run make from src/libs/ck-libs/multiphaseSharedArrays/, which makes the library available to charmc as -module msa.