charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

Checkpoint/restart hangs in SMP mode #2613

Open minitu opened 4 years ago

minitu commented 4 years ago

With tests/charm++/chkpt, the code hangs when it is checkpointed and then restarted using 2 processes, each with 2 worker threads, on LC Lassen. The Makefile's test does not cover this configuration: it checkpoints with 2 processes but tests restart with only 1 process (once with 2 worker threads and once with 4 worker threads).

Checkpoint run:
jsrun -n2 -a1 -c2 -g1 -K1 -r2 ./hello ++ppn 2 +setcpuaffinity +showcpuaffinity

Restart run:
jsrun -n2 -a1 -c2 -g1 -K1 -r2 ./hello +restart log ++ppn 2 +setcpuaffinity +showcpuaffinity

Output:

choi18@lassen710:chkpt$ jsrun -n2 -a1 -c2 -g1 -K1 -r2 ./hello +restart log ++ppn 2 +setcpuaffinity +showcpuaffinity
Choosing optimized barrier algorithm name I0:HybridBinomial:SHMEM:P2P
Charm++> Running in SMP mode: 2 processes, 2 worker threads (PEs) + 0 comm threads per process, 4 PEs total
Charm++> There's no comm. thread. Work threads both send and receive messages
Converse/Charm++ Commit ID: v6.10.0-rc2-43-g90154c2
CharmLB> Load balancer assumes all CPUs are same.
Charm++> cpu affinity enabled.
[1] thread CPU affinity mask is 0x00000200
[0] thread CPU affinity mask is 0x00000100
[2] thread CPU affinity mask is 0x00000400
[3] thread CPU affinity mask is 0x00000800
Charm++> Running on 1 hosts (2 sockets x 20 cores x 4 PUs = 160-way SMP)
Charm++> cpu topology info is gathered in 0.002 seconds.
Received 1 arguments: { |./hello| }
Main's MigCtor. a=987(0x111c7b8c), b[0]=654(0x111c7b90), b[1]=321, old PE number 4
Main's PUPer. a=123(0x111c7b8c), b[0]=456(0x111c7b90), b[1]=789
CHello's PUPer. step=3.
[1] data on Group 1
[0] data on Group 0
[0] data on NOdeGroup 0
[0]CkRestartMain done. sending out callback.
[3] data on Group 3
[2] data on Group 2
[1] data on NOdeGroup 1
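The two-step reproduction above (checkpoint, then restart with the same 2 processes x 2 worker threads layout) could be scripted roughly as below. This is only a sketch: the jsrun options and Charm++ flags are copied verbatim from the report, while the function names and the DRY_RUN switch are hypothetical helpers introduced for illustration. By default it only prints the commands, so it can be inspected on a machine without jsrun.

```shell
#!/bin/sh
# Sketch of a reproduction helper for this report.
# The jsrun options and Charm++ flags are taken from the commands in the report.
# checkpoint_cmd, restart_cmd, run_step and DRY_RUN are hypothetical names.

JSRUN="jsrun -n2 -a1 -c2 -g1 -K1 -r2"
CHARM_FLAGS="++ppn 2 +setcpuaffinity +showcpuaffinity"

# Command line for the run that writes the checkpoint.
checkpoint_cmd() {
    echo "$JSRUN ./hello $CHARM_FLAGS"
}

# Command line for the run that restarts from the "log" checkpoint directory.
restart_cmd() {
    echo "$JSRUN ./hello +restart log $CHARM_FLAGS"
}

# Print the command when DRY_RUN is 1 (the default); otherwise execute it.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$1"
    else
        $1   # left unquoted on purpose so the string splits into arguments
    fi
}

# Restart is attempted only if the checkpoint step succeeds.
run_step "$(checkpoint_cmd)" && run_step "$(restart_cmd)"
```

Running it with DRY_RUN=0 from tests/charm++/chkpt would execute both steps in sequence; the second step is the one whose output is shown above.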

minitu commented 2 years ago

Still hangs on OLCF Summit.