charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.
Apache License 2.0

Group section failure #3479

Open · slm960323 opened 3 years ago

slm960323 commented 3 years ago

https://github.com/UIUC-PPL/charm/tree/debug_groupsection

I want to implement a case where two PEs (PE0 and PE2) each create one section involving all chare elements. Each section does a broadcast call to recvMsg, reduces back to its leader, increments the iteration count, and then repeats. The code compiles, but the run hangs.
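For reference, the structure is roughly the following. This is a minimal sketch with hypothetical names (RecvGroup, IterMsg, doneIter, NITER, and the groupsection.* file names), not the actual code on the debug_groupsection branch; the group-section creation call follows the pattern in the Charm++ manual, so the exact signature may differ between versions:

```cpp
// Minimal sketch (hypothetical names, not the branch code); link with -module CkMulticast.
// .ci (abridged):
//   mainmodule groupsection {
//     message IterMsg;
//     mainchare Main { entry Main(CkArgMsg*); };
//     group RecvGroup {
//       entry RecvGroup(CkGroupID mCastGrpId);
//       entry void start();
//       entry void recvMsg(IterMsg* msg);
//       entry void doneIter(CkReductionMsg* msg);
//     };
//   }
#include <map>
#include <vector>
#include "ckmulticast.h"
#include "groupsection.decl.h"

#define NITER 10  // hypothetical iteration bound

// Multicast payloads must inherit from CkMcastBaseMsg.
class IterMsg : public CkMcastBaseMsg, public CMessage_IterMsg {
public:
  int iter, leaderPe;
  IterMsg(int it, int pe) : iter(it), leaderPe(pe) {}
};

class RecvGroup : public CBase_RecvGroup {
  CkGroupID mCastGrpId;                  // CkMulticastMgr group id, passed from Main
  CProxySection_RecvGroup sect;          // only meaningful on the leader PEs
  std::map<int, CkSectionInfo> cookies;  // one section cookie per leader
  int iter;
public:
  RecvGroup(CkGroupID mgid) : mCastGrpId(mgid), iter(0) {}

  // Invoked on the two leaders (PE 0 and PE 2) only.
  void start() {
    std::vector<int> pes(CkNumPes());
    for (int i = 0; i < CkNumPes(); ++i) pes[i] = i;
    // Group section over all PEs; creation signature as in the manual's
    // group-section example -- adjust for your Charm++ version if needed.
    sect = CProxySection_RecvGroup(thisgroup, pes.data(), pes.size());
    CkMulticastMgr* mg = CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch();
    sect.ckSectionDelegate(mg);  // route multicasts through CkMulticast
    sect.recvMsg(new IterMsg(1, CkMyPe()));
  }

  // Section broadcast target; every member contributes to a section reduction.
  void recvMsg(IterMsg* msg) {
    // Each PE belongs to both leaders' sections, so keep a cookie per leader.
    CkSectionInfo& sid = cookies[msg->leaderPe];
    CkGetSectionInfo(sid, msg);  // refresh the cookie from this multicast
    CkCallback cb(CkIndex_RecvGroup::doneIter(NULL), thisProxy[msg->leaderPe]);
    int one = 1;
    CProxy_CkMulticastMgr(mCastGrpId).ckLocalBranch()
        ->contribute(sizeof(int), &one, CkReduction::sum_int, sid, cb);
    delete msg;
  }

  // Runs on the leader once the section reduction completes; next round.
  void doneIter(CkReductionMsg* rmsg) {
    delete rmsg;
    if (++iter < NITER) sect.recvMsg(new IterMsg(iter + 1, CkMyPe()));
    else CkExit();  // real code would coordinate both leaders before exiting
  }
};

class Main : public CBase_Main {
public:
  Main(CkArgMsg* m) {
    delete m;
    CkGroupID mgid = CProxy_CkMulticastMgr::ckNew();
    CProxy_RecvGroup grp = CProxy_RecvGroup::ckNew(mgid);
    grp[0].start();  // the two section leaders from the run below
    grp[2].start();
  }
};

#include "groupsection.def.h"
```

One detail the sketch has to handle: every PE is a member of both leaders' sections, so each member keeps a separate CkSectionInfo cookie per leader rather than a single shared one.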

slm960323 commented 3 years ago

Charm++> Running on MPI version: 3.1
Charm++> level of thread support used: -1 (desired: 0)
Charm++> Running in SMP mode: 4 processes, 1 worker threads (PEs) + 1 comm threads per process, 4 PEs total
Charm++> The comm. thread both sends and receives messages
Converse/Charm++ Commit ID: v7.1.0-devel-37-ga84e4523f
Charm++ built with internal error checking enabled. Do not use for performance benchmarking (build without --enable-error-checking to do so).
Charm++: Tracemode Projections enabled.
Trace: traceroot: /home/simengl2/School/cp_charm/examples/charm++/groupsection/trace/check_proj
Isomalloc> Synchronized global address space.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 hosts (2 sockets x 20 cores x 1 PUs = 40-way SMP)
Charm++> cpu topology info is gathered in 0.002 seconds.
Numpes: 4
Leader [0] :: broadcast groupId 10 CkMulticastMgr 0x7fffdc0403d0
Leader [2] :: broadcast groupId -1073741825 CkMulticastMgr 0x7fffdc04a3b0
PE [2] :: sectionBcastMsg received from Leader [0] @ iteration 1 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [0] :: sectionBcastMsg received from Leader [0] @ iteration 1 groupId 10 CkMulticastMgr 0x7fffdc0403d0
PE [1] :: sectionBcastMsg received from Leader [0] @ iteration 1 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [3] :: sectionBcastMsg received from Leader [0] @ iteration 1 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
Leader [0] Iteration 1
PE [0] :: sectionBcastMsg received from Leader [0] @ iteration 2 groupId 10 CkMulticastMgr 0x7fffdc0403d0
PE [0] :: sectionBcastMsg received from Leader [2] @ iteration 1 groupId -1073741825 CkMulticastMgr 0x7fffdc03fdc0
Leader [0] Iteration 2
PE [2] :: sectionBcastMsg received from Leader [0] @ iteration 2 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [2] :: sectionBcastMsg received from Leader [2] @ iteration 1 groupId -1073741825 CkMulticastMgr 0x7fffdc04a3b0
PE [1] :: sectionBcastMsg received from Leader [0] @ iteration 2 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [1] :: sectionBcastMsg received from Leader [2] @ iteration 1 groupId -1073741825 CkMulticastMgr 0x7fffdc03f970
PE [3] :: sectionBcastMsg received from Leader [0] @ iteration 2 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [3] :: sectionBcastMsg received from Leader [2] @ iteration 1 groupId -1073741825 CkMulticastMgr 0x7fffdc03f970
PE [0] :: sectionBcastMsg received from Leader [0] @ iteration 3 groupId 10 CkMulticastMgr 0x7fffdc0403d0
PE [2] :: sectionBcastMsg received from Leader [0] @ iteration 3 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [3] :: sectionBcastMsg received from Leader [0] @ iteration 3 groupId 10 CkMulticastMgr 0x7fffdc03e6c0
PE [1] :: sectionBcastMsg received from Leader [0] @ iteration 3 groupId 10 CkMulticastMgr 0x7fffdc03e6c0