charmplusplus / charm

The Charm++ parallel programming system. Visit https://charmplusplus.org/ for more information.

Support for partitions on netlrts-linux-x86_64-tcp builds #1796

Open. stwhite91 opened this issue 6 years ago

stwhite91 commented 6 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/1796


The test appears to finish but doesn't actually exit.

../../../bin/testrun  ./hello +p4 10 2 +partitions 2 ++local
Charmrun> scalable start enabled. 
Charmrun> started all node programs in 0.009 seconds.
Charm++> Running in non-SMP mode: numPes 4
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: 879601b
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: 879601b
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
Charm++> Running on 1 unique compute nodes (8-way SMP).
Charm++> cpu topology info is gathered in 0.000 seconds.
Running Hello on 2 processors for 10 elements
[0] Hello 0 created
[0] Hello 1 created
[0] Hello 2 created
[0] Hello 3 created
[0] Hello 4 created
[0] Hi[17] from element 0
[0] Hi[18] from element 1
[0] Hi[19] from element 2
[0] Hi[20] from element 3
[0] Hi[21] from element 4
[1] Hello 5 created
[1] Hello 6 created
[1] Hello 7 created
[1] Hello 8 created
[1] Hello 9 created
[1] Hi[22] from element 5
[1] Hi[23] from element 6
[1] Hi[24] from element 7
[1] Hi[25] from element 8
[1] Hi[26] from element 9
All done
[Partition 0][Node 0] End of program
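
For context on the log above: the +partitions 2 flag splits the 4 launched PEs into two independent replicas of the job, which is why the startup banner lines (CMA, Commit ID, scheduler, load balancer) each appear twice, presumably once per partition. Below is a minimal sketch of how a program can query the partition layout; it assumes the Converse calls CmiMyPartition() and CmiNumPartitions(), and the mainchare name, the .ci file it would need, and the output wording are purely illustrative (this is not the actual tests/charm++/hello code):

// Illustrative only: a mainchare that reports its partition.
// Assumes a matching .ci file declaring "mainchare Main" (not shown).
#include "charm++.h"

class Main : public CBase_Main {
 public:
  Main(CkArgMsg* m) {
    delete m;
    // Every partition runs its own copy of the mainchare, so under
    // "+p4 +partitions 2" this line should print once per partition.
    CkPrintf("Partition %d of %d, %d PEs in this partition\n",
             CmiMyPartition(), CmiNumPartitions(), CkNumPes());
    CkExit();
  }
};

Note that in the failing run only "[Partition 0][Node 0] End of program" is printed, which suggests partition 1 never reaches normal shutdown.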
karthiksenthil commented 5 years ago

Original date: 2018-02-14 16:27:42


I looked into this further with gdb and a debug build. It looks like the process for the second partition hangs in CPU topology initialization. This does not happen if there is only one PE per partition.

Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: v6.8.2-282-gddd93ef3a
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7267730 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
84      ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0  0x00007ffff7267730 in __poll_nocancel ()
    at ../sysdeps/unix/syscall-template.S:84
#1  0x0000000000631a0d in CheckSocketsReady (withDelayMs=0, output=1)
    at machine-tcp.c:151
#2  0x0000000000631d3b in CommunicationServerNet (sleepTime=0, where=2)
    at machine-tcp.c:243
#3  0x0000000000632aa7 in LrtsAdvanceCommunication (whileidle=0)
    at machine.c:1672
#4  0x000000000062e9be in AdvanceCommunication (whenidle=0)
    at machine-common-core.c:1392
#5  0x000000000062ec29 in CmiGetNonLocal () at machine-common-core.c:1562
#6  0x0000000000634f34 in CsdNextMessage (s=0x7fffffffe290) at convcore.c:1754
#7  0x0000000000635280 in CsdSchedulePoll () at convcore.c:1949
#8  0x000000000064b3b2 in LrtsInitCpuTopo (argv=0x7fffffffe808)
    at cputopology.C:604
#9  0x000000000064b60d in CmiInitCPUTopology (argv=0x7fffffffe808)
    at cputopology.C:694
#10 0x0000000000547a3c in _initCharm (unused_argc=2, argv=0x7fffffffe808)
    at init.C:1362
#11 0x000000000062e977 in ConverseRunPE (everReturn=0)
    at machine-common-core.c:1371
#12 0x000000000062e892 in ConverseInit (argc=4, argv=0x7fffffffe808, 
    fn=0x5473b5 <_initCharm(int, char**)>, usched=0, initret=0)
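
Reading the trace bottom-up: _initCharm() has called CmiInitCPUTopology(), which pumps the scheduler via CsdSchedulePoll(); that in turn drives the TCP machine layer (CommunicationServerNet -> CheckSocketsReady -> poll()), so the PE spins while it waits for the topology-gathering exchange to finish. Here is a rough sketch of that wait pattern, with invented names (this is not the real cputopology.C code), just to illustrate why an undelivered message within the second partition leaves the PE stuck under poll():

// Illustrative sketch only; the variable and handler names are made up.
static volatile int cpuTopoDone = 0;   // flipped once topology info from all PEs arrives

static void cpuTopoDoneHandler(void* /*msg*/) {
  cpuTopoDone = 1;
}

static void waitForCpuTopology() {
  // Contribute this PE's node/host info to the runtime (not shown), then wait.
  while (!cpuTopoDone) {
    // Pump the runtime: this ends up in LrtsAdvanceCommunication() and, for
    // the TCP layer, in poll() on the node's sockets (the frames at the top
    // of the backtrace). If the "done" message for this partition never
    // arrives, the loop never exits.
    CsdSchedulePoll();
  }
}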
stwhite91 commented 5 years ago

Original date: 2018-03-01 04:45:57


Please run this partitions test on v6.8.2 to see whether the recent changes to charmrun caused this issue.

karthiksenthil commented 5 years ago

Original date: 2018-03-01 19:29:40


I see the same hang for this test even with charm v6.8.2.

karthiksenthil commented 5 years ago

Original date: 2018-03-05 18:37:49


Gerrit patch to skip the partitions test for TCP builds: https://charm.cs.illinois.edu/gerrit/#/c/3811/ (commit: https://github.com/UIUC-PPL/charm/commit/17a36c6cca7c697457a2fbf533d8b4f89eb0fdb4)

evan-charmworks commented 5 years ago

Original date: 2018-03-05 18:42:28


The patch looks good as far as keeping autobuild from hanging any longer. I'm not sure we can call the issue resolved, though, since it only prevents our infrastructure from failing rather than fixing the underlying problem.

stwhite91 commented 5 years ago

Original date: 2018-03-06 05:50:46


Yes, let's leave the issue open, since the patch is just a workaround. I don't think the real issue is a release blocker for 6.9.0, though, since TCP builds are rare.