Open stwhite91 opened 6 years ago
Original date: 2018-02-14 16:27:42
I looked into this further using gdb and a debug build. It looks like the process for the second partition hangs in CPU topology initialization. This does not happen when there is only one PE per partition.
Charm++> CMA enabled for within node transfers using the zerocopy API
Converse/Charm++ Commit ID: v6.8.2-282-gddd93ef3a
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
^C
Program received signal SIGINT, Interrupt.
0x00007ffff7267730 in __poll_nocancel ()
at ../sysdeps/unix/syscall-template.S:84
84 ../sysdeps/unix/syscall-template.S: No such file or directory.
(gdb) bt
#0 0x00007ffff7267730 in __poll_nocancel ()
at ../sysdeps/unix/syscall-template.S:84
#1 0x0000000000631a0d in CheckSocketsReady (withDelayMs=0, output=1)
at machine-tcp.c:151
#2 0x0000000000631d3b in CommunicationServerNet (sleepTime=0, where=2)
at machine-tcp.c:243
#3 0x0000000000632aa7 in LrtsAdvanceCommunication (whileidle=0)
at machine.c:1672
#4 0x000000000062e9be in AdvanceCommunication (whenidle=0)
at machine-common-core.c:1392
#5 0x000000000062ec29 in CmiGetNonLocal () at machine-common-core.c:1562
#6 0x0000000000634f34 in CsdNextMessage (s=0x7fffffffe290) at convcore.c:1754
#7 0x0000000000635280 in CsdSchedulePoll () at convcore.c:1949
#8 0x000000000064b3b2 in LrtsInitCpuTopo (argv=0x7fffffffe808)
at cputopology.C:604
#9 0x000000000064b60d in CmiInitCPUTopology (argv=0x7fffffffe808)
at cputopology.C:694
#10 0x0000000000547a3c in _initCharm (unused_argc=2, argv=0x7fffffffe808)
at init.C:1362
#11 0x000000000062e977 in ConverseRunPE (everReturn=0)
at machine-common-core.c:1371
#12 0x000000000062e892 in ConverseInit (argc=4, argv=0x7fffffffe808,
fn=0x5473b5 <_initCharm(int, char**)>, usched=0, initret=0)
Original date: 2018-03-01 04:45:57
Please run this partitions test on v6.8.2 to see whether the recent changes to charmrun caused this issue.
Original date: 2018-03-01 19:29:40
I see the same hang for this test even with Charm++ v6.8.2.
Original date: 2018-03-05 18:37:49
Gerrit patch to skip the partitions test for TCP builds: https://charm.cs.illinois.edu/gerrit/#/c/3811/ (GitHub commit: https://github.com/UIUC-PPL/charm/commit/17a36c6cca7c697457a2fbf533d8b4f89eb0fdb4)
Original date: 2018-03-05 18:42:28
The patch looks good as far as keeping autobuild from hanging. I'm not sure we can call the issue implemented, though, since the patch only keeps our infrastructure from failing rather than resolving the underlying problem.
Original date: 2018-03-06 05:50:46
Yes, let's leave the issue open, since the patch is just a workaround. I don't think the real issue is a release blocker for 6.9.0, though, since TCP builds are rare.
Original issue: https://charm.cs.illinois.edu/redmine/issues/1796
The test appears to finish but doesn't actually exit.