Closed jcphill closed 6 years ago
Original date: 2014-07-07 19:27:32
Core group decision:
Make "delay until main" an option. Also detect the condition where a topology query is made before the topology is initialized, and issue a meaningful error.
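A minimal sketch of the "detect and issue a meaningful error" half of this decision, assuming a toy topology table. `TopologyTable`, `initialize`, and `physicalNodeOf` are hypothetical names made up for illustration; this is not Charm++ code and does not reflect how the runtime stores topology data.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a runtime topology table (not a Charm++ type).
class TopologyTable {
public:
    // Called once the physical-node search has finished.
    void initialize(std::vector<int> peToNode) {
        peToNode_ = std::move(peToNode);
        ready_ = true;
    }

    bool ready() const { return ready_; }

    // Fails with a descriptive error when queried too early,
    // instead of reading uninitialized data.
    int physicalNodeOf(int pe) const {
        if (!ready_) {
            throw std::logic_error(
                "topology queried before initialization completed (PE " +
                std::to_string(pe) + ")");
        }
        return peToNode_.at(static_cast<std::size_t>(pe));
    }

private:
    std::vector<int> peToNode_;
    bool ready_ = false;
};
```

With a guard like this, an early query would surface as a readable message rather than a segfault during startup.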
Original date: 2014-07-28 16:06:25
A workaround (basically CcdCallOnCondition(CcdTOPOLOGY_AVAIL, ...)) has been checked into NAMD so this issue is no longer urgent, but it is a surprising behavior that should be fixed.
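To illustrate the idea behind the workaround, here is a self-contained toy model of "run this callback once a condition is raised". It deliberately avoids the Converse headers: `ConditionCallbacks`, `callOnCondition`, `raise`, and the `TOPOLOGY_AVAIL` enumerator are hypothetical names for this sketch, and the real `CcdCallOnCondition` signature and semantics may differ.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical condition id, named after the one mentioned in this thread.
enum Condition { TOPOLOGY_AVAIL };

// Toy registry of one-shot callbacks keyed by condition (not Converse code).
class ConditionCallbacks {
public:
    using Callback = std::function<void()>;

    // Queue fn to run the next time condition c is raised.
    void callOnCondition(Condition c, Callback fn) {
        pending_[c].push_back(std::move(fn));
    }

    // Run and clear every callback waiting on c; returns how many ran.
    std::size_t raise(Condition c) {
        std::vector<Callback> ready;
        ready.swap(pending_[c]);
        for (auto& fn : ready) fn();
        return ready.size();
    }

private:
    std::map<Condition, std::vector<Callback>> pending_;
};
```

In this model, the application would register its topology-dependent setup as a callback, and the runtime would raise the condition once the topology search completes.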
Original date: 2015-09-17 20:18:17
Ping. What is the status here? Is this bug real?
Original date: 2015-09-22 20:15:25
NAMD has its workaround, so not critical.
Original date: 2017-04-13 20:56:48
This issue https://charm.cs.illinois.edu/redmine/issues/1381 might also be related to this.
Original date: 2017-04-25 20:52:22
Based on the preference for MPI machine layer on OmniPath systems for now, we're deferring this.
Original date: 2017-06-22 14:34:15
This may have been addressed by https://charm.cs.illinois.edu/gerrit/#/c/2723/ https://github.com/UIUC-PPL/charm/commit/e9191d90ba87d917ad9ed202d0be06ebc98ab553 that synchronizes startup to ensure topology is available before reaching user code.
Original date: 2017-06-23 13:05:05
The above merged change doesn't change anything for bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer.
./charmrun +p2 ./hello ++mpiexec ++remote-shell ./mysrun
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.214 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-1005-g8b8bb11
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason:
Length mismatch!!
[0] Stack Traceback:
[0:0] CmiAbortHelper+0xb3 [0x613f78]
[0:1] CmiAbort+0x2d [0x613fb3]
[0:2] [0x618d3a]
[0:3] [0x618fda]
[0:4] [0x618928]
[0:5] [0x618843]
[0:6] [0x618c54]
[0:7] LrtsAdvanceCommunication+0x1a [0x61bd6e]
[0:8] [0x613dac]
[0:9] CmiGetNonLocal+0x75 [0x61402a]
[0:10] CsdNextMessage+0x9b [0x61e12d]
[0:11] CsdSchedulePoll+0x73 [0x61e471]
[0:12] LrtsInitCpuTopo+0x2e5 [0x634491]
[0:13] CmiInitCPUTopology+0x18 [0x634672]
[0:14] _Z10_initCharmiPPc+0x651 [0x534ad2]
[0:15] [0x613d66]
[0:16] ConverseInit+0x324 [0x613c82]
[0:17] main+0x3f [0x5327bc]
[0:18] __libc_start_main+0xf5 [0x2aaaab929b35]
[0:19] [0x52dab9]
Original date: 2017-06-23 14:11:27
Makes sense. Looking at bug #1381, the error occurs during low-level processing of messages sent in the first phase of LrtsInitCpuTopo(), which runs before the code added in my patch. The error is likely in processing either the CmiReduce (if the error is on PE0) or the broadcast message sent from PE0.
Original date: 2017-06-30 19:02:40
This particular bug has probably been solved by the recent cputopology patches:
https://charm.cs.illinois.edu/gerrit/#/c/2723/ https://github.com/UIUC-PPL/charm/commit/e9191d90ba87d917ad9ed202d0be06ebc98ab553
https://charm.cs.illinois.edu/gerrit/#/c/2735/ https://github.com/UIUC-PPL/charm/commit/6a393c85cd04e4f83bcc651d8253a853456efd3f
But it should be tested (NAMD workaround would probably need to be disabled to verify).
Original date: 2017-06-30 19:13:15
I still get the exact same "Length mismatch!!" abort with verbs-linux-x86_64 on Quartz on today's master version of Charm.
Original date: 2017-06-30 19:14:28
I'm referring to this bug (529), not 1381.
Original date: 2017-09-27 19:59:29
I'm pretty sure this bug has been solved, because the above-mentioned patches prevent Charm init from progressing on all PEs until InitCpuTopo has completed.
Also, there is currently no way to replicate it: Stampede was decommissioned, and there is no way to do large-scale ibverbs runs.
For now, pushing this to 6.9 but NAMD group should decide whether to retire it.
Original issue: https://charm.cs.illinois.edu/redmine/issues/529
I'm trying to run ibverbs-smp NAMD on Stampede, and on 512+ nodes I regularly get segfaults during startup. I've tracked them down to the fact that on the net layer LrtsInitCpuTopo() does its global physical node search by hijacking the message loop until it finishes. This means that NAMD's WorkDistrib group tends to be created on some nodes before the topology information is generated.
By some miracle the MIC port doesn't have this issue. I think David Kunzman was encountering it but he has his MIC startup code sitting at the exact right place to introduce a delay. I can probably do a workaround for the NAMD release, but this is really confusing.
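The startup race described above can be modeled with a toy single-PE inbox. This is only a sketch under stated assumptions: `Pe`, `Msg`, `initTopoByPumping`, and `initTopoDeferring` are hypothetical names, and the "deferring" variant is just one way to picture a synchronized startup, not the approach taken by any actual Charm++ patch.

```cpp
#include <deque>
#include <string>
#include <vector>

// Two message kinds are enough to show the reordering (hypothetical model).
enum class Msg { CreateGroup, TopoReply };

struct Pe {
    std::deque<Msg> inbox;
    bool topoReady = false;
    std::vector<std::string> log;

    void handle(Msg m) {
        if (m == Msg::TopoReply) {
            topoReady = true;
            log.push_back("topo-ready");
        } else {
            log.push_back(topoReady ? "group-created"
                                    : "group-created-before-topo");
        }
    }

    // Models topology init that pumps the ordinary message loop until
    // its own reply arrives, so unrelated messages get handled too early.
    void initTopoByPumping() {
        while (!topoReady && !inbox.empty()) {
            Msg m = inbox.front();
            inbox.pop_front();
            handle(m);
        }
    }

    // Models a synchronized startup: unrelated messages are held back
    // until the topology reply has been processed.
    void initTopoDeferring() {
        std::deque<Msg> held;
        while (!topoReady && !inbox.empty()) {
            Msg m = inbox.front();
            inbox.pop_front();
            if (m == Msg::TopoReply) handle(m);
            else held.push_back(m);
        }
        // Requeue held messages at the front, in their original order.
        inbox.insert(inbox.begin(), held.begin(), held.end());
    }

    // Normal scheduling after startup: process whatever remains.
    void drain() {
        while (!inbox.empty()) {
            Msg m = inbox.front();
            inbox.pop_front();
            handle(m);
        }
    }
};
```

If a group-creation message is queued ahead of the topology reply, the pumping variant creates the group before topology is ready, while the deferring variant does not.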