LrtsInitCpuTopo() operates asynchronously and overlaps with group constructors

jcphill commented 10 years ago

Original issue: https://charm.cs.illinois.edu/redmine/issues/529

I'm trying to run ibverbs-smp NAMD on Stampede, and on 512+ nodes I regularly get segfaults during startup. I've tracked them down to the fact that on the net layer LrtsInitCpuTopo() does its global physical node search by hijacking the message loop until it finishes. This means that NAMD's WorkDistrib group tends to be created on some nodes before the topology information is generated.

By some miracle the MIC port doesn't have this issue. I think David Kunzman was encountering it but he has his MIC startup code sitting at the exact right place to introduce a delay. I can probably do a workaround for the NAMD release, but this is really confusing.

ericjbohm commented 5 years ago

Original date: 2014-07-07 19:27:32

Core group decision:

Make "delay until main" an option. Also detect condition where a topology query is made before it is initialized and issue meaningful error.
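A minimal sketch of the second point (detecting a query made before initialization), assuming a hypothetical `TopologyTable` class. None of these names come from Charm++; the real check would live wherever the runtime keeps its topology state.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the runtime's topology table; not Charm++ code.
class TopologyTable {
public:
    // Called once the physical-node search has completed.
    void markInitialized(int physicalNodeCount) {
        assert(physicalNodeCount > 0);
        nodeCount_ = physicalNodeCount;
        initialized_ = true;
    }

    bool isInitialized() const { return initialized_; }

    // Fails loudly instead of returning stale data when queried too early.
    int physicalNodeCount() const {
        if (!initialized_) {
            throw std::logic_error(
                "topology queried before initialization completed; "
                "defer the query until topology information is available");
        }
        return nodeCount_;
    }

private:
    bool initialized_ = false;
    int nodeCount_ = 0;
};

// Returns the error text produced by a query, or "" if the query succeeds.
std::string earlyQueryError(const TopologyTable& table) {
    try {
        (void)table.physicalNodeCount();
        return "";
    } catch (const std::logic_error& e) {
        return e.what();
    }
}
```

With a guard like this, a group constructor that runs too early would stop with a readable message rather than a segfault.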

jcphill commented 5 years ago

Original date: 2014-07-28 16:06:25

A workaround (basically CcdCallOnCondition(CcdTOPOLOGY_AVAIL, ...)) has been checked into NAMD so this issue is no longer urgent, but it is a surprising behavior that should be fixed.
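For readers unfamiliar with the pattern, here is a standalone model of "run this once a condition is raised". It is only a sketch: `ConditionRegistry`, `callOnCondition`, `raiseCondition`, and `kTopologyAvail` are hypothetical names for illustration, not the Converse API, and the actual NAMD change is not reproduced here.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified model of condition-triggered callbacks.
class ConditionRegistry {
public:
    using Callback = std::function<void()>;

    // Queue fn to run the next time `condition` is raised.
    void callOnCondition(int condition, Callback fn) {
        assert(fn != nullptr);
        pending_[condition].push_back(std::move(fn));
    }

    // Run and clear every callback registered for `condition`.
    // Returns the number of callbacks that ran.
    int raiseCondition(int condition) {
        auto it = pending_.find(condition);
        if (it == pending_.end()) return 0;
        std::vector<Callback> ready = std::move(it->second);
        pending_.erase(it);
        for (auto& fn : ready) fn();
        return static_cast<int>(ready.size());
    }

private:
    std::map<int, std::vector<Callback>> pending_;
};

// Hypothetical condition id standing in for "topology is available".
constexpr int kTopologyAvail = 1;
```

The topology-dependent part of a constructor is registered as a callback, and the runtime raises the condition once the node search has finished, so the work cannot observe missing topology data.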

nikhil-jain commented 5 years ago

Original date: 2015-09-17 20:18:17

Ping. What is the status here? Is this bug for real?

PhilMiller commented 5 years ago

Original date: 2015-09-22 20:15:25

NAMD has its workaround, so not critical.

bilgeacun commented 5 years ago

Original date: 2017-04-13 20:56:48

This issue https://charm.cs.illinois.edu/redmine/issues/1381 might also be related to this.

PhilMiller commented 5 years ago

Original date: 2017-04-25 20:52:22

Based on the preference for MPI machine layer on OmniPath systems for now, we're deferring this.

PhilMiller commented 5 years ago

Original date: 2017-06-22 14:34:15

This may have been addressed by ~~https://charm.cs.illinois.edu/gerrit/#/c/2723/~~ https://github.com/UIUC-PPL/charm/commit/e9191d90ba87d917ad9ed202d0be06ebc98ab553 that synchronizes startup to ensure topology is available before reaching user code.

stwhite91 commented 5 years ago

Original date: 2017-06-23 13:05:05

The above merged change doesn't change anything for bug #1381: Crash in LrtsInitCpuTopo() on Quartz with verbs layer.

```
./charmrun +p2 ./hello ++mpiexec ++remote-shell ./mysrun
Charmrun> scalable start enabled.
Charmrun> IBVERBS version of charmrun
Charmrun> started all node programs in 0.214 seconds.
Charm++> Running in non-SMP mode: numPes 2
Converse/Charm++ Commit ID: v6.7.0-1005-g8b8bb11
Charm++> scheduler running in netpoll mode.
CharmLB> Load balancer assumes all CPUs are same.
------------- Processor 0 Exiting: Called CmiAbort ------------
Reason: 

        Length mismatch!!

[0] Stack Traceback:
  [0:0] CmiAbortHelper+0xb3  [0x613f78]
  [0:1] CmiAbort+0x2d  [0x613fb3]
  [0:2]   [0x618d3a]
  [0:3]   [0x618fda]
  [0:4]   [0x618928]
  [0:5]   [0x618843]
  [0:6]   [0x618c54]
  [0:7] LrtsAdvanceCommunication+0x1a  [0x61bd6e]
  [0:8]   [0x613dac]
  [0:9] CmiGetNonLocal+0x75  [0x61402a]
  [0:10] CsdNextMessage+0x9b  [0x61e12d]
  [0:11] CsdSchedulePoll+0x73  [0x61e471]
  [0:12] LrtsInitCpuTopo+0x2e5  [0x634491]
  [0:13] CmiInitCPUTopology+0x18  [0x634672]
  [0:14] _Z10_initCharmiPPc+0x651  [0x534ad2]
  [0:15]   [0x613d66]
  [0:16] ConverseInit+0x324  [0x613c82]
  [0:17] main+0x3f  [0x5327bc]
  [0:18] __libc_start_main+0xf5  [0x2aaaab929b35]
  [0:19]   [0x52dab9]
```

juanjgalvez commented 5 years ago

Original date: 2017-06-23 14:11:27

Makes sense. Looking at bug #1381, the error occurs during low-level processing of messages sent in the first phase of LrtsInitCpuTopo(), which happens before the code added in my patch. The error is likely in processing either the CmiReduce (if the error is on PE0, it may be the CmiReduce) or the broadcast message sent from PE0.

juanjgalvez commented 5 years ago

Original date: 2017-06-30 19:02:40

This particular bug has probably been solved by the recent cputopology patches: ~~https://charm.cs.illinois.edu/gerrit/#/c/2723/~~ https://github.com/UIUC-PPL/charm/commit/e9191d90ba87d917ad9ed202d0be06ebc98ab553 ~~https://charm.cs.illinois.edu/gerrit/#/c/2735/~~ https://github.com/UIUC-PPL/charm/commit/6a393c85cd04e4f83bcc651d8253a853456efd3f

But it should be tested (NAMD workaround would probably need to be disabled to verify).

stwhite91 commented 5 years ago

Original date: 2017-06-30 19:13:15

I still get the exact same "Length mismatch!!" abort with verbs-linux-x86_64 on Quartz on today's master version of charm.

juanjgalvez commented 5 years ago

Original date: 2017-06-30 19:14:28

I'm referring to this bug (529), not 1381.

juanjgalvez commented 5 years ago

Original date: 2017-09-27 19:59:29

I'm pretty sure this bug has been solved because the above-mentioned patches prevent Charm init from progressing on all PEs until InitCpuTopo has completed.

Also, there is currently no way to reproduce it: Stampede was decommissioned and there is no way to do large-scale ibverbs runs.

For now, pushing this to 6.9 but NAMD group should decide whether to retire it.
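To make the ordering problem in this issue concrete, here is a toy, single-threaded model of a topology search that pumps the message loop. It is a sketch under stated assumptions: `ToyPe`, `Msg`, and `runStartup` are hypothetical, and the `gateUserMessages` flag only illustrates the idea of holding user-level work until topology is ready. It is not how the referenced patches are implemented.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Hypothetical message kinds for a toy startup model; not Charm++ types.
enum class Msg { TopoReply, GroupCreate };

struct ToyPe {
    std::deque<Msg> inbox;
    std::deque<Msg> deferred;       // user-level messages held back during init
    int topoRepliesNeeded = 0;
    bool topoReady = false;
    bool gateUserMessages = false;  // true models a synchronized startup
    std::vector<std::string> log;

    void handle(Msg m) {
        if (m == Msg::TopoReply) {
            assert(topoRepliesNeeded > 0);
            if (--topoRepliesNeeded == 0) topoReady = true;
            return;
        }
        // Group constructor: record whether topology data was ready.
        log.push_back(topoReady ? "group-ctor:topo-ready"
                                : "group-ctor:topo-MISSING");
    }

    // Models a topology search that drains the message loop until it finishes.
    void initTopology() {
        while (!topoReady && !inbox.empty()) {
            Msg m = inbox.front();
            inbox.pop_front();
            if (gateUserMessages && m != Msg::TopoReply) {
                deferred.push_back(m);  // hold user work until topology is ready
            } else {
                handle(m);              // ungated: user work runs mid-search
            }
        }
        // Release anything that was held back during the search.
        while (!deferred.empty()) {
            handle(deferred.front());
            deferred.pop_front();
        }
    }
};

// Runs the toy startup with a group-creation message arriving mid-search.
ToyPe runStartup(bool gate) {
    ToyPe pe;
    pe.gateUserMessages = gate;
    pe.topoRepliesNeeded = 2;
    pe.inbox = {Msg::TopoReply, Msg::GroupCreate, Msg::TopoReply};
    pe.initTopology();
    return pe;
}
```

Calling `runStartup(false)` logs a group constructor that saw missing topology data, which is the failure mode reported here; `runStartup(true)` logs the same constructor running only after the search completed.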
