chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

gasnet-ibv segment large errors with --cache-remote #14340

Closed ronawho closed 4 years ago

ronawho commented 4 years ago

With --cache-remote there are 7 failures for release/examples runtime/configMatters under gasnet-ibv-large.

Here's the simplest reproducer I have so far:

on Locales[1] do
  writeln("", here.id);

On master, it fails with:

chpl ibv-repro.chpl --cache-remote
./ibv-repro -nl 2

*** FATAL ERROR (proc 1): ibv_reg_mr failed in firehose_move_callback errno=14 (Bad address)

which looks like https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3338. I think that bug was resolved with the latest gasnet release. Upgrading to 2019.9.0 I see a different error:

@ 1> rcv comp->status=5
@ 1> - snd CQ contains impossibly large WCE count with status 5
*** FATAL ERROR (proc 1): aborting on reap of failed recv
@ 1> rcv comp->status=5
@ 1> - snd CQ contains impossibly large WCE count with status 5
*** FATAL ERROR (proc 1): aborting on reap of failed AM recv
[1] Invoking GDB for backtrace...
mlx5: prod-0009: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00008813 10017696 00606bd2
@ 0> snd status=10 opcode=2 dst_node=1 dst_qp=0
@ 0> - rcv CQ contains impossibly large WCE count with status 10
*** FATAL ERROR (proc 0): aborting on reap of failed send

which I think is https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4008

Running with GASNET_USE_ODP=0 does not change the failure mode.

I need to investigate why we're only seeing these failures with --cache-remote.

PHHargrove commented 4 years ago

@ronawho, with PG&E having cut power to LBNL again, I am left groping in the dark for an answer. I honestly cannot see how the behaviors of bug3338/4008 would still occur with GASNET_USE_ODP=1. However, the change in behavior you report when moving between GASNet 2019.6.0 and 2019.9.0 certainly is consistent with the changes to resolve Put-from-R/O-source when not using ODP.

If you are able to run your reproducer with a debug build of GASNet, set env vars GASNET_TRACEFILE=tr% GASNET_TRACEMASK=D GASNET_TRACENODES=0 and examine the node=0 trace file, tr0, to look for "FIREHOSE_MOVE: read-only memory found at ...". The presence of such a line will confirm that the code related to fixing bug 3338 (in the absence of ODP) has actually run.

By any chance is Chapel's --cache-remote making use of mprotect()? If so, this might invalidate an assumption GASNet is making. If so, I could provide a small patch to prevent ibv-conduit from remembering it has seen a read-only page.

ronawho commented 4 years ago

--cache-remote itself doesn't use mprotect(), but the tasking layer does, to create stack guard pages. If I disable guard pages, I'm able to run:

chpl ibv-repro.chpl --cache-remote --no-stack-checks
./ibv-repro -nl 2
1

PHHargrove commented 4 years ago

Hmm. Disabling of guard pages seems suspicious to me. This would seem to indicate that the Chapel runtime is attempting to communicate the content of a guard page!
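For readers unfamiliar with guard pages, here is a minimal C sketch of the general technique, not the Chapel tasking layer's actual code. The helper name `alloc_stack_with_guard` is hypothetical. It maps a stack and marks its lowest page `PROT_NONE` with `mprotect()`, so the guard page sits inside the same contiguous address range as the usable stack. Any attempt to read that page, or to register it for communication, can plausibly fail with a bad-address error.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper (not from the Chapel runtime): map `usable` bytes of
 * stack plus one inaccessible guard page at the low end of the mapping.
 * Returns the base of the mapping, which is the guard page itself, or NULL
 * on failure. The total mapped length is stored in *total_out. */
static void *alloc_stack_with_guard(size_t usable, size_t *total_out) {
  long page = sysconf(_SC_PAGESIZE);
  if (page <= 0) return NULL;

  size_t total = usable + (size_t)page;
  void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) return NULL;

  /* Revoke all access to the first page: touching it now faults. */
  if (mprotect(base, (size_t)page, PROT_NONE) != 0) {
    munmap(base, total);
    return NULL;
  }

  *total_out = total;
  return base;
}
```

The usable stack starts one page above the returned base; the caller releases the whole mapping with `munmap(base, total)`.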

ronawho commented 4 years ago

Hmm, yeah that does seem suspicious.

I'm starting to think this could be a bug in chpl_comm_addr_gettable(), which is used by some of the prefetching code for --cache-remote. chpl_comm_addr_gettable() looks like:

https://github.com/chapel-lang/chapel/blob/d11518bb03dcb986608991dc32637eb410a6ca48/runtime/src/comm/gasnet/comm-gasnet.c#L681-L703

I'm not familiar with that code, but it looks like the assumption is that if some region is within the segment it is directly gettable, but that doesn't take into account guard pages. I'm also not sure if that check makes sense in the context of dynamic registration.
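To make the suspected gap concrete, here is a hedged C sketch. It does not reproduce chpl_comm_addr_gettable(); the type `region_t` and the functions `region_contains`, `region_overlaps`, and `gettable_sketch` are hypothetical names for illustration, and it assumes the guard-page locations are available as a list. A pure "is the range inside the segment?" test accepts a range that touches a guard page, while the stricter variant rejects it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a contiguous address range. */
typedef struct {
  uintptr_t start;
  size_t len;
} region_t;

/* True if [addr, addr+size) lies entirely inside r (overflow-safe). */
static bool region_contains(region_t r, uintptr_t addr, size_t size) {
  return addr >= r.start && size <= r.len && addr - r.start <= r.len - size;
}

/* True if [addr, addr+size) shares at least one byte with r. */
static bool region_overlaps(region_t r, uintptr_t addr, size_t size) {
  if (size == 0 || r.len == 0) return false;
  return (addr >= r.start) ? (addr - r.start < r.len)
                           : (r.start - addr < size);
}

/* Sketch of a stricter check: inside the segment AND clear of every
 * known guard page. The guard list is an assumption of this example. */
static bool gettable_sketch(region_t seg, const region_t *guards,
                            size_t nguards, uintptr_t addr, size_t size) {
  if (!region_contains(seg, addr, size)) return false;
  for (size_t i = 0; i < nguards; i++) {
    if (region_overlaps(guards[i], addr, size)) return false;
  }
  return true;
}
```

A linear scan over the guard list is only reasonable for a handful of regions. Whether the runtime can cheaply enumerate guard pages at all is an open question here.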

I need to dig into this some more and probably chat with @mppf.

Thanks for looking at this @PHHargrove, and sorry for the false alarm if this ends up being a bug on our end.

mppf commented 4 years ago

Ideally, the prefetch code would run GETs in a way that ignores invalid memory addresses. Probably what is happening here is just that the GASNet configuration uses guard pages (b/c not huge pages) even though ugni normally does not. For the short term, I think we should make the caching code disable prefetching when guard pages are enabled (it should already disable prefetching if there is no defined segment, e.g. with segment=everything).
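The short-term policy proposed above can be written as a single predicate. This is a sketch only: `comm_config_t`, its fields, and `prefetch_allowed` are hypothetical names, not identifiers from the Chapel runtime.

```c
#include <stdbool.h>

/* Hypothetical summary of the two runtime facts the proposal depends on. */
typedef struct {
  bool has_defined_segment;  /* false for e.g. segment=everything */
  bool guard_pages_enabled;  /* true when task stacks have guard pages */
} comm_config_t;

/* Prefetch only when remote addresses can be validated against a segment
 * and no guard pages can be hiding inside that segment. */
static bool prefetch_allowed(comm_config_t cfg) {
  return cfg.has_defined_segment && !cfg.guard_pages_enabled;
}
```

This trades away prefetching in the guard-page configuration in exchange for never issuing a GET against an inaccessible page.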