Closed by ronawho 4 years ago
@ronawho, with PG&E having cut power to LBNL again, I am left groping in the dark for an answer. I honestly cannot see how the behaviors of bug 3338/4008 would still occur with GASNET_USE_ODP=1. However, the change in behavior you report when moving between GASNet 2019.6.0 and 2019.9.0 certainly is consistent with the changes to resolve Put-from-R/O-source when not using ODP.
If you are able to run your reproducer with a debug build of GASNet, set env vars GASNET_TRACEFILE=tr% GASNET_TRACEMASK=D GASNET_TRACENODES=0 and examine the node=0 trace file, tr0, to look for "FIREHOSE_MOVE: read-only memory found at ...". The presence of such a line will confirm that the code related to fixing bug 3338 (in absence of ODP) has actually run.
By any chance is Chapel's --cache-remote making use of mprotect()? If so, this might invalidate an assumption GASNet is making. In that case, I could provide a small patch to prevent ibv-conduit from remembering it has seen a read-only page.
--cache-remote itself doesn't use mprotect(), but the tasking layer does, to create stack guard pages. If I disable guard pages I'm able to run:
chpl ibv-repro.chpl --cache-remote --no-stack-checks
./ibv-repro -nl 2
1
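For readers unfamiliar with stack guard pages, here is a minimal sketch of how one can be created with mprotect(). This is not the Chapel tasking layer's code: the names guarded_stack_t, guarded_stack_create, and guarded_stack_destroy are hypothetical, and placing the guard at the low end of the mapping is an assumption made for illustration. The point is only that the guard page lives inside an otherwise ordinary mapping, so a check that looks at address bounds alone cannot tell it apart from usable memory.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical descriptor for a stack with one guard page below it. */
typedef struct {
    void  *map;        /* start of the whole mapping (the guard page) */
    size_t map_len;    /* guard page + usable bytes */
    void  *usable;     /* first usable byte, just above the guard page */
    size_t usable_len; /* usable bytes, rounded up to whole pages */
} guarded_stack_t;

/* Map usable_len bytes (rounded up to pages) plus one extra page, then
 * revoke all access to that extra page. Returns 0 on success, -1 on error. */
static int guarded_stack_create(guarded_stack_t *s, size_t usable_len) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0) return -1;
    size_t pg = (size_t)page;
    size_t rounded = (usable_len + pg - 1) / pg * pg;
    size_t total = rounded + pg;

    void *p = mmap(NULL, total, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;

    /* The guard page: any load or store to it now faults. */
    if (mprotect(p, pg, PROT_NONE) != 0) {
        munmap(p, total);
        return -1;
    }

    s->map = p;
    s->map_len = total;
    s->usable = (char *)p + pg;
    s->usable_len = rounded;
    return 0;
}

/* Release the whole mapping, guard page included. */
static void guarded_stack_destroy(guarded_stack_t *s) {
    munmap(s->map, s->map_len);
}
```

After guarded_stack_create() succeeds, the range [map, map + map_len) is one contiguous mapping, but only [usable, usable + usable_len) can actually be read or written.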
Hmm. Disabling of guard pages seems suspicious to me. This would seem to indicate that the Chapel runtime is attempting to communicate the content of a guard page!
Hmm, yeah that does seem suspicious.
I'm starting to think this could be a bug in chpl_comm_addr_gettable(), which is used by some of the prefetching code for --cache-remote. chpl_comm_addr_gettable() looks like:
I'm not familiar with that code, but it looks like the assumption is that if some region is within the segment it is directly gettable, but that doesn't take into account guard pages. I'm also not sure if that check makes sense in the context of dynamic registration.
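To make the concern concrete, here is a small sketch contrasting a bounds-only check with one that also excludes guard pages. It is not the body of chpl_comm_addr_gettable(): region_t, gettable_bounds_only, and gettable_guard_aware are hypothetical names, and the idea that the runtime could enumerate its guard pages as a simple array is an assumption made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a contiguous address range. */
typedef struct {
    uintptr_t base;
    size_t    len;
} region_t;

/* True if [addr, addr+size) lies entirely inside r (overflow-safe). */
static bool region_contains(const region_t *r, uintptr_t addr, size_t size) {
    return addr >= r->base &&
           size <= r->len &&
           addr - r->base <= r->len - size;
}

/* True if [addr, addr+size) shares at least one byte with r. */
static bool region_overlaps(const region_t *r, uintptr_t addr, size_t size) {
    if (size == 0 || r->len == 0) return false;
    return (addr >= r->base) ? (addr - r->base < r->len)
                             : (r->base - addr < size);
}

/* Bounds-only test: "inside the segment" is treated as "gettable". */
static bool gettable_bounds_only(const region_t *seg,
                                 uintptr_t addr, size_t size) {
    return region_contains(seg, addr, size);
}

/* Guard-aware test: additionally rejects any range touching a guard page. */
static bool gettable_guard_aware(const region_t *seg,
                                 const region_t *guards, size_t nguards,
                                 uintptr_t addr, size_t size) {
    if (!region_contains(seg, addr, size)) return false;
    for (size_t i = 0; i < nguards; i++) {
        if (region_overlaps(&guards[i], addr, size)) return false;
    }
    return true;
}
```

With a segment at 0x100000 of length 0x10000 and a guard page at 0x104000 of length 0x1000, a 64-byte range starting at 0x104800 passes the bounds-only test but is rejected by the guard-aware one. That is the kind of address a prefetch could hand to a GET if guard pages are not considered.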
I need to dig into this some more and probably chat with @mppf.
Thanks for looking at this @PHHargrove, and sorry for the false alarm if this ends up being a bug on our end.
Ideally, the prefetch code would run GETs in a way that ignores invalid memory addresses. Probably what is happening here is just that the GASNet configuration uses guard pages (b/c not huge pages) even though ugni normally does not. For the short term, I think we should make the caching code disable prefetching when guard pages are enabled (it should already disable prefetching if there is no defined segment, e.g. with segment=everything).
With --cache-remote there are 7 failures for release/examples runtime/configMatters under gasnet-ibv-large.
Here's the simplest reproducer I have so far:
On master, it fails with:
which looks like https://gasnet-bugs.lbl.gov/bugzilla/show_bug.cgi?id=3338. I think that bug was resolved with the latest gasnet release. Upgrading to 2019.9.0 I see a different error:
which I think is https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=4008
Running with GASNET_USE_ODP=0 does not change the failure mode.
I need to investigate why we're only seeing these failures with --cache-remote.