Optimization: Start remote gets early #16670

Open · bradcray opened this issue 3 years ago

bradcray commented 3 years ago

This captures an optimization idea that we discussed in the performance meeting today while looking at Greg's recent improvements for remote puts and thinking about what it would take to get a similar performance jump for remote gets. @ronawho pointed out that gets are different because you tend to need their data before you can go on. This made me wonder whether we could do a compiler-driven optimization in which remote gets are made non-blocking and moved back (earlier) in the source until something, such as a data dependence or a memory fence, prevents moving them any further.

As a really trivial / uninspired example, in code like the following:

```chapel
var A = newBlockArr({1..n}, real);

A = ...;

writeln("I'm about to print A[i]");

writeln(A[i]);
```

you could imagine the get corresponding to A[i] in the final line being replaced with a wait, with a non-blocking get moved back to the point right before A was assigned. That is, rather than doing the following pseudo-code:

```chapel
writeln("I'm about to print A[i]");
var tmp = get(A[i]);
writeln(tmp);
```

we would do:

```chapel
A = ...;
var (tmp, handle) = get_nb(A[i]);

writeln("I'm about to print A[i]");

wait(handle);
writeln(tmp);
```

We did a similar optimization to this in ZPL, though ZPL was a much simpler language, so the def-use analysis could be done without being overly conservative.

gbtitus commented 3 years ago

Yes, (re)scheduling long-latency operations earlier can often do a lot of good. Of course, you do need to be able to do those ops in a non-blocking way (as you discussed), or all you've done is move the latency rather than give yourself an opportunity to do other things while it passes.

The first thing I worked on at Cray (CRI) was the instruction scheduler in the Ada code generator, and on an X-MP by far the most important instructions to move so they could start earlier were scalar loads. There were no caches, so even though all loads were local, they had much longer latency than anything else, even divides, which in turn had far more latency than whatever was in third place. That situation was simpler than this one because you didn't have to check for completion; the processor pipeline did that. But otherwise it was the same problem.

Surely there's a way to arrange for LLVM to do this, though, rather than doing it ourselves and having to write all the supporting gunk like use-def and alias and loop invariance analyses (because you really want to hoist these out of loops if you can). We might have to leave remote loads looking like regular loads but with some kind of "high latency" flag on them, and convert them into comm layer GET function calls after codegen rather than before, but surely that shouldn't be too hard. Didn't @mppf already do something related to this with LLVM memory kind identifiers or something like that? (I can't remember the exact right term here, but hopefully "memory kind identifiers" will trigger someone's recall.)
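
For instance, the loop case might look like this in the thread's get() pseudo-notation (a sketch, not actual compiler or LLVM output):

```chapel
// Before: B[j] is loop-invariant, yet every iteration pays for a remote GET.
for i in 1..n do
  A[i] = get(B[j]) * i;

// After loop-invariant code motion: a single GET, reused each iteration.
const t = get(B[j]);
for i in 1..n do
  A[i] = t * i;
```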

bradcray commented 3 years ago

My takeaway from Greg's comment is that we should invent a time machine because then-Brad (or maybe then-Sung?) and then-Greg would've been all over this with no problems. :D

Good point about Michael's LLVM-based work on memory segments being really thematically similar to this; I'd forgotten about that. It seems like it might be a classic case where Chapel could potentially be less conservative, since it knows more about the source of the addresses and communications than LLVM probably would. But if LLVM did a good-enough job, it'd be nice not to have to worry about implementing this...

mppf commented 3 years ago

I haven't read all of the above yet, but I had always imagined that the compiler would add prefetch calls into --cache-remote to handle this. Doing it that way allows the compiler to ignore the analysis problem, for the most part. Instead, the compiler needs some kind of heuristic to add the prefetching in cases that matter and to guess what prefetch distance to use.
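
To make that concrete, here's a hedged sketch of what a compiler-inserted prefetch might look like (prefetchRemote is a made-up placeholder here, not an actual --cache-remote entry point, and the distance D is just the heuristic's guess):

```chapel
var total = 0.0;
const D = 8;                    // prefetch distance guessed by the heuristic
for i in 1..n {
  if i + D <= n then
    prefetchRemote(A[i + D]);   // hypothetical: warm the remote cache early
  total += A[i];                // by now this GET should hit the cache
}
```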

mppf commented 3 years ago

> That situation was simpler than this one because you didn't have to check for completion; the processor pipeline did that. But otherwise it was the same problem.

With --cache-remote the cache itself will handle checking for conflicts, fences, and completion. This makes it much easier because the compiler need only add prefetch calls.

> Surely there's a way to arrange for LLVM to do this, though, rather than doing it ourselves and having to write all the supporting gunk like use-def and alias and loop invariance analyses (because you really want to hoist these out of loops if you can). We might have to leave remote loads looking like regular loads but with some kind of "high latency" flag on them, and convert them into comm layer GET function calls after codegen rather than before, but surely that shouldn't be too hard. Didn't @mppf already do something related to this with LLVM memory kind identifiers or something like that? (I can't remember the exact right term here, but hopefully "memory kind identifiers" will trigger someone's recall.)

Yes, we use address space 100 to mark "global pointers" and then run existing LLVM optimizations. Loads/stores on address space 100 pointers represent GETs and PUTs and as a result the existing LLVM optimizations can do loop invariant code motion etc. This is activated with --llvm --llvm-wide-opt. So I would say that we already do what you have described above.
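
For example (a sketch, and exact codegen will of course vary), with chpl --llvm --llvm-wide-opt the loop-invariant remote read below shows up in the IR as a load through an addrspace(100) pointer, so stock LLVM LICM can hoist it into a single GET:

```chapel
var x = 3.14;                  // allocated on locale 0
on Locales[numLocales-1] {
  var sum = 0.0;
  for i in 1..1000 do
    sum += x;                  // remote GET; an addrspace(100) load in the
                               // IR, so LICM can lift it out of the loop
  writeln(sum);
}
```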

> Good point about Michael's LLVM-based work on memory segments being really thematically similar to this; I'd forgotten about that. It seems like it might be a classic case where Chapel could potentially be less conservative, since it knows more about the source of the addresses and communications than LLVM probably would. But if LLVM did a good-enough job, it'd be nice not to have to worry about implementing this...

I'm not sure what Chapel would know about these load-as-GET or store-as-PUT operations that LLVM couldn't. We already have Chapel hint to LLVM about, for example, arrays that don't alias, and about Chapel's specific types and how they impact alias analysis. So the LLVM strategy here would be to add whatever information is necessary for good optimization to the LLVM IR.

Having said all that, we might need to supplement the existing LLVM optimizations with one of our own that does something special. We already do that in --llvm-wide-opt to handle aggregating loads/stores to adjacent remote memory locations, as sketched below. There is also some work from Akihiro adding an LLVM optimization that can do more of a bulk transfer, described in the paper about LLVM optimizations of PGAS programs. We could engage with Akihiro to bring some of that into the main repository (but I am pretty sure it would need tightening up to become production-quality).
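
To illustrate the aggregation idea in the thread's get() pseudo-notation (get_bulk is a made-up stand-in for whatever the aggregation pass actually emits):

```chapel
// Three element-sized GETs from adjacent remote addresses...
var a = get(B[j]);
var b = get(B[j+1]);
var c = get(B[j+2]);

// ...can be coalesced into a single bulk transfer of three elements:
var buf = get_bulk(B[j], 3);   // hypothetical aggregated GET
var (a2, b2, c2) = (buf[0], buf[1], buf[2]);
```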

Now, I know of some existing work on an LLVM pass that adds prefetch hints:

https://www.cl.cam.ac.uk/~sa614/papers/Software-Prefetching-CGO2017.pdf

In 2017 I communicated a bit with the authors of that paper, but we didn't end up creating a full implementation at that time. Their implementation is available at https://github.com/SamAinsworth/reproduce-cgo2017-paper under an MIT license, and I believe we could build our own work using it as a starting point.
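
Roughly, that pass targets indirect accesses of the form A[B[i]]. A sketch of its transformation, written here in Chapel-like code with a hypothetical prefetch() standing in for the inserted intrinsic:

```chapel
var sum = 0.0;
const D = 16;                  // the pass's chosen look-ahead distance
for i in 1..n {
  if i + D <= n then
    prefetch(A[B[i + D]]);     // hypothetical: prefetch the future target
  sum += A[B[i]];              // the indirect access being accelerated
}
```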

mppf commented 3 years ago

Specifically for Brad's example, though, I don't see how this would work.

Brad proposed translating this:

writeln("I'm about to print A[i]");
var tmp = get(A[i]);
writeln(tmp);

into this:

```chapel
var tmp = get(A[i]);
writeln("I'm about to print A[i]");
writeln(tmp);
```

Since the writeln contains both an on statement and locking, I don't think it's generally possible to move the load before the writeln.

But that problem is specific to having writeln as the intervening call.

bradcray commented 3 years ago

> Since the writeln contains both an on statement and locking

Good point about the locking (which introduces a memory fence, or at least I think that's what you're implying?). Do on statements currently introduce memory fences? I was thinking it was only syncs/singles/atomics, and the joins on cobegin/coforall/the sync statement.

[edit: noting again that this example was meant more as an illustration than as a compelling case; but it's also a case where knowing the higher-level semantics of the language could permit optimizations that lower-level analysis couldn't, or where the higher-level concepts might be implemented, or might communicate their semantics, in a way that doesn't thwart lower-level optimizations.]

mppf commented 3 years ago

> Good point about the locking (which introduces a memory fence, or at least I think that's what you're implying?)

Yes.

> Do on statements currently introduce memory fences?

Yes, for the purposes of the cache-for-remote-data, an on statement behaves like a fence.

> noting again that this example was meant more as an illustration than as a compelling case; but it's also a case where knowing the higher-level semantics of the language could permit optimizations that lower-level analysis couldn't, or where the higher-level concepts might be implemented, or might communicate their semantics, in a way that doesn't thwart lower-level optimizations

I'm not sure what form such things would take, but for the lock problem specifically, perhaps #12306 would help (supposing that we were comfortable hoisting a load above a critical section). However, that would take a stand on inter-task communication via critical sections that other languages have not taken.

I suppose we could simply inform the compiler that it shouldn't consider writeln to be a fence for the remote cache, but it's a little unclear to me why that is OK in this case and how it would generalize to any other case.