Open bmcdonald3 opened 4 years ago
A couple of other thoughts here:
(a) it seems like there's the potential for the compiler to automatically optimize such cases as well, so I think of a local ref
feature as being more of a stopgap or expert's tool than an ideal solution (tagging @e-kayrakli as this seems like the kind of compiler-driven optimization that he might enjoy with potentially significant payoff).
(b) this case is a bit interesting in that it's a reference to an array, so there's a bit of a question about whether the local
is asserting that all the array's elements are local or that the array descriptor is local (where, in this case, the array is a default array, so the two are one and the same, but for a block array...).
(c) the magnitude of these performance differences caught me by surprise even understanding the root cause.
Is the issue that when we are accessing via the ref
version, we are not using the privatized instance, because in the initial definition of the ref
variable we get a wide pointer to the main copy of data
that's likely sitting on LOCALE0?
@mcdobe100 should correct me if I'm wrong, but I believe in his program all the _data fields are local default arrays, so there's no privatized instance since they're not distributed (right?). Also v1
and v2
are local records. So the idea is that a default array field in a local record is local (and could be proven to be with some effort by the compiler), so ought to be able to be referenced directly with a pointer. I think the crux of the bad behavior is that we just always make references wide by default from a "correctness first, performance optimizations later" point of view and have never come back to tune cases like this. In some cases we've introduced developer-oriented tricks to optimize such cases, but I don't think we have anything a user could similarly utilize to tune their code.
For the "what if it were a Block array?" case that I brought up, things are less clear. It seems like local ref
would still be accurate if we were on the node where the record existed. But if we were on another locale, you could either imagine saying "that field is not local, so a local ref isn't reasonable." Or you could imagine the compiler saying "this is a privatized field, so I could still do some optimization to refer to it locally rather than back through the original record." That makes this feel more similar to #10160 than I'd realized before... though again, I don't think that's the case @mcdobe100 is worried about today.
should correct me if I'm wrong, but I believe in his program all the _data fields are local default arrays, so there's no privatized instance since they're not distributed (right?)
The original post says otherwise. Besides, if that was the case, I would expect similar hit in a single-locale run and (just like you) I wouldn't expect to see such a big hit in performance.
That's why I was thinking there is some actual communication going on when you have a ref
variable, and that may be explained by not using privatized instances.
I think the crux of the bad behavior is that we just always make references wide by default from a "correctness first, performance optimizations later" point of view and have never come back to tune cases like this. In some cases we've introduced developer-oriented tricks to optimize such cases, but I don't think we have anything a user could similarly utilize to tune their code.
Thanks for this explanation. I wasn't aware of this behavior. It'd be interesting to see what we can do with compiler optimizations to mitigate that. Although, it feels challenging to do an analysis that says "this symbol is local" with a good level of confidence.
I also wanted to express some hesitancy towards having local ref
. For one, I agree that it should be a power-user feature. But more than that, I am not sure what that would mean exactly. Does it mean narrow pointer (like a C++ reference, I guess)? Does it mean that accesses on it are always local?
Ahh, you are right @e-kayrakli I had said that they were block distributed in the original post, but they are actually local default arrays like @bradcray had mentioned, I have updated the original post to reflect that, sorry for any confusion that caused.
Besides, if that was the case, I would expect similar hit in a single-locale run and (just like you) I wouldn't expect to see such a big hit in performance.
This is true that the same performance hit isn't seen on single locale (in fact just the opposite). I guess from a user standpoint not knowing much about the compiler, it just is confusing to me that a local copy of a local array acts any differently than the actual local array.
Ahh, you are right @e-kayrakli I had said that they were block distributed in the original post, but they are actually local default arrays like @bradcray had mentioned, I have updated the original post to reflect that, sorry for any confusion that caused.
The performance degradation is very surprising, then. One initial step of understanding what's going on wrong can be to put everything that uses those ref's in a local
block which should make all the wide references narrow and hopefully improve the performance. But I agree that we should be able to do that without local
blocks.
I would expect similar hit in a single-locale run
I was interpreting the single-locale performance numbers as meaning CHPL_COMM=none, which would never widen the references, so end up being much cheaper. @mcdobe100, am I correct in that, or are these CHPL_COMM!=none but with -nl 1
?
Does it mean narrow pointer (like a C++ reference, I guess)? Does it mean that accesses on it are always local?
I think these are the right questions to wrestle with from a design perspective. In this context, I think the former is what we want (in some way... it doesn't have to be this specific proposal). I'm imagining that when a local ref
by this definition crosses an on-clause, it becomes a normal wide reference.
Although, it feels challenging to do an analysis that says "this symbol is local" with a good level of confidence.
We'd obviously have to be conservative, like with any compiler optimization, but I think there's definitely low-hanging fruit here. For example:
var x: int;
ref a = x; // obviously local
var r: myPodRecord;
ref b = r; // obviously local
ref c = r.x; // obviously local
var myArray: [1..10] real;
ref d = myArray; // maybe not obviously local(?), but an important case to optimize for; or maybe we say that since the array was declared on our same locale, it is obviously local whatever the distribution as a starting point?
and then recursive cases of these (e.g., non-POD record with a field that was an int, or POD, or a DR array.
The reason I think this is important is that ref
can be a really nice way of introducing a shorthand for an otherwise complex expression in the language, so it shouldn't become a performance killer. As motivation, look at the generated code for this program:
var x: int;
ref y = x;
y = 42;
for CHPL_COMM=none
vs. != none
.
@bradcray You are right in that, the numbers that saw the improvement were ran with CHPL_COMM=none, not with -nl 1
, should have made that more clear.
This is an argument for having a
local ref
feature in Chapel.When implementing a
mink
reduction in Arkouda, which is basically just a reduction to return the minimumk
values from an array, I was seeing some interesting behavior when usingref
vsvar
to perform my operations. In short, when I ran the exact same block of code with only the variation shown below, I was seeing significant performance degradation, presumably to the current "wide-references" that are used.Deep copy (where v1._data is a default flat array):
Reference (where v1._data is a default flat array):
This is a huge performance difference, which a user (me before I talked to Brad) would expect the
ref
to perform better since it is avoiding the copies. I think this is an example where having something likelocal ref x = y
would improve this situation and avoid some array deep copies as well as make things more clear.The code that was used to gather these results is located here, with the test I ran being
merge.chpl
(compiled withmake test-bin/merge
) and the code that illustrates the issue located here I apologize that there is a lot of unnecessary code in the example provided, I tried to distill it down, but was running into some difficulty with the testing. I think that pretty much any user-defined reduction would see the same results though.