chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.78k stars 419 forks source link

Feature request: ref-to-local #16157

Open bmcdonald3 opened 4 years ago

bmcdonald3 commented 4 years ago

This is an argument for having a local ref feature in Chapel.

When implementing a mink reduction in Arkouda, which is basically just a reduction to return the minimum k values from an array, I was seeing some interesting behavior when using ref vs var to perform my operations. In short, when I ran the exact same block of code with only the variation shown below, I was seeing significant performance degradation, presumably to the current "wide-references" that are used.

Deep copy (where v1._data is a default flat array):

var first = v1._data;
var second = v2._data;

Reference (where v1._data is a default flat array):

ref first = v1._data;
ref second = v2._data;
16-locales execution time in seconds on XC: k val copy (s) ref (s)
10,000 0.20 2.33
100,000 0.48 13.74
1,000,000 2.31 133.05
Single-locale execution time on chapcs (illustrates how when there is no communication this is actually an improvement): k val copy (s) ref (s)
1,000,000 5.43 3.51

This is a huge performance difference, which a user (me before I talked to Brad) would expect the ref to perform better since it is avoiding the copies. I think this is an example where having something like local ref x = y would improve this situation and avoid some array deep copies as well as make things more clear.

The code that was used to gather these results is located here, with the test I ran being merge.chpl (compiled with make test-bin/merge) and the code that illustrates the issue located here I apologize that there is a lot of unnecessary code in the example provided, I tried to distill it down, but was running into some difficulty with the testing. I think that pretty much any user-defined reduction would see the same results though.

bradcray commented 4 years ago

A couple of other thoughts here:

(a) it seems like there's the potential for the compiler to automatically optimize such cases as well, so I think of a local ref feature as being more of a stopgap or expert's tool than an ideal solution (tagging @e-kayrakli as this seems like the kind of compiler-driven optimization that he might enjoy with potentially significant payoff).

(b) this case is a bit interesting in that it's a reference to an array, so there's a bit of a question about whether the local is asserting that all the array's elements are local or that the array descriptor is local (where, in this case, the array is a default array, so the two are one and the same, but for a block array...).

(c) the magnitude of these performance differences caught me by surprise even understanding the root cause.

e-kayrakli commented 4 years ago

Is the issue that when we are accessing via the ref version, we are not using the privatized instance, because in the initial definition of the ref variable we get a wide pointer to the main copy of data that's likely sitting on LOCALE0?

bradcray commented 4 years ago

@mcdobe100 should correct me if I'm wrong, but I believe in his program all the _data fields are local default arrays, so there's no privatized instance since they're not distributed (right?). Also v1 and v2 are local records. So the idea is that a default array field in a local record is local (and could be proven to be with some effort by the compiler), so ought to be able to be referenced directly with a pointer. I think the crux of the bad behavior is that we just always make references wide by default from a "correctness first, performance optimizations later" point of view and have never come back to tune cases like this. In some cases we've introduced developer-oriented tricks to optimize such cases, but I don't think we have anything a user could similarly utilize to tune their code.

For the "what if it were a Block array?" case that I brought up, things are less clear. It seems like local ref would still be accurate if we were on the node where the record existed. But if we were on another locale, you could either imagine saying "that field is not local, so a local ref isn't reasonable." Or you could imagine the compiler saying "this is a privatized field, so I could still do some optimization to refer to it locally rather than back through the original record." That makes this feel more similar to #10160 than I'd realized before... though again, I don't think that's the case @mcdobe100 is worried about today.

e-kayrakli commented 4 years ago

should correct me if I'm wrong, but I believe in his program all the _data fields are local default arrays, so there's no privatized instance since they're not distributed (right?)

The original post says otherwise. Besides, if that was the case, I would expect similar hit in a single-locale run and (just like you) I wouldn't expect to see such a big hit in performance.

That's why I was thinking there is some actual communication going on when you have a ref variable, and that may be explained by not using privatized instances.

I think the crux of the bad behavior is that we just always make references wide by default from a "correctness first, performance optimizations later" point of view and have never come back to tune cases like this. In some cases we've introduced developer-oriented tricks to optimize such cases, but I don't think we have anything a user could similarly utilize to tune their code.

Thanks for this explanation. I wasn't aware of this behavior. It'd be interesting to see what we can do with compiler optimizations to mitigate that. Although, it feels challenging to do an analysis that says "this symbol is local" with a good level of confidence.

I also wanted to express some hesitancy towards having local ref. For one, I agree that it should be a power-user feature. But more than that, I am not sure what that would mean exactly. Does it mean narrow pointer (like a C++ reference, I guess)? Does it mean that accesses on it are always local?

bmcdonald3 commented 4 years ago

Ahh, you are right @e-kayrakli I had said that they were block distributed in the original post, but they are actually local default arrays like @bradcray had mentioned, I have updated the original post to reflect that, sorry for any confusion that caused.

Besides, if that was the case, I would expect similar hit in a single-locale run and (just like you) I wouldn't expect to see such a big hit in performance.

This is true that the same performance hit isn't seen on single locale (in fact just the opposite). I guess from a user standpoint not knowing much about the compiler, it just is confusing to me that a local copy of a local array acts any differently than the actual local array.

e-kayrakli commented 4 years ago

Ahh, you are right @e-kayrakli I had said that they were block distributed in the original post, but they are actually local default arrays like @bradcray had mentioned, I have updated the original post to reflect that, sorry for any confusion that caused.

The performance degradation is very surprising, then. One initial step of understanding what's going on wrong can be to put everything that uses those ref's in a local block which should make all the wide references narrow and hopefully improve the performance. But I agree that we should be able to do that without local blocks.

bradcray commented 4 years ago

I would expect similar hit in a single-locale run

I was interpreting the single-locale performance numbers as meaning CHPL_COMM=none, which would never widen the references, so end up being much cheaper. @mcdobe100, am I correct in that, or are these CHPL_COMM!=none but with -nl 1?

Does it mean narrow pointer (like a C++ reference, I guess)? Does it mean that accesses on it are always local?

I think these are the right questions to wrestle with from a design perspective. In this context, I think the former is what we want (in some way... it doesn't have to be this specific proposal). I'm imagining that when a local ref by this definition crosses an on-clause, it becomes a normal wide reference.

Although, it feels challenging to do an analysis that says "this symbol is local" with a good level of confidence.

We'd obviously have to be conservative, like with any compiler optimization, but I think there's definitely low-hanging fruit here. For example:

var x: int;
ref a = x;  // obviously local

var r: myPodRecord;
ref b = r;  // obviously local
ref c = r.x;  // obviously local

var myArray: [1..10] real;
ref d = myArray;  // maybe not obviously local(?), but an important case to optimize for; or maybe we say that since the array was declared on our same locale, it is obviously local whatever the distribution as a starting point?

and then recursive cases of these (e.g., non-POD record with a field that was an int, or POD, or a DR array.

The reason I think this is important is that ref can be a really nice way of introducing a shorthand for an otherwise complex expression in the language, so it shouldn't become a performance killer. As motivation, look at the generated code for this program:

var x: int;
ref y = x;
y = 42;

for CHPL_COMM=none vs. != none.

bmcdonald3 commented 4 years ago

@bradcray You are right in that, the numbers that saw the improvement were ran with CHPL_COMM=none, not with -nl 1, should have made that more clear.