Accessing non-adjacent memory in a parallel statement

damianmoz commented 5 years ago

To access elements in a column of anyt real matrix (except a tiny one), I would likely do

var xcolk = x[i..j, k];
....
// say
xcolk += v[i..j] * t;
....
x[i..j, k] = xcolk;

This has the advantage that I am working with a copy of the column stored contiguously in memory. It has the copy overhead, plus twice I am accessing memory all over the place (and probably doing naughty things outside the cache) but only just the two times.

Should I instead use a ref-erence

ref xcolk = x[i..j, k]

which, while I avoid the copy out and back in, potentially means I am going all over memory whenever I am doing things with xcolk? And also, that column may not be nicely aligned on a 128-bit boundary (for AVX-128 vector instructions or similar nice alignment issues if using vectorization)

mppf commented 5 years ago

@benharsh - you might know the answer to this question

benharsh commented 5 years ago

I suspect it would be best to create xcolk as a ref because it avoids array allocation and bulk copies of the temporary array and array slice. If through measurement you find that the ref pattern is slower, I would be interested in seeing the code to learn why that is, and whether we can further optimize our arrays to avoid problems in this case.

bradcray commented 5 years ago

A couple of other thoughts on this issue:

When thinking about focusing on a lower-dimensional slice of a higher-dimensional array, you may want to consider using a rank-preserving slice x[i..j, k..k] which has slightly lower overhead than the similar rank-changing slice x[i..j, k]. However, it has the downside that since it's still 2D, it can't be used in whole-array statements that mix 1D and 2D arrays like this case would. So this is unlikely to help with this specific idiom, but I wanted to mention it while we were talking about costs related to slicing.
You could always write out the loop in the fully scalar style:
```
forall r in i..j do
x[r, k] += v[r] * t;
```
which has the downside of being (arguably) less elegant, but is likely to have the best performance if that's what matters the most (because neither array copy nor "array view" descriptor is needed).

Whether or not each approach is acceptable, performance-wise, probably depends on the values of i and j. Factoring the computation itself out of the discussion temporarily:
- Creating an array view descriptor (the array slice ref approach, effectively) costs something to build the array descriptor, but that should be O(1) for local arrays;
- Creating the copy has a similar O(1) cost plus the O(n) cost for the copy of the elements.
- The scalar approach is essentially free.
So if j-i was sufficiently large then the O(1) cost of the array slice approaches should be amortized away and the elegance benefits worth it. But if that difference was quite small, the scalar approach might start winning due to not needing to create array descriptors.
Of course, all of the above was factoring out the computation, and part of your concern is whether or not the column-wise striding is going to kill your performance in the cache or not. That's another of those calls that's likely very platform-dependent and difficult to predict (and also very dependent on the values of i and j). I've known architectures / back-end compilers that have done a reasonable job with strided accesses, but that's obviously not a given. It typically would not occur to me to do the copy in a case like this until such time as I knew that the lack of cache benefits was killing me.

I also wanted to mention that, if it's helpful to you, we've recently been working on supporting a mode in which Chapel arrays can be stored in column-major (rather than row-major order). Of course, using that mode may just give you the dual problem for your row-wise accesses if your algorithms use both, but if you had an array or phase of computation that really preferred to use column-wise storage, we could make that happen. (Currently this feature is whole-program only, but there's a next step to be able to specify it on a per-array basis. See issue #11291).

mppf commented 5 years ago

Of course, all of the above was factoring out the computation, and part of your concern is whether or not the column-wise striding is going to kill your performance in the cache or not.

It's also possible to make prefetch calls in Chapel when the CPU isn't understanding the data access pattern. The Prefetch package module should work for this.

chapel-lang / chapel

Accessing non-adjacent memory in a parallel statement #11935