chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org
Other
1.79k stars 420 forks source link

Index-neutrality of dense LinearAlgebra functions #17624

Open ben-albrecht opened 3 years ago

ben-albrecht commented 3 years ago

The LinearAlgebra library aims to be as index-neutral as possible such that users can use any offsets ~or strides~ in their matrices and vectors (Chapel arrays) that they wish. This issue summarizes the index-neutrality status of all the current dense LinearAlgebra functions as of chpl version 1.25.0 pre-release (1125df1d2b).

[edit by @bradcray : I've struck out "or strides" above for the time being because we currently don't know of a compelling use-case for using strided arrays to represent matrices, and would like to focus on getting core capabilities working without making the job harder than necessary. So instead, we'll focus on index-neutrality].

Methodology

This survey was completed by a combination of reading the source code and writing short tests to check cases that were not clear in the source. Ideally, we would like to have index-neutrality tests for every LinearAlgebra function. However, this approach was not taken for due to time-constraints. That said, we should add tests for the cases that are fixed in this effort in order to lock in the new behavior.

Terminology

A function is considered index-neutral if it meets the following criteria:

For conciseness, I call arrays with an offset of n as n-based arrays, e.g. an offset of 1 (var A: [1..3] real;) is called a 1-based array.

When I say a function assumes indices should match across ranks, that means an array with mismatched indices such as var A: [1..10, 4..12] int; will cause an error. This kind of index-neutrality is not very important for most users, but we ought to support it for completeness.

Index-neutrality survey

Matrix and Vector Factory Functions

Matrix Operations

Matrix Structure

damianmoz commented 3 years ago

For practical purposes in Dense or Sparse Linear Algebra, index neutrality is related to the offset or base, not the stride. In this era of SIMD instructions, then the aim is have algebra which needs only unit-stride indexing.

What is really needed are guidelines on how best to implement such neutrality. Are there any plans for writing such guidelines? I threw up some ideas to kick start that discussion but the feedback from that was pretty well non-existent.

Even something as simple as basic LU decomposition needs to be implemented in an index neutral fashion so it can then be reused as the primary operation to operate on blocks for a blocked and parallelized LU decomposition of a large system

ben-albrecht commented 3 years ago

For practical purposes in Dense or Sparse Linear Algebra, index neutrality is related to the offset or base, not the stride.

I agree that supporting variable stride is less important than supporting variable offset, but I see no reason not to support both and maintaining a fully index-neutral interface.

What is really needed are guidelines on how best to implement such neutrality. Are there any plans for writing such guidelines?

Yeah, I think a "recipe" for how to write index-neutral library routines in Chapel would be nice to have documented officially or unofficially somewhere. My recommendation today is to use reindex where possible, otherwise you're stuck with doing the stride/offset in the computation yourself -- of course, there's probably some more specific recommendations that one could identify in how to do that stride/offset math. Adding @bradcray to the thread, in case he has thoughts on the matter.

ben-albrecht commented 3 years ago

My recommendation today is to use reindex where possible, otherwise you're stuck with doing the stride/offset in the computation yourself -- of course, there's probably some more specific recommendations that one could identify in how to do that stride/offset math. Adding @bradcray to the thread, in case he has thoughts on the matter.

Actually, a first line of defense ought to be trying to write the algorithm such that indices are not used directly, echoing an initial comment of @bradcray's from the Discourse thread. However, when that's not possible, using a reindex is helpful.

bradcray commented 1 year ago

Returning to this topic this week, I'm tripping over @ben-albrecht's comment here:

but I see no reason not to support both and maintaining a fully index-neutral interface.

Specifically imagining the reason to be "less work for authors of the LinearAlgebra module" (to worry about supporting those cases) combined with "unclear value to end-users." Is anyone aware of practical use-cases in the field in which Linear Algebra would benefit from supporting strided matrices? @vasslitvinov did a quick survey of other L.A. systems like Matlab and didn't see any indication of support there. That's not a reason not to do it if there's a need, but if there's not, it is extra work and may just make Chapel's L.A. module look weird.

ben-albrecht commented 1 year ago

I know image data can be strided and often utilizes linear transformations, as an example. That said, I agree that it's not worth the extra effort without user(s) requesting this feature specifically.

bradcray commented 1 year ago

Ben! (for some reason I didn't think you were receiving notifications here anymore...).

I usually think of image data as being more like a strided array than a strided matrix (in the sense that I imagine the types of operations performed on it to typically be structural operations like stencil operations and the like rather than linear algebra operations like matrix multiplies). But are you saying that these linear transformations would want to call L.A.-style routines on the image array itself, treating it like a matrix?

(I also want to make clear that I'm not dead-set against supporting strided matrices in L.A. eventually or if/when there is a need for them. Just trying to determine how much to worry about them in the short-term given that the L.A. module isn't getting a ton of love or attention as it stands...)

damianmoz commented 1 year ago

My personal opinion is that supporting strided data is extra work when there is still a lot of other things to be done on which that effort seems better spent. Just my 2c.

Also, there is an (often huge) performance hit to working with strided data. Can a user afford that?

P,S, Is not strode the past tense of stride rather than strided. Even the past participle is stridden. ????

bradcray commented 1 year ago

I obviously agree with the first point. As to:

Also, there is an (often huge) performance hit to working with strided data. Can a user afford that?

It seems to me that a user who opts into striding is making this decision for themselves (e.g., sparse matrices can have higher access costs than dense matrices, but that doesn't mean a library or language shouldn't support sparse matrices; users just need to be aware of the consequences).

damianmoz commented 1 year ago

I just checked with some of the image processing gurus I know, and they pull the data out of the image to process it. Any yes, they tend to think of the image data as a stridden array rather than a stridden matrix, as each 'element' of an array might refer to data associated with a point on a 2D surface (like an aerial survey or a search region for a submarine or satellite image) or a point in a 3D region (like an ore body scan or brain from reconstituted X-ray slices) pr a point in 4D time and space, and so on. Except for trivial mathematics like the aforementioned linear transformations, they avoid any work on the image data structure itself.

damianmoz commented 1 year ago

As @bradcray perceptively noted

It seems to me that a user who opts into striding is making this decision for themselves (e.g., sparse matrices can have higher access costs than dense matrices, but that doesn't mean a library or language shouldn't support sparse matrices; users just need to be aware of the consequences).

Agreed. But wouldn't you then different routines for different scenarios. One for dense matrices (with unit stride in any dimension like LAPACK, LINPACK, PLASMA, MAGMA or FLAME), one for sparse matrices with special processing to mitigate the higher access cost of any non-zero data, and maybe one for the case of non-unit stride. Trying to make a routine for dense matrices performant for non unit-stride makes no sense to me. But then, my applications are specific to my field. My point is that for LinearAlgebra, getting index-neutrality working with unit stride across the board might seem to make more sense than attacking the non-unit stride case unless there are other reasons driving priorities.

Currently, there are no comments on consequences. For example, the transpose routine should have a disclaimer that the performance is bad for smallish matrices where smallish means 64*64 but feel free to change this based on your experience.

bradcray commented 1 year ago

But wouldn't you then different routines for different scenarios. One for dense matrices (with unit stride in any dimension like LAPACK, LINPACK, PLASMA, MAGMA or FLAME), one for sparse matrices with special processing to mitigate the higher access cost of any non-zero data, and maybe one for the case of non-unit stride.

I think my ideal L.A. library would have a routine like matVectMult() (or whatever it was called) and work with either dense or sparse matrices, where the implementation would use the static information in Chapel's type system to specialize for those different cases rather than handling them with some sort of unified piece of code. I.e., I don't see much productivity value in being forced to write denseMatDenseVectMult() vs. sparseMatDenseVectMult() vs. ...

My point is that for LinearAlgebra, getting index-neutrality working with unit stride across the board might seem to make more sense than attacking the non-unit stride case unless there are other reasons driving priorities.

Again, that's what I'm trying to say as well. I'm just not convinced that performance differences between dense and strided ranges should be a significant factor in the discussion and decision.

damianmoz commented 1 year ago

With regard to

I think my ideal L.A. library would have a routine like matVectMult() (or whatever it was called) and work with either dense or sparse matrices, where the implementation would use the static information in Chapel's type system to specialize for those different cases rather than handling them with some sort of unified piece of code. I.e., I don't see much productivity value in being forced to write denseMatDenseVectMult() vs. sparseMatDenseVectMult() vs. ...

Agreed, But stride is not a static (or a param object of a matrix) so it is a bit different. Also

const r1 = 1 .. 10 by 1;
consr r2 = 1 .. 10;

defines two different ranges, one stridden and the other not - which can be hard to pick. Or am I confused (or have the bytes got jumbled crossing the Pacific today).

bradcray commented 1 year ago

The ability for a range, domain, or array to store a stride other than 1 is static. The specific stride when it's non-unit is not. So I'm arguably betting that anyone who wants the common case of unit-stride matrices would write 1..10 rather than 1..10 by 1 as their dimension (and even if they did write the latter, my preferred proposal for the direction of the new stride enum values would still infer it to be uint stride in this case, so nothing would be lost).

All that said, this makes it sound like I'm arguing that L.A. should support non-unit-stride matrices from the get-go, which I'm not an advocate for. I'm just saying that it could without penalty to the unit stride cases.

damianmoz commented 1 year ago

I just looked up stridable and indeed, it is a param. I was focusing on the stride itself. I cannot figure out how to use it against a domain because it fails. Given this where clause,

proc patranspose(AT : [?D] ?R, const A : [?S] R, bs = 32, pa = true) where
    A.rank == 2 && A.isRectangular() && A.hasSingleLocalSubdomain()

how would I restrict the argument to be something which is not stridable please?

damianmoz commented 1 year ago

Upon rebooting my brain, I can answer my own question

proc patranspose(AT : [?D] ?R, const A : [?S] R, bs = 32, pa = true) where
    A.rank == 2 && A.isRectangular() && A.hasSingleLocalSubdomain() && !A.stridable
ghbrown commented 1 year ago

Personally, I don't mind implementing (or attempting to implement) index-neutral routines but

  1. I haven't worked on the most complicated routines yet
  2. I don't yet fully understand the performance consequences and how they might be mitigated.

Per Ben's discussion a while ago, is there benchmark problem we could use to compare: dense, index independent, and reindexed routine versions? I'd be happy to work on development and benchmarking for this, and perhaps it would give me some insight into performance.

damianmoz commented 1 year ago

I do not know any particular benchmark.

If I was designing such a development and I sadly do not have the time at the moment, I would implement a benchmark implementation which is preferably a 1-based LAPACK or similar equivalent for which there is likely heaps of documentation. A 1-based implementation is both easier read and easier to correlate against the best textbooks and technically papers and makes more sense to me. We have an extensive library of linear algebra texts and they are all written using conventional 1-based indexing.

Once you have a working baseline benchmark, then you start changing it be index-neutral focusing on readability.

Does the way we wrote patranspose as seen in #20803 make sense to you. If not, we have obviously failed badly in our aims in writing that example. In that case, we compared against the LinearAlgebra routine transpose which was a ready-made benchmark.

Then again, the LInearAlgebra routine which does LU factorization is numerically unstable so iit might be a really bad benchmark. It does not do pivoting. I think the same can be said for the QR decomposition example somebody wrote a while but I cannot find the reference. I do not mean #13893.

bradcray commented 1 year ago

@ghbrown : I would avoid using reindexing to make L.A. routines index-independent—the overheads are currently higher than anyone would likely want to pay, and I'm skeptical that they will ever be completely negligible.

If I were writing L.A. routines today, I'd generally suggest having them query relevant dimensions at the outset of the program and then use those dimensions within any loops or computations involving the dimensions, for example:

proc myLARoutine(A: [?D] ?t) {
  const (rows,cols) = D.dims();
  // compute on rows and/or cols here
}

Computations that can use rows and cols wholesale in this way should be both index-independent and efficient. Where things tend to get tricky is when rows/cols need to start being picked apart (like rows.low, rows.high and the like), because then there's a question of whether to write loops like:

for r in rows.low..rows.high-i do  // assumes stride == 1

or:

for r in rows.low..rows.high-i by rows.stride do  // general-purpose, but may add overhead if stride happens to be 1

If we decide that index-neutrality means "must support matrices with non-unit strides" then the latter would need to be written. If we decide to have L.A. routines require a stride of 1 until someone comes up with a compelling use case for non-unit stride, then the former is fine (though ideally, the routine would also assert/check that the stride truly is 1).

damianmoz commented 1 year ago

Please note that my field of application of Linear Algebra (LA) is by definition to what I use it for. So what I say might be wrong for others in their own fields of use.

@bradcray mentioned

I would avoid using reindexing to make L.A. routines index-independent. the overheads are currently higher than anyone would likely want to pay, and I'm skeptical that they will ever be completely negligible.

When there is more than one array in a parameter list to an LA routine, trying to code things in an index neutral sense without using reindex is an absolute nightmare. Even with a single matrix doing Gaussian elimination where the matrix in question is itself a submatrix of a larger matrix had my head spinning. And it still hurts every time I even think about it. I did a transpose with (and without) a reindex). Comparing the two, the overhead seems small until I get up around a 400*400 matrix and by then, memory bandwidth is starting to dominate so who cares. Using a reindex avoids having to carry offsets all around the place in that case. My code with reindex reads a lot better. I would be curious at other's experience But mine is quite positive with reindex.

Your suggestion about querying the dimensions at the start works for me, even when I then have to pick them apart. Except of course where there are multiple matrices involves, each potentially with their own domain, albeit domains consistent in size.

As for the potential need to support non-unit stride, let's look at other commercial libraries that bundle LAPACK routines as part of their offering. None that I see provide an interface to LAPACK that supports a non-unit stride so that suggests that the demand/need for the same is small. But if some Chapel user has the need and if the LinearAlgebra are written clearly enough, they should be able to grab that code and hack the code slightly to support their far-less common case. My own experience and that of my clients and colleagues suggests that non-unit stride is almost never necessary. Besides, the numerical analysis community has spent decades developing optimal algorithms that exploit unit stride between elements and has ignored anything to do with non-unit stride. I would be curious who thinks they need to use LA algorithms on data with non-unit stride They might be shooting themselves in the feet.

bradcray commented 1 year ago

@damianmoz : Regarding:

Your suggestion about querying the dimensions at the start works for me, even when I then have to pick them apart. Except of course where there are multiple matrices involves, each potentially with their own domain, albeit domains consistent in size.

In your LA experiences, do you require interactions between matrices in which corresponding indices differ? E.g., I could imagine a matrix multiplication routine requiring not just that the inner dimensions must be the same size, but that they must also use common indices. Thus, I could matmult {1..m, 1..n} by {1..n, 1..o} or {0..<m, 0..<n} by {0..<n, 1..o}, but not {1..m, 1..n} by {0..<n, 1..o} (the goal being for the implementation to have a single common range representing the inner dimension of both matrices). If a user's matrices did not meet this constraint, they could reindex before passing the matrix into the routine.

Would such simplifying assumptions ease the kind of pain you felt in the Gaussian elimination? Or are they oversimplified? (i.e., must one be able to multiply two matrices or submatrices based only on shape and not the indices used?).

Thanks, -Brad

damianmoz commented 1 year ago

@bradcray asked

In your LA experiences, do you require interactions between matrices in which corresponding indices differ? E.g., I could imagine a matrix multiplication routine requiring not just that the inner dimensions must be the same size, but that they must also use common indices.

I would ordinarily say yes. But in this era of index neutrality to which I am only now adjusting my thinking, I cannot say.

The Gaussian elimination stuff was my first attempt at index neutral programming. I was using it in a scenario where the master matrix was broken into hypermatrices and I was trying to write code which would work with the individual hypermatrices which by definition had domains which were not {1..n, 1..n} but rather {Kn+1..(K+1)n, Ln+1..(L+1)n} in general. If I attacked it now I would probably not be as stressful. We need more documented examples and probably an approach. I will post the index neutral patranspose which does not need reindex in the next few days once it has gone through a bit of QA.

As I said, I am a novice in index neutral programming of Linear Algebra routines.

The only simplification which works for all people in the small sample to which I have spoken is unit-stride. I do not think we have enough experience yet to simplify anything else. I am still waiting on the somebody to tell me when they would want to use LA routines on non-unit stride data structures. I can think of some but that is only because they are structured badly. One example I have in our work, but not related to Linear algebra, is that sometimes we think of points in 3D space, So an array of 1..N of an array of 1..3 of real, the latter for x, y and z. At other times, you want to only think of the values in one of those dimensions, say x and want then in a 1..N vector. We just rearrange the data externally so we can have unit stride. In these days of multi-level cache memory and vector processing, you need things unit stride across the board.

I did notice that some coding practices which appeared to work with 0-based indices and 1-based, then failed for more complex ranges of indices. So, I these days, I debug with weird ranges and then do final tests on the more common two.

ghbrown commented 1 year ago

@bradcray @damianmoz

Iterations

I've done a bit of performance testing on two scenarios. The first is purely iterations around overwriting of a single value. I have three nearly identical functions that vary only in stridedness: implied unit, explicit unit ( code shown below), and nonunit.

proc iterate_xunit(num_it) {
  // explicitly unit stride (by 1)
  var b : num_it.type;
  var t : Timer;
  t.start();
  for i in 1..num_it by 1 {
    b = i;
  }
  t.stop();
  writeln('iterations, explicit unit: ',t.elapsed());
}

The performance timings are

iterations, implied unit:  0.548651
iterations, explicit unit: 0.480325
iterations, nonunit:       0.729121

and we see the implied and explicit unit stride are about equally fast (though quite frequently and surprisingly the explicit unit stride is a bit faster), and the nonunit stride incurs about 50% overhead.

1-norm

Next, I chose a very simple numerical problem (1-norm of a vector) as a benchmark. In an attempt to increase apparent overhead of indexing and expose the worst case I used a small element type (uint(8)) and accumulated into a variable of the same type (overflow expected).

Here I have five nearly identical functions for different strides/approaches: element only (no indices needed), implied unit, explicit unit (code shown below), nonunit (strided indexing), and nonunit (reindexing). Of course, the "element only" approach is somewhat like reduce, and will not generally be applicable for linear algebra routines like factorizations, etc. (After looking at Brad's comment above, I may be missing a strategy where I deal with the domain "wholesale" (probably by for i in A_range?).)

proc norm_xunit(n) {
  // 1-norm using explicit unit stride
  var d : domain(1,stridable=true) = {1..n by 1};
  var A : [d] uint(8); // choose small type to expose overhead
  var t : Timer;
  var norm : uint(8) = 0; // oveflow will happen (it's okay)
  fillRandom(A);
  t.start();
  var (A_range,) = A.dims();
  var v : uint(8);
  for i in A_range.low..A_range.high by A_range.stride {
    v = A[i];
    norm += v;
  }
  t.stop();
  writeln('2-norm, explicit unit:     ',t.elapsed());
}

The performance timings (not comparable to the above timings) are

1-norm, elements:          0.176159
1-norm, implied unit:      1.60254
1-norm, explicit unit:     3.61243
1-norm, nonunit:           3.63221
1-norm, nonunit (reindex): 2.01841

Here we see:

As Damian suggested, reindexing is surprisingly low overhead and looks (from this simple test) to be the fastest option when one needs to deal with indexing of a strided array (and I concur that it's probably the most readable approach).

My plan (hopefully after some feedback on these experiments as to their validity and what might be happening in some of the more surprising places) is to apply as many of these approaches as possible to matrix-matrix multiply (GEMM), since that's only a bit more complicated and seems to have come up recently in the thread.

bradcray commented 1 year ago

One example I have in our work, but not related to Linear algebra, is that sometimes we think of points in 3D space, So an array of 1..N of an array of 1..3 of real, the latter for x, y and z. At other times, you want to only think of the values in one of those dimensions, say x and want then in a 1..N vector.

FWIW, I don't think of this as a case that needs non-unit stride from L.A. The striding you're talking about here is in the memory space rather than the logical index space, so it feels like a case where you'd want to send an array slice like A.x into the L.A. routine which would still be unit-stride from the routine's perspective.

damianmoz commented 1 year ago

@bradcray, yes, I agree with your comment. I just put it down for reference.

@ghbrown, Are you going to do a blocked version of GEMM or a more naive version. Chapel's ability to vectorise (which is still a bit rough around the edges) or do partial reductions will slow things down with GEMM. Please not the size of matrix you will you and what size cache you have?

damianmoz commented 1 year ago

@bradcray, on thinking about it further, let's keep ranges consistent, so in the matrix example, the inner ranges match as your example notes.

ghbrown commented 1 year ago

@damianmoz I was planning on a naive version to start for simplicity's sake. I will be sure to present the matrix size and my cache size with the performance results.

When trying to extract usable data on these sorts of iteration overheads would you suggest testing on matrices that fit in cache or ones that are too big for cache? If they should fit in cache, should I make sure all three matrices (AB = C) fit in cache together?

damianmoz commented 1 year ago

@ghbrown, I like your idea to keep it simple at the start.

I would suggest using small matrices which you know will fit into cache and then scaling up from there. The

config const n = <WHATEVER>

is your friend. The graph will tell you when you go out of cache at which point, the results get a little harder to interpret.

Lets hope others can offer their words of wisdom here also.

bradcray commented 1 year ago

Based on the recent discussion here and a chat with @ghbrown today, I've updated the first few paragraphs of the OP to mark strided arrays as not something we should focus on in the near-term (but are still open to longer-term if a motivating use-case comes up). We've got enough work to do to get L.A. working and performing across all routines with arbitrary indices and unit stride without having to worry about arbitrary stride in the first pass.