fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License

Parallel linalg #67

Open certik opened 4 years ago

certik commented 4 years ago

The modern Fortran API for a serial linear algebra (#10) seems natural.

How would that be extended to work in parallel using co-arrays? If there is a similarly "natural" parallel API for linear algebra using modern Fortran, it would be a good candidate for inclusion in stdlib. We could then have different backends that do the work (ScaLAPACK, ..., perhaps even our own simpler reference implementation using co-arrays directly). That way, if somebody writes a faster third-party library, it could be plugged in as a backend without user codes needing to change, because they would already be using the stdlib API for parallel linear algebra.
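As a concrete (purely hypothetical) sketch of what such a backend-agnostic API could look like: the module name `stdlib_parallel_linalg`, the routine `psolve`, and the distribution scheme are all invented for illustration; nothing like this exists in stdlib today.

```fortran
! Hypothetical sketch of a backend-agnostic parallel solve API.
! All names (stdlib_parallel_linalg, psolve) are invented; each
! backend (ScaLAPACK, native co-arrays, ...) would provide an
! implementation of this interface in a submodule.
module stdlib_parallel_linalg
  implicit none
  private
  public :: psolve

  interface
    module subroutine psolve(a, b, x)
      real, intent(in)  :: a(:,:)[*]   ! distributed matrix (one block per image)
      real, intent(in)  :: b(:)[*]     ! distributed right-hand side
      real, intent(out) :: x(:)[*]     ! distributed solution
    end subroutine psolve
  end interface
end module stdlib_parallel_linalg
```

The point of the submodule interface is exactly the backend-swapping described above: user code compiles against the interface only, and the linked implementation decides how the work is done.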

certik commented 4 years ago

@zbeekman you have a lot of experience with co-arrays, is there a way to do this?

jvdp1 commented 4 years ago

The modern Fortran API for a serial linear algebra (#10) seems natural.

Would this API also include shared-memory parallelization? Especially if it is based on BLAS/LAPACK.

certik commented 4 years ago

Would this API also include shared-memory parallelization?

In the above I was thinking of distributed memory parallelization (MPI, co-arrays, ...).

What are the options for shared-memory parallelization in Fortran? I am aware of do concurrent and OpenMP. It seems to me, and I could be wrong, that distributed memory is the most useful in terms of utility: it can still be run on a shared-memory computer (i.e., a single node), but it can also be run on an HPC cluster. Most of the codes that I have been working with use MPI; they rarely use OpenMP. That being said, I did write an OpenMP version of CSR matmul in my code, and it gives only about a 2x to 4x speedup on 32 cores. Terrible performance, but expected, since the kernel is memory bound. I do not have an MPI version of CSR matmul, but I would expect it to run faster, because the memory would be distributed across the cores.
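For reference, the kind of memory-bound kernel being discussed looks roughly like this (a minimal sketch, not the code from my project; the standard CSR arrays `row_ptr`, `col_idx`, `val` are assumed):

```fortran
! Minimal sketch of an OpenMP-parallel CSR matrix-vector product.
! Memory bound: each nonzero is touched once, so speedup saturates
! at the memory bandwidth, not the core count.
subroutine csr_matvec(n, row_ptr, col_idx, val, x, y)
  implicit none
  integer, intent(in)  :: n
  integer, intent(in)  :: row_ptr(n+1), col_idx(:)
  real,    intent(in)  :: val(:), x(:)
  real,    intent(out) :: y(n)
  integer :: i, j
  real :: s

  !$omp parallel do private(j, s)
  do i = 1, n
    s = 0.0
    do j = row_ptr(i), row_ptr(i+1) - 1
      s = s + val(j) * x(col_idx(j))
    end do
    y(i) = s
  end do
  !$omp end parallel do
end subroutine csr_matvec
```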

zbeekman commented 4 years ago

In general, one should be able to implement parallel LA algorithms using coarrays. The coarray implementation may be shared memory, distributed memory, hybrid, etc. the standard doesn't specify. Part of the point of coarrays is to have a simpler API and programming model that can be divorced from the underlying implementation.
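As a toy illustration of that model (my addition, not from the thread): the same source runs unchanged whether the images are threads on one node or processes across a cluster. Run with multiple images, e.g. `cafrun -n 4 ./a.out` with OpenCoarrays.

```fortran
! Toy coarray-style SPMD example: a distributed dot product where
! each image holds one local block of the vectors. co_sum is the
! Fortran 2018 collective reduction across all images.
program coarray_dot
  implicit none
  integer, parameter :: nloc = 1000   ! local block size per image
  real :: x(nloc), y(nloc), s

  call random_number(x)
  call random_number(y)

  s = dot_product(x, y)   ! local partial result on this image
  call co_sum(s)          ! global reduction across all images

  if (this_image() == 1) print *, 'dot =', s
end program coarray_dot
```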

The trickier question is, perhaps, what should the interface look like? How much ownership and control should the client code have over the objects? Should the user create and pass coarrays? Or should there be a global array view that makes it appear as though you're working with normal arrays?

Last I checked there were some non-trivial issues with the coarray specification in the standard that make them challenging or impossible to use in some applications, especially computations on unstructured meshes and some other graph and graph-like algorithms. I don't recall the details, but I believe Salvatore Filippone (PSBLAS author) submitted a proposal to J3 to resolve it, or at least to highlight the issue in the standard.

Intel provides a shared-memory coarray implementation on some platforms with some licenses, if I remember correctly. I think that without Parallel Studio Cluster Edition, the Intel Fortran compiler provides a shared-memory coarray implementation, and that the Cluster Edition license unlocks the MPI backend (or at least the SDK/compile-time parts).

Using coarrays is nice because it abstracts away the backend. OpenCoarrays' main backend is MPI, but we have an experimental/partial one based on OpenSHMEM, and at one point in the past we were using GASNet. So I think coarrays are a natural and good choice for parallelism, though a few issues remain.

OpenMP is nice because of its built-in conditional compilation and its support for GPUs and accelerators. Managing thread affinity and avoiding other threading issues is certainly tricky, however.

ivan-pi commented 4 years ago

The book by Numrich, Parallel Programming with Co-arrays, discusses an API for both sparse and dense linear algebra using co-arrays.

I know that for PSBLAS they recently developed a co-array backend. A recent article discusses the topic (a draft is available somewhere on GitHub).

zbeekman commented 4 years ago

If we can use or adopt parts of PSBLAS that would be nice, rather than reinventing the wheel.

certik commented 4 years ago

@zbeekman I was led to believe at the latest J3 meeting that co-arrays can be used today with GFortran, Intel, and Cray for anything that MPI can be used for, including unstructured meshes (that was my first question to them). But I haven't used co-arrays myself yet.

My understanding is also that you can mix and match co-arrays with MPI, is that correct?

I would go ahead and try to figure out what the API should look like using co-arrays, and if we like it, we can work towards putting it into stdlib. If we can't agree on a good way due to fundamental limitations of co-arrays, then let's submit proposals to the J3 committee to fix it.

I would think exposing co-arrays directly to the user would be the natural lowest-level API, similar to the serial linalg API that just operates on arrays. Then we can always see whether there is some good optional higher-level API, whether object oriented or some global object (state?), similar to how there can be an optional OO API on top of the serial linalg. Let's brainstorm this further with some examples.
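To make the two levels concrete, here is a hedged sketch (all names invented for illustration): a low-level routine operating directly on a user-owned coarray, plus an optional derived type wrapping the local block in SPMD style.

```fortran
! Sketch of the two API levels discussed: a low-level routine taking
! a coarray dummy argument, and an optional OO wrapper. Names
! (plinalg_sketch, dist_vector, coarray_norm) are invented.
module plinalg_sketch
  implicit none

  ! Optional higher-level view: each image owns one local block.
  type :: dist_vector
    real, allocatable :: block(:)
  contains
    procedure :: norm => dist_norm
  end type dist_vector

contains

  ! Low-level API: the user creates and passes the coarray directly.
  function coarray_norm(x) result(r)
    real, intent(in) :: x(:)[*]
    real :: r
    r = sum(x**2)      ! local contribution
    call co_sum(r)     ! reduce across images
    r = sqrt(r)
  end function coarray_norm

  ! Same operation through the OO wrapper.
  function dist_norm(self) result(r)
    class(dist_vector), intent(in) :: self
    real :: r
    r = sum(self%block**2)
    call co_sum(r)
    r = sqrt(r)
  end function dist_norm
end module plinalg_sketch
```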

@ivan-pi thanks for the pointers --- both links contain very useful info. They have done a lot of thinking about this, so we should see if we can use their API.

jvdp1 commented 4 years ago

Most of the codes that I have been working with use MPI; they rarely use OpenMP. That being said, I did write an OpenMP version of CSR matmul in my code, and it gives about a 2x to 4x speedup on 32 cores.

@certik I usually rely on Sparse BLAS for such operations (http://www.netlib.org/utk/people/JackDongarra/etemplates/node381.html), mainly with the MKL version.

jvdp1 commented 4 years ago

The book by Numrich - Parallel Programming with Co-arrays discusses an API for both sparse and dense linear algebra using co-arrays.

I think it would be a good start. @ivan-pi Do you know if the library on which the book is based is available somewhere? Many articles by Numrich mention it, but I am not sure whether it has ever been released.

ivan-pi commented 4 years ago

@ivan-pi Do you know if the library on which the book is based is available somewhere? Many articles by Numrich mention it, but I am not sure whether it has ever been released.

I have not found the library anywhere, and the book does not offer a link either. The book mostly contains subroutine prototypes, descriptions of the variables, and some discussion of the API design.

zbeekman commented 4 years ago

@zbeekman I was led to believe at the latest J3 meeting that co-arrays can be used today with GFortran, Intel, and Cray for anything that MPI can be used for, including unstructured meshes (that was my first question to them). But I haven't used co-arrays myself yet.

Yes, this is more or less true. I don't remember the particular issue, but I recall that @sfilippone found a subtlety in the standard that caused a large headache/impediment in realizing the more complex data structures/machinery needed for unstructured meshes. I cannot immediately recall the details; maybe the OpenCoarrays repo has issues discussing this, or maybe Salvatore can remind me here.

My understanding is also that you can mix and match co-arrays with MPI, is that correct?

Yes, in theory this should be true. One complication is that if coarrays are implemented via MPI, the compiler-provided Fortran runtime is responsible for initializing MPI, which may not be ideal in certain situations. I think we implemented a configure-time option in OpenCoarrays to return the global communicator to the user, or to delay MPI_Init() and let the user call it. I'd have to double check.
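A common defensive pattern when mixing the two (a sketch of the general idea, not OpenCoarrays-specific code; assumes an MPI library with the `mpi_f08` module):

```fortran
! Sketch of mixing coarrays with MPI: if the coarray runtime has
! already initialized MPI, the application must not call MPI_Init
! (or MPI_Finalize) itself.
program mixed
  use mpi_f08
  implicit none
  logical :: already_up
  integer :: rank

  call MPI_Initialized(already_up)
  if (.not. already_up) call MPI_Init()

  call MPI_Comm_rank(MPI_COMM_WORLD, rank)

  ! MPI ranks are 0-based, coarray images 1-based; with an MPI-backed
  ! coarray runtime they typically (but not portably) satisfy
  ! rank == this_image() - 1.
  if (rank /= this_image() - 1) print *, 'rank/image mismatch on rank', rank

  if (.not. already_up) call MPI_Finalize()
end program mixed
```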

sfilippone commented 4 years ago

Hi there, Zaak is correct: there is a problem with the standard. It arises as soon as you want to have a coarray component of a derived type. If you have a component in a derived type, which may itself sit inside another derived type, and so on, you have a hierarchy of "container" objects which ultimately includes a coarray. Under the current standard, none of the containers may be ALLOCATABLE (whereas the coarray component itself is pretty much forced to be allocatable). This implies that the set of entities that may either be a coarray or contain a coarray component has to be fixed at compile time.

I have proposed a change in the standard to lift this restriction. I did not attend the latest meetings of the committee, but my colleague Damian Rouson, who coauthored the proposal, did attend, and as far as I understand the proposed change was approved. How long until it is supported in compilers, I have no idea.

Hope this helps,
Salvatore
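The restriction Salvatore describes can be illustrated with a small sketch (illustrative code, not from PSBLAS or the proposal):

```fortran
! Illustration of the coarray-component restriction. The coarray
! component must be allocatable, but no container in the chain that
! holds it may itself be allocatable (or, indeed, an array).
module restriction_demo
  implicit none

  type :: inner
    real, allocatable :: a(:)[:]   ! coarray component: must be allocatable
  end type inner

  type :: outer
    type(inner) :: content              ! OK: non-allocatable scalar container
    ! type(inner), allocatable :: bad   ! rejected by the standard: allocatable
    !                                   ! container of a coarray component
  end type outer

  ! Consequence: every entity that (transitively) contains a coarray must
  ! be a fixed, non-allocatable scalar, declared at compile time:
  type(outer) :: fixed_obj   ! cannot be allocated or grown at run time
end module restriction_demo
```

This is exactly what hurts dynamic, graph-like data structures such as unstructured-mesh halos, where the number of such objects is only known at run time.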
