fortran-lang / stdlib

Fortran Standard Library
https://stdlib.fortran-lang.org
MIT License

Concurrencies in stdlib (do concurrent) #429

Open awvwgk opened 3 years ago

awvwgk commented 3 years ago

Should we use do concurrent in stdlib for parallel concurrencies?

There has been a lot of discussion about the do concurrent construct in Fortran (j3-fortran, discourse), especially with respect to the question of whether a concurrency should imply parallel execution, which some compilers (Intel, NVIDIA) already support. I don't want to reopen the discussion about the do concurrent construct here; instead I want to discuss how we can make the best use of do concurrent in stdlib.

From my experience, constructs inside do concurrent that are concurrent but not parallel can cause issues with compilers that aggressively parallelize concurrencies. I therefore suggest using do concurrent only for parallel concurrencies, unless the locality specification is given explicitly (a Fortran 2018 feature).
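As a sketch of that suggestion (the subroutine and variable names are invented for illustration), a concurrency meant for parallel execution can carry explicit Fortran 2018 locality specifiers, so a parallelizing compiler cannot mis-scope the per-iteration temporary:

```fortran
! Sketch only: a concurrency whose iterations are genuinely independent,
! annotated with explicit Fortran 2018 locality specifiers.
subroutine scale_columns(a, s)
  real, intent(inout) :: a(:, :)
  real, intent(in)    :: s(:)
  real :: factor

  ! default(none) forces every variable to be given a locality,
  ! local(factor) gives each iteration its own temporary, and
  ! shared(a, s) documents the arrays visible to all iterations.
  do concurrent (integer :: j = 1:size(a, 2)) &
      default(none) shared(a, s) local(factor)
    factor = s(j)                 ! per-iteration temporary
    a(:, j) = a(:, j) * factor    ! column j is touched only by iteration j
  end do
end subroutine scale_columns
```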

It is important that we test the parallel concurrencies in our continuous integration workflows; this means we actually have to compile a parallel version of stdlib and enable compiler support for parallelization of do concurrent in our build files, which we are currently not doing.
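As an assumption-laden sketch of what such build support might look like (the flag spellings vary between compiler versions and should be verified against each compiler's manual before use):

```cmake
# Sketch only: per-compiler flags that request parallel execution of
# do concurrent. Treat each flag as an assumption to verify.
if(CMAKE_Fortran_COMPILER_ID STREQUAL "NVHPC")
  add_compile_options(-stdpar=multicore)   # or -stdpar=gpu for offload
elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "Intel")
  add_compile_options(-parallel)           # auto-parallelization, incl. do concurrent
elseif(CMAKE_Fortran_COMPILER_ID STREQUAL "GNU")
  add_compile_options(-ftree-parallelize-loops=4)
endif()
```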

LKedward commented 3 years ago

Would this be mutually exclusive with #213 (OpenMP)? If so, I would think OpenMP is preferable due to better compiler support, no? For those with less experience with do concurrent (me): can you elaborate on what you mean by parallel and non-parallel concurrencies in do concurrent, with examples, and on the issues you mentioned with the latter? Another basic question (sorry): is do concurrent deterministic? This will affect whether we can effectively test it in our CI.

awvwgk commented 3 years ago

I think for our purposes concurrencies and OpenMP are in principle similar, and therefore more or less mutually exclusive.

So far it seems I have only been able to explore how do concurrent doesn't work in my projects (I tried using it with atomic updates in a reduction and in combination with OpenMP parallel loops; neither worked out that well). To be fair, I haven't invested the time to port one of my larger applications completely to do concurrent yet, so I can't say how it would compare in practice to OpenMP- or OpenACC-parallelized code.

Given that OpenMP offers more flexibility than just parallelization of loops (tasks, sections, ...) and coarrays/collectives are still incompatible with library applications (they require a Fortran main program), OpenMP indeed seems to be the most appealing choice.

arjenmarkus commented 3 years ago

Just my two cents here:

The compiler is supposed to be very conservative with DO CONCURRENT. That is, unless it can prove that each iteration is independent of the other iterations, DO CONCURRENT will become an ordinary DO-loop with slightly different syntax. One thing that prevents parallelism in this case is a write statement.
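A minimal sketch of that situation (array names invented): the loop below conforms to the do concurrent rules, but the I/O statement in its body discourages parallel execution on many compilers, and the output order is unspecified anyway.

```fortran
! Sketch: a conforming do concurrent whose body hinders parallelization.
do concurrent (i = 1:n)
  y(i) = 2.0 * x(i)
  write (*, *) 'processed element', i  ! I/O typically forces serial
                                       ! execution; output order is
                                       ! unspecified in any case
end do
```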


awvwgk commented 3 years ago

The compiler is supposed to be very conservative with DO CONCURRENT. That is, unless it can prove that each iteration is independent of the other iterations, DO CONCURRENT will become an ordinary DO-loop with slightly different syntax

That was my initial assumption about do concurrent as well; it turns out that in practice some compilers will parallelize it regardless, which has caused me a lot of problems when using do concurrent so far. This kind of usage is apparently highly non-portable, even if it might be standard conforming.

Therefore, I started this thread with the question of whether we want to use do concurrent only for parallelizable concurrencies in stdlib, i.e. for independent iterations.

milancurcic commented 3 years ago

Yes, I think we should use do concurrent only for independent iterations, and I think we should use it for all independent iterations. And put that in the style guide. :)

do concurrent is useful as code annotation for the programmer, and can help the compiler too. I don't think it's ever harmful if used carefully (in other words, in independent iterations).

With the recent announcement of nvfortran offloading do concurrent to GPUs, there will be more development like this in the near future. do concurrent will only get better as compilers improve.

awvwgk commented 3 years ago

@milancurcic do you have a project where you make use of do concurrent successfully? I am really interested in learning about ways to use it, because my track record with do concurrent so far consists only of failures.

milancurcic commented 3 years ago

Here are some examples: UMWM, neural-fortran. How exactly does do concurrent fail for you?

awvwgk commented 3 years ago

Thanks, this actually looks pretty straightforward. My first encounter with do concurrent led to issues with ifort (https://github.com/dftd4/dftd4/issues/47).

I tried do concurrent in one of my projects yesterday, but the results didn't look promising: I got a major performance regression due to bad scheduling (up to a factor of 4 slower depending on the problem size, but always slower than OpenMP).

milancurcic commented 3 years ago

Yes, I don't think you can rely on do concurrent alone to write parallel code.

For parts of the program that I need to run in parallel, I use coarrays or MPI. That code can also have do concurrent constructs that iterate over the arrays on each image/MPI task. I don't expect these do concurrents to run any faster or slower than the regular do-loops would; if the compiler can do something special about it and make it faster, great, but I don't count on it. And in the future, some compiler may send it to the GPU, which would be great.

The code is also clearer in my opinion: multiply nested loops are more concise when written as do concurrent (same syntax as the obsolescent forall). A do concurrent signals that the iterations are independent. Because I always use it like that, the regular do-loops signal that the iterations are not independent, which helps me understand the code. And when I can write a do concurrent as a whole-array arithmetic operation, I just do that, as it's the most concise form.
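The nested-loop style described above can be sketched like this (array names invented for illustration):

```fortran
! A triple nested loop over independent elements, first as ordinary
! do-loops ...
do k = 1, nk
  do j = 1, nj
    do i = 1, ni
      u(i, j, k) = u(i, j, k) + dt * f(i, j, k)
    end do
  end do
end do

! ... and as a single do concurrent with the same meaning, which also
! signals to the reader that the iterations are independent.
do concurrent (i = 1:ni, j = 1:nj, k = 1:nk)
  u(i, j, k) = u(i, j, k) + dt * f(i, j, k)
end do
```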

jvdp1 commented 3 years ago

Is it a problem if a procedure that includes do concurrent is used in an, e.g., OpenMP region? E.g., what happens if the compiler parallelizes the do concurrent used in stdlib and this function/subroutine is then used inside an OpenMP do loop? If there are no problems, I am fine with using do concurrent inside stdlib. Otherwise, I think we should be careful when we aim to use do concurrent in stdlib.

ivan-pi commented 3 years ago

Here are some examples: UMWM, neural-fortran. How exactly does do concurrent fail for you?

In at least some of your loops you could also use array intrinsics, i.e.

    ! adjust input for opposing winds
    do concurrent (o = 1:om, p = 1:pm, i = istart:iend, ssin(o,p,i) < 0)
      ssin(o,p,i) = ssin(o,p,i) * fieldscale1
    end do

could be replaced with

associate (ssin_view => ssin(1:om, 1:pm, istart:iend))
  where (ssin_view < 0)
    ssin_view = ssin_view * fieldscale1
  end where
end associate

or even with whole-array arithmetic, assuming the ranges span the whole array:

where (ssin < 0)
  ssin = ssin * fieldscale1
end where

Judging by the do concurrent thread at https://github.com/j3-fortran/fortran_proposals/issues/62, it is still a controversial language element. forall was introduced in Fortran 95 and "advertised" as being more versatile than array assignment; in Fortran 2018 it was already marked as obsolescent. Following a rule like

... I think we should use it for all independent iterations. And put that in the style guide. :)

feels quite risky to me at this early stage.

awvwgk commented 3 years ago

or even with whole-array arithmetic assuming the ranges signify the whole array

Keep in mind that do concurrent implies that the concurrency could be parallelized or offloaded, while a where clause does not imply the same. I'd be curious to check how well omp parallel workshare works in this case. Especially in combination with an associate construct, different compilers might not allow using the associated variable in omp clauses. A concurrency seems like a more promising candidate for shared-memory parallelization or offloading in this case.

I'm very open to looking more into parallel concurrencies; the possibility of avoiding pragmas for shared-memory parallelism seems like a very significant advantage to me. I don't see a fundamental issue with do concurrent. Maybe it is more that the user expectation of parallel concurrencies is not yet met by the actual realization in current compiler extensions.
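A hedged sketch of the experiment in question, reusing the ssin/fieldscale1 names from the earlier example (whether compilers actually parallelize this well, especially through an associate construct, is exactly the open question):

```fortran
! Sketch: the masked update wrapped in an OpenMP workshare construct,
! which permits (but does not require) the compiler to divide the
! array assignment among threads.
!$omp parallel workshare
where (ssin < 0)
  ssin = ssin * fieldscale1
end where
!$omp end parallel workshare
```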

So far I have found a few issues that discourage me from using this language feature.

At least the scheduling issue completely blocks the usage of do concurrent in most of my projects, because I seldom deal with a workload that can be effectively distributed with a static schedule. Maybe there is a way to change the scheduling, but I haven't found documentation on this yet. To be honest, none of these are fundamental issues; they have really straightforward solutions, which will sooner or later be realized in the respective compiler extensions.

milancurcic commented 3 years ago

Is it a problem if a procedure that includes do concurrent is used in an, e.g., OpenMP region? E.g., what happens if the compiler parallelizes the do concurrent used in stdlib and this function/subroutine is then used inside an OpenMP do loop? If there are no problems, I am fine with using do concurrent inside stdlib. Otherwise, I think we should be careful when we aim to use do concurrent in stdlib.

@jvdp1 I don't have experience with OpenMP, but it looks like @awvwgk had issues combining OpenMP and do concurrent. It is possible that they don't play well together, although I don't understand why that would be. MPI and coarrays have no such issue. I think it's important for people to be able to use stdlib functions in OpenMP regions. If do concurrent causes issues there, then we shouldn't do it.

Judging by the do concurrent thread at j3-fortran/fortran_proposals#62 it is still a controversial language element.

It's controversial only in the context of do concurrent as a parallel feature. The rules of do concurrent do not guarantee that it can be parallelized. In other words, it's possible to write valid do concurrent loops whose iterations are not independent. But here we're discussing using do concurrent only where the programmer knows that the iterations are independent, which avoids the pitfall from that thread.

forall was introduced in Fortran 95 and "advertised" as being more versatile than array assignment. In Fortran 2018 it was already marked as obsolescent.

forall became redundant with the introduction of do concurrent.

Following a rule like

... I think we should use it for all independent iterations. And put that in the style guide. :)

feels quite risky to me at this early stage.

What do you think is risky about it?

A counter-argument: If we use it early and liberally, we'll increase its surface area (i.e., the number of users that rely on it), which will incentivize vendors to make it better (e.g., more stable, better performance, offloading to various GPUs, etc.). We're in the experimental stage, so if there are problems it will be easy to back away from it.

Keep in mind that do concurrent implies that the concurrency could be parallelized or offloaded, while a where clause does not imply the same.

@awvwgk Is that really true? I think all where constructs and whole-array operations can be parallelized or offloaded.

awvwgk commented 3 years ago

Keep in mind that do concurrent implies that the concurrency could be parallelized or offloaded, while a where clause does not imply the same.

@awvwgk Is that really true? I think all where constructs and whole-array operations can be parallelized or offloaded.

Regarding OpenMP, it might be a construct that works with omp parallel workshare; still, the branching condition makes it rather difficult to vectorize or offload. I haven't seen where constructs discussed in the context of concurrencies so far; in case they belong to the same category, they are probably incompatible with OpenMP. This sounds like something to discuss on discourse (thread).

I don't have experience with OpenMP, but it looks like @awvwgk had issues combining OpenMP and do concurrent. It is possible that they don't play well together, although I don't understand why that would be.

So far, OpenMP and do concurrent don't play well together from a practical point of view. For example, you can arrive at a point where, for one compiler, a variable in a concurrency must still be declared in the omp pragma, while for another, this variable is local to the concurrency and therefore must not be declared in the omp pragma.
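A sketch of the situation described above (all names invented). Without an explicit locality, compilers disagree on whether tmp must also appear in the OpenMP private clause; giving it an explicit Fortran 2018 locality is one way to take the decision away from the compiler:

```fortran
! Sketch: a do concurrent nested inside an OpenMP parallel loop.
real :: tmp
integer :: i, j
!$omp parallel do
do i = 1, n
  ! local(tmp) makes tmp private to each concurrent iteration, so
  ! there is no race even though tmp is declared at outer scope;
  ! without it, one compiler may require private(tmp) on the omp
  ! pragma while another forbids it.
  do concurrent (j = 1:m) local(tmp)
    tmp = a(i, j)**2
    b(i, j) = tmp + c(j)
  end do
end do
!$omp end parallel do
```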

hsnyder commented 3 years ago

In applications where the programmer has gone to great lengths to parallelize code themselves, they may not want stdlib library functions to have their own internal parallelism. For example, I'm working on an application where I've carefully arranged teams of threads and pinned them to specific cores, with one such team per NUMA node on the machine. Work items are dealt out by team, with each team collaborating on a given work item. It's extremely fast, but took some careful planning. I'd be annoyed if I were calling library functions that were internally creating their own threads.

The above is perhaps not the most typical use case, but I'd be in favour of using OpenMP or do concurrent inside stdlib only if it can be turned off at build time. That way, even if it's on by default, I could rebuild stdlib and disable parallelism when I want to manage that myself.
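Such a build-time switch could be sketched with a preprocessor guard (the macro name STDLIB_USE_DO_CONCURRENT and the function f are hypothetical):

```fortran
! Sketch: select the parallelizable concurrency or a plain serial loop
! at build time; projects managing their own threading would build
! stdlib without the macro defined.
#ifdef STDLIB_USE_DO_CONCURRENT
  do concurrent (i = 1:n)
    y(i) = f(x(i))
  end do
#else
  do i = 1, n
    y(i) = f(x(i))
  end do
#endif
```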