FEST3D / FEST-3D

The source code for Finite volume Explicit STructured 3Dimensional (FEST-3D) solver.
https://fest3d.github.io/index.html
GNU General Public License v3.0

Please consider updating to Fortran 2018 #7

Open rouson opened 4 years ago

rouson commented 4 years ago

Numerous aspects of the coding style and structure in this repository are archaic. There have been four revisions of the Fortran standard in the three decades since Fortran 90 and two revisions of the MPI standard in the more than two decades since MPI-1. Exploiting newer Fortran standards could make FEST-3D code more clear, more concise, potentially more robust, and might even aid performance.

The list of updates I could suggest grows with every file I open so the following list is far from exhaustive:

  1. Declarations such as integer(kind=4) are not portable across compilers and not even guaranteed to be the same for the same compiler across different platforms or different compiler versions. If a specific precision is desired, the selected_int_kind() function is the preferred method. If a specific width in memory is desired, the more recent standards provide several kind parameters that can be accessed via use association (e.g., use iso_fortran_env, only : int32).
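A minimal sketch of both approaches (variable names are illustrative, not taken from FEST-3D):

```fortran
program kinds_demo
  use iso_fortran_env, only : int32, real64
  implicit none
  ! selected_int_kind(12): smallest integer kind with at least 12 decimal digits
  integer, parameter :: big = selected_int_kind(12)
  integer(int32) :: counter            ! exactly 32 bits, replacing integer(kind=4)
  integer(big)   :: cell_count
  real(real64)   :: residual           ! 64-bit real, replacing real(kind=8)
  counter = 0_int32
  cell_count = 123456789012_big
  residual = 0.0_real64
  print *, storage_size(counter)       ! 32 bits on every conforming compiler
end program kinds_demo
```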

  2. In many places, nested loops could be collapsed to a single line using an array statement. Besides the benefits in clarity and conciseness some compilers actually optimize array statements better than do loops. This stems partly from the fact that do loops are inherently serial even though compilers might choose to vectorize the loops at high levels of optimization. Array statements have been available since Fortran 90. On a related note, I recommend reviewing modern Fortran's rich set of intrinsic functions for manipulating arrays (merge, pack, findloc, etc.). These functions can often be used to collapse multiple statements into one and might perform better by letting the compiler choose the underlying algorithm.
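For illustration, a triply nested i-j-k loop computing, say, a velocity field collapses to a single array statement, and intrinsics like merge and maxval replace further loop blocks (names are hypothetical):

```fortran
program array_demo
  implicit none
  integer, parameter :: nx = 4, ny = 4, nz = 4
  real :: density(nx,ny,nz), momentum(nx,ny,nz), velocity(nx,ny,nz)
  density  = 2.0
  momentum = 6.0
  velocity = momentum / density                     ! replaces a triply nested do loop
  velocity = merge(velocity, 0.0, velocity > 0.0)   ! clamp negatives to zero in one line
  print *, maxval(velocity)                         ! intrinsic reduction over the whole array
end program array_demo
```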

  3. Many loops that can't be converted to array statements can be made concurrent. This both has potential performance benefits (again because it doesn't impose unnecessary ordering of loop iterations) and it makes nested loops more concise because they can be collapsed with a syntax such as do concurrent( i=1:nx, j=1:ny, k=1:nz). Do concurrent has been available since Fortran 2008.
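The collapsed syntax looks like this (array and bounds are illustrative):

```fortran
program concurrent_demo
  implicit none
  integer, parameter :: nx = 8, ny = 8, nz = 8
  real :: flux(nx,ny,nz)
  integer :: i, j, k
  ! One header replaces three nested do statements, and the iterations
  ! are declared order-independent, so the compiler may parallelize them:
  do concurrent (i = 1:nx, j = 1:ny, k = 1:nz)
    flux(i,j,k) = real(i + j + k)
  end do
  print *, flux(nx,ny,nz)
end program concurrent_demo
```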

  4. Building MPICH with gfortran version 9 or higher installs the mpi_f08 module, which provides a more modern interface to MPI, starting with use mpi_f08 rather than accessing MPI via include 'mpif.h'. Despite the module name, this actually uses a feature set that entered the language in Fortran 2018.
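A minimal sketch of the modern binding, assuming an MPI library built with mpi_f08 support:

```fortran
program mpi_f08_demo
  use mpi_f08        ! typed handles (MPI_Comm, MPI_Request) replace bare integers
  implicit none
  integer :: rank
  call MPI_Init()                            ! the ierror argument is optional in mpi_f08
  call MPI_Comm_rank(MPI_COMM_WORLD, rank)
  if (rank == 0) print *, 'mpi_f08 interface in use'
  call MPI_Finalize()
end program mpi_f08_demo
```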

  5. Fortran 2003 added support for deferred-length character variables declared as character(len=:), allocatable :: string, obviating the need for fixed-length character variables in many places.
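A deferred-length string sizes itself on assignment, so no length needs to be guessed in advance:

```fortran
program string_demo
  implicit none
  character(len=:), allocatable :: string
  string = 'outflow'          ! automatically allocated with len = 7
  string = string // '_bc'    ! reallocated to len = 10 on assignment
  print *, len(string), string
end program string_demo
```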

  6. Even since Fortran 90, the language has supported derived types, which can be used to package related data and thereby reduce the length of argument lists.
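For example, a derived type can bundle related flow-field arrays so procedures take one argument instead of many (the type and component names below are hypothetical, not from FEST-3D):

```fortran
program derived_type_demo
  implicit none
  type flow_field
    real, allocatable :: density(:,:,:), pressure(:,:,:)
  end type flow_field
  type(flow_field) :: field
  allocate(field%density(4,4,4), field%pressure(4,4,4))
  call initialise(field)      ! one argument instead of a long list of arrays
  print *, field%pressure(1,1,1)
contains
  subroutine initialise(f)
    type(flow_field), intent(inout) :: f
    f%density  = 1.0
    f%pressure = 101325.0
  end subroutine initialise
end program derived_type_demo
```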

  7. What is most shocking, however, is the number of subroutines in FEST-3D that have no arguments. This, combined with the long lists of use statements, suggests that a great number of the variables are widely accessed outside the module in which they are declared. This coding style can quite easily lead to a difficult-to-decipher interconnectedness of data dependencies that is truly daunting for anyone new to the code.

  8. This code has dozens of explicit deallocate statements. Are these necessary? If the memory was allocated via an allocatable variable, then the compiler is obligated to free the memory when the variable goes out of scope. In general, it's best to let the compiler do its job unless you have data demonstrating that your explicit deallocation improves performance or significantly reduces the overall memory footprint. (Even if you deallocate memory, I don't think the compiler is obligated to return the memory to the operating system until the program ends anyway.)

  9. This code makes considerable use of pointer variables. Any large Fortran project that makes extensive use of pointer variables should articulate a very clear policy for how the pointer variables are to be used, including how the project will avoid memory leaks and dangling pointers. In almost all circumstances, if a pointer is being used to allocate memory, for example, it should be replaced by an allocatable variable (scalar or array) and a reference-counting scheme should be incorporated and enforced. It appears that the target variables in FEST-3D are all allocatable arrays. Given that you're doing explicit deallocations, what is your strategy for ensuring that no pointer is ever associated with a deallocated or unallocated array? Extensive use of pointer variables raises many questions that should be addressed thoroughly in documentation.
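A small illustration of why allocatable variables need neither explicit cleanup nor a dangling-pointer policy:

```fortran
program allocatable_demo
  implicit none
  call compute_step
contains
  subroutine compute_step
    real, allocatable :: scratch(:)
    allocate(scratch(100))
    scratch = 0.0
    print *, size(scratch)
    ! No deallocate needed: scratch is freed automatically on return,
    ! and no pointer exists that could dangle afterwards.
  end subroutine compute_step
end program allocatable_demo
```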

  10. Lastly, I'm very curious about your choice to use two-sided blocking MPI_Send and MPI_Recv. Is there a reason for not using non-blocking MPI_Isend and MPI_Irecv? More importantly, why use MPI at all in a new project? Fortran has been parallel since the 2008 standard and gfortran supports all Fortran 2008 parallel features as well as most Fortran 2018 parallel features. The gfortran parallel runtime library (OpenCoarrays) uses the newer one-sided MPI-3 MPI_Put and MPI_Get, which can outperform even non-blocking MPI in some applications. So besides not having to write your own MPI, you'll get more advanced MPI on the back-end if you let the compiler and runtime library generate the MPI calls for you.
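A sketch of the coarray alternative; this requires coarray support (e.g., gfortran with OpenCoarrays, or the -fcoarray=single flag for one image), and the variable name is illustrative:

```fortran
program coarray_demo
  implicit none
  real :: halo[*]                 ! one copy of halo per image (process)
  halo = real(this_image())
  sync all                        ! barrier before any cross-image access
  if (this_image() == 1 .and. num_images() > 1) then
    print *, halo[2]              ! one-sided get from image 2; the runtime
  end if                          ! generates the underlying MPI-3 calls
end program coarray_demo
```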

rouson commented 4 years ago

I updated the comment above to eliminate several duplicate items that were caused by an inadvertent copy and paste error.

rouson commented 4 years ago

Other Fortran intrinsic functions that will likely greatly reduce the size of your code are count and sum. Please familiarize yourself with these and use them liberally unless you have a very strong motivation for doing otherwise, such as solid performance data showing they slow down your application. For example, subroutine apply_interface appears to be not much more than hundreds of lines of code blocks comprised of loops nested 4 levels deep for no purpose other than computing a sum that could presumably be computed in a single line of code by invoking the sum function. You could achieve an order-of-magnitude reduction in code length for this subroutine alone. The change has important ramifications:

  1. Code clarity: readers take more time to figure out what you're doing when they have to read 10 lines of code versus 1 -- especially because the reader will likely give you the benefit of the doubt and assume there is some good reason you chose the more verbose approach and will spend time trying to figure out the reason (I can't find one in this case).
  2. Speed: As mentioned elsewhere, loops are explicitly serial by definition. At high levels of optimization, a smart compiler might recognize the pattern in your code and replace it with faster code (likely by invoking sum), but it's generally better to give the compiler a high-level description of what you want (an arithmetic sum) and let the compiler choose the fastest way, which in this case could involve vectorization, itself a form of parallelism.
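As a minimal illustration of the pattern (the array name and shape here are hypothetical, not taken from apply_interface):

```fortran
program sum_demo
  implicit none
  real :: residual(4,4,4,5)
  residual = 1.0
  ! One intrinsic call replaces a four-level nested accumulation loop:
  print *, sum(residual)          ! total over all 4*4*4*5 elements
end program sum_demo
```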

People often talk about how to parallelize an algorithm. I frequently point out that so much of nature, and even so much of what we do when programming, is inherently parallel, so the more important question is "Why serialize the algorithm?"

jayten commented 4 years ago

@rouson

  1. The module which uses integer(kind=) syntax is already not being used by any other module. I must have missed that file while cleaning up the code. I will remove it.

  2. Since FEST-3D is a three-dimensional code, three-level nested loops (i-j-k) are found everywhere in the code, one loop for each direction. In our early timing study, we got around a 50% speedup with the optimization flag -O3 and nested loops, which made the speedup from array manipulation insignificant. Also, more importantly, the use of array manipulation would lead to significant memory consumption by the temporary variables required by the complex algorithms. So, we chose to use nested loops for large algorithms to save on memory consumption. However, I agree that for simple algorithms, array manipulation is a better choice. I will go through the code and try to replace nested loops with array manipulation wherever possible.

  3. As far as I can understand, do concurrent allows the loops to run in parallel. In order to make the loops fast, we have started working on GPU parallelization of the loops. The code is already CPU-parallelized, with the background mesh decomposed over the different processors. We believe using do concurrent will not add any additional benefit.

  4. We are using the Ubuntu 18.04 (LTS) platform for development purposes, and gfortran 9 is not available on this platform. Our system will be updated when the next LTS version of Ubuntu is released. We will start porting FEST-3D to Fortran 2018 once those updates are available. The use of derived data types is a major change for FEST-3D and will be included in this porting.

  5. For most of the cases, I have used the assumed-length syntax (len=*) rather than fixed-length strings. If I have made a mistake in this regard, I will find it and fix it.

  6. and 7. Most of the variables related to fluid flow are declared in "src/global/global_vars.f90". The idea was that a single file gives the user one place to check the declaration and definition of any variable used. Even each declared pointer's use is documented in the comments in "global_vars.f90". Most of the variable names are clear enough that the reader does not have to go back to check the definition. Variables required by a module are imported via the use statement with "only", to make clear which variables the module uses; that is why there are long lists of use statements. This is also the reason why most procedure calls have no arguments, and why we were not able to write unit tests. But it saves us from passing long lists of variables as arguments across the different procedures. However, I can see your point that this use of global variables can have its own side effects. I will try to change this by making variables local to the modules.

  8. Deallocating and closing of files is done only at the end of the simulation, so it does not add any overhead cost. It may be that most compilers take care of this automatically. However, we maintained symmetry in our code: what has been "setup" must be "destroyed" at the end of the simulation.

  9. The pointers here are used to give aliases to sections of the contiguous allocated memory of all the state variables used by the code. This helps us write a more readable form of the equations used by the different schemes. As stated in the previous point, there is no deallocation of memory before all the iterations of the simulation are finished.
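(For reference, the associate construct, standard since Fortran 2003, gives the same kind of readable alias into a contiguous array without pointer semantics, so nothing can dangle. The packed-array layout below is hypothetical:)

```fortran
program associate_demo
  implicit none
  real :: state(4,4,4,5)           ! hypothetical packed state array
  state = 1.0
  associate (density => state(:,:,:,1), pressure => state(:,:,:,5))
    pressure = 0.4 * density       ! readable equation form, no pointers involved
  end associate
  print *, state(1,1,1,5)
end program associate_demo
```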

  10. Sometimes, when we tested the code with non-blocking send/receive, we encountered errors related to memory. Later we found that the problem was related to the buffer used with non-blocking send/receive. So we have mostly used blocking send/receive.

rouson commented 4 years ago

Regarding Fortran standards

Each Fortran standard is very nearly backwards compatible. Very few features get deleted or modified, so Fortran 90 is very nearly a subset of Fortran 2018, and I don't see any obvious cases where you're using features that have been deleted from the language. In this sense, you're already writing Fortran 2018. The key question is why restrict yourself to a subset of the language that is nearly 30 years old? Unless the Fortran committee has wasted the last 3 decades, there almost have to be features of value that you miss by such a severe restriction. Based on my review so far, I believe there are many, and I suspect my list would grow if I reviewed more of the code. I also think it does the Fortran community a disservice to write new code in such an old subset of the language. For one thing, it limits the community of people who will be interested in contributing to the project. When I see projects writing new code in Fortran 90, I ask: how will you attract developers under the age of 30 when you tell them they can't use any language features that were invented in their entire lifetime? That can be a hard sell.

Regarding performance and/or memory usage

I never believe performance claims or memory utilization claims unless they come with specific empirical data about the subset of code being discussed and its impact on the overall application performance. Before asking someone to read through hundreds of lines of loops nested 4 levels deep that could be made roughly 90% shorter by simply invoking the sum intrinsic function or another intrinsic function, I would want to see a statement such as "This specific block of code occupies x% of the overall application runtime in important use cases y and is sped up by z% by writing nested loops instead of simply calling sum."

Regarding compiler versions

Regarding MPI

I'm not an MPI expert so I'll defer to your knowledge about how non-blocking communication has performed in your application. An even more important point is that it's not clear to me why new software would embed MPI of any form in the source code when Fortran has had its own parallel programming abstractions since Fortran 2008. If you go the above route to install gfortran, you can also use the same installer to install OpenCoarrays, which then gives you access to parallel features that completely obviate the need to embed MPI in your source. The parallel runtime library can generate the MPI commands for you. Most importantly, it will generate MPI-3 one-sided communication, which is more advanced than even the aforementioned non-blocking communication. This might or might not speed up your application, but it's certainly a good place to start when writing new code. Moreover, it opens up the possibility to swap MPI for another parallel programming model such as OpenSHMEM without changing even one line of source code.

I wouldn't write any of what I'm writing if this were an old project building on decades-old legacy code. I just don't think new code should be written in the manner I see in this project.