AMReX-Codes / amrex

AMReX: Software Framework for Block Structured AMR
https://amrex-codes.github.io/amrex
Other
546 stars 347 forks source link

Advection_F Tutorial runtime error with DEBUG #337

Closed kweide closed 5 years ago

kweide commented 6 years ago

Runtime failure of Advection_F compiled with debugging:

klaus@asterix:/home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex$ OMP_NUM_THREADS=1 ./main2d.gnu.DEBUG.MPI3.OMP.ex inputs
MPI initialized with 1 MPI processes
OMP initialized with 1 OMP threads
Successfully read inputs file ... 

 STEP           1 starts ...
At line 461 of file ../../../../../Src/Base/AMReX_mempool_mod.F90
Fortran runtime error: Index '1' of dimension 4 of array 'a' below lower bound of 4294967295

Some info on the environment:

 klaus@asterix:/home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex$ git branch -vv
* development       ce1bc98 [origin/development] Merge branch 'development' of github.com:AMReX-Codes/amrex into development
  .....
klaus@asterix:/home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex$ type mpicxx mpif90
mpicxx is /opt/openmpi-1.8.6_gcc/bin/mpicxx
mpif90 is /opt/openmpi-1.8.6_gcc/bin/mpif90
klaus@asterix:/home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex$ mpicxx --version
g++ (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

klaus@asterix:/home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex$ mpif90 --version
GNU Fortran (Ubuntu 4.8.4-2ubuntu1~14.04.3) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

Modified GNUmakefile by hand.

kweide commented 6 years ago

Stack trace from gdb for additional information:

klaus@asterix:/home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex$ OMP_NUM_THREADS=1 gdb ./main2d.gnu.DEBUG.MPI3.OMP.ex
GNU gdb (Ubuntu 7.7.1-0ubuntu5~14.04.2) 7.7.1
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./main2d.gnu.DEBUG.MPI3.OMP.ex...done.
(gdb) set args inputs
(gdb) b exit
Breakpoint 1 at 0x40ea60
(gdb) r
Starting program: /home/aladin/klaus/projects/amrex/Tutorials/Amr/Advection_F/Exec/SingleVortex/main2d.gnu.DEBUG.MPI3.OMP.ex inputs
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff45ef700 (LWP 1820)]
MPI initialized with 1 MPI processes
OMP initialized with 1 OMP threads
Successfully read inputs file ... 

 STEP           1 starts ...
At line 461 of file ../../../../../Src/Base/AMReX_mempool_mod.F90
Fortran runtime error: Index '1' of dimension 4 of array 'a' below lower bound of 4294967295

Breakpoint 1, __GI_exit (status=2) at exit.c:104
104 exit.c: No such file or directory.
(gdb) where
#0  __GI_exit (status=2) at exit.c:104
#1  0x00007ffff73a624b in _gfortran_runtime_error_at () from /usr/lib/x86_64-linux-gnu/libgfortran.so.3
#2  0x0000000000601672 in amrex_mempool_module::bl_deallocate_r4 (a=...) at ../../../../../Src/Base/AMReX_mempool_mod.F90:461
#3  0x000000000066732b in amrex_fab_module::amrex_fab_destroy (fab=...) at ../../../../../Src/F_Interfaces/Base/AMReX_fab_mod.F90:89
#4  0x0000000000667b9e in amrex_fab_module::amrex_fab_build_alloc (fab=..., bx=..., nc=2) at ../../../../../Src/F_Interfaces/Base/AMReX_fab_mod.F90:46
#5  0x000000000066647a in amrex_fab_module::amrex_fab_resize (this=..., bx=..., nc=2) at ../../../../../Src/F_Interfaces/Base/AMReX_fab_mod.F90:126
#6  0x00000000005b5c85 in __compute_dt_module_MOD_est_timestep._omp_fn.0 () at ../../../../../Tutorials/Amr/Advection_F/Source/compute_dt_mod.F90:70
#7  0x00000000005b4d7b in compute_dt_module::est_timestep (lev=0, time=0) at ../../../../../Tutorials/Amr/Advection_F/Source/compute_dt_mod.F90:65
#8  0x00000000005b5150 in compute_dt_module::compute_dt () at ../../../../../Tutorials/Amr/Advection_F/Source/compute_dt_mod.F90:24
#9  0x00000000005aef14 in evolve_module::evolve () at ../../../../../Tutorials/Amr/Advection_F/Source/evolve_mod.F90:33
#10 0x00000000005a665a in MAIN__ () at ../../../../../Tutorials/Amr/Advection_F/Source/fmain.F90:19
(gdb) 
WeiqunZhang commented 6 years ago

This appears to be a gnu compiler bug in OpenMP. In Tutorials/Amr/Advection_F/Source/compute_dt_mod.F90, function est_timestep (lev, time), I modified the code to be

    call amrex_print(u%bx)
    print *, "nc = ", u%nc
    print *, "p ", associated(u%fp)

    !$omp parallel reduction(min:dt_est) private(umax,bx,u,mfi,p)
    call amrex_print(u%bx)
    print *, "nc = ", u%nc
    print *, "p ", associated(u%fp)

    call amrex_mfiter_build(mfi, phi_new(lev), tiling=.true.)
    do while(mfi%next())
       bx = mfi%nodaltilebox()

I also modified Src/F_Interfaces/Base/AMReX_fab_mod.F90 to make fp public.

I ran with one thread, and I got

  STEP           1 starts ...
((1,1) (1,1) (0,0))
 nc =            0
 p  F
((71081616,32766) (32592,538976288) (1,1))
 nc =            0
 p  T

So immediately after entering the omp parallel region, the private type(amrex_fab) u became garbage. This seems to be a compiler bug.

kweide commented 6 years ago

Thanks for looking into this!

What compiler version were you using?

WeiqunZhang commented 6 years ago

gcc 5.4.0

kweide commented 6 years ago

Hmm, isn't it completely expected that u becomes (potentially) garbage after the !$om parallel? After all, it is listed in a private clause, not firstprivate.

WeiqunZhang commented 6 years ago

amrex_fab is defined as

     type(amrex_box) :: bx
     integer         :: nc = 0
     logical, private :: owner = .false.
     type(c_ptr), private :: cp = c_null_ptr
     real(amrex_real), private, pointer, dimension(:,:,:,:) :: fp => null()

So I think they should be initialized.

WeiqunZhang commented 6 years ago

My understanding of Fortran standard could be wrong. But if I were wrong on this, the standard were seriously flawed, in my opinion.

WeiqunZhang commented 6 years ago

I added a new function to amrex_fab. We can call u%reset_omp_private() immediately after omp parallel. Note that this a very dangerous function. It will cause memory leak if it is called on fab that has memory already allocated.

Could you test to see if it works?

WeiqunZhang commented 6 years ago

By the way, USE_MPI3 has not been tested thoroughly. So I do not recommend using it.

kweide commented 6 years ago

Thanks, I will do some testing later.

One would think that those default values in the declaration of type(amrex_fab) should result in a well-defined state of the the 'privatized' copies of u, but apparently not so... I found these words in the OpenMP 4.5 specification under 2.15.3.3 (same text in current draft for OpenMP 5.0 under 2.22.2), and I think they apply here:

If any statement of the construct references a list item, a new list item of the same type and type parameters is allocated. This allocation occurs once for each task generated by the construct and once for each SIMD lane used by the construct. The initial value of the new list item is undefined. The initial status of a private pointer is undefined.

This is a Fortran-specific section.

Maybe your (and my) expectations are shaped by what would happen in C++ with in an analogous situation. The standard documents state in C/C++ section, just a few lines above the quoted text:

The new list item is initialized, or has an undefined initial value, as if it had been locally declared without an initializer.

WeiqunZhang commented 6 years ago

I think you are right.

Later in that section,

Finalization of a list item of a finalizable type or subojects of a list item of a finalizable type occurs at the end of the region.

So private variables are finalized but not initialized. The finalizer could be called on uninitialized objects. Don't understand why it is this way.

It's just funny that C++ is safer than Fortran.

kweide commented 6 years ago

Later in that section,

Finalization of a list item of a finalizable type or subojects of a list item of a finalizable type occurs at the end of the region.

So private variables are finalized but not initialized. The finalizer could be called on uninitialized objects. Don't understand why it is this way.

Well, Fortran now has final routines but not really constructors, at least not in the same sense as C++ ...

Since OpenMP says the "initial value ... is undefined", I guess it is okay for a compiler to do the expected thing and define them anyway. Perhaps many compilers do. Maybe there is even a flag to turn this on for GFortran?

It's just not okay for code to assume that the compiler implements the expected initialization behavior.

kweide commented 6 years ago

Could you test to see if it works?

Works for me now.

(Same configuration as before; tried OMP_NUM_THREAD=1 and > 1, number of MPI tasks = 1 and > 1.)

kweide commented 6 years ago

I have tried two alternative ways to avoid the "undefined after privatization" situation. Both appear to work (in Advection_F, as before), no calls to reset_omp_private() necessary.

  1. Declare the affected amrex_fab variables as firstprivate. Changed line, in the case of evolve_mod.F90:
    !$omp parallel private(mfi,bx,tbx,pin,pout,pux,puy,puz,pfx,pfy,pfz,pf,pfab,particles) firstprivate(uface,flux)
  2. Declare the affected amrex_fab variables as threadprivate. Changed lines:
    type(amrex_fab),save :: uface(amrex_spacedim)
    type(amrex_fab),save ::  flux(amrex_spacedim)
    !$omp threadprivate(uface,flux)
    ...
    !$omp parallel private(mfi,bx,tbx,pin,pout,pux,puy,puz,pfx,pfy,pfz,pf,pfab,particles)

I kinda like the firstprivate() way. What do you think?

WeiqunZhang commented 6 years ago

This would depends on how firstprivate variables are initialized. If all it does is memcpy, that will be fine. But if they call assignment(=) of amrex_fab, then we will have problems because it will try to free an uninitialized pointer. I am not sure what the OpenMP standard says.

kweide commented 6 years ago

From OpenMP 4.5, 2.15.3.4:

25 If the original list item does not have the POINTER attribute, initialization of the new list items
26 occurs as if by intrinsic assignment, unless the original list item has the allocation status of not
27 currently allocated, in which case the new list items will have the same status.

:+1:

From openmp-TR7.pdf (draft OpenMP 5.0), 2.22.4.4:

If the original list item does not have the POINTER attribute, initialization of the new list items occurs as if by intrinsic assignment unless the list item has a type bound procedure as a defined assignment. If the original list item that does not have the POINTER attribute has the allocation status of unallocated, the new list items will have the same status.

:grimacing: The draft doesn't seem to say what happens when the "list item" does have "a type bound procedure as a defined assignment", at least I didn't find it.