EDmodel / ED2

Ecosystem Demography Model

Memory errors when running ED-2.2 #327

Open julianpistorius opened 3 years ago

julianpistorius commented 3 years ago

Running it on University of Arizona HPC, using a Singularity container.

...
- Simulating:   04/29/2008 00:00:00 UTC
 === Time integration ends; Total elapsed time=     8012.8  ===
 ------ ED-2.2 execution ends ------
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
*** Error in `/usr/local/bin/ed.2.2.0': corrupted size vs. prev_size: 0x000055559fb209a0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x2aaaacf4abfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x2aaaacf50fc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x773b8)[0x2aaaacf513b8]
/lib/x86_64-linux-gnu/libc.so.6(+0x78dfa)[0x2aaaacf52dfa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x2aaaacf54f64]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.100(H5P_close+0x1d1)[0x2aaaab2f5871]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.100(+0x156a71)[0x2aaaab284a71]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.100(H5SL_try_free_safe+0x6f)[0x2aaaab332bbf]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.100(H5I_clear_type+0xa1)[0x2aaaab2853a1]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.100(H5P_term_package+0x55)[0x2aaaab2f5125]
/usr/lib/x86_64-linux-gnu/libhdf5_serial.so.100(H5_term_library+0x42d)[0x2aaaab170e7d]
/lib/x86_64-linux-gnu/libc.so.6(+0x35940)[0x2aaaacf0f940]
/lib/x86_64-linux-gnu/libc.so.6(+0x3599a)[0x2aaaacf0f99a]
/usr/lib/x86_64-linux-gnu/libgfortran.so.3(_gfortran_stop_string+0x49)[0x2aaaac2586f9]
/usr/local/bin/ed.2.2.0(+0x886d)[0x55555555c86d]
/usr/local/bin/ed.2.2.0(+0x7d5f)[0x55555555bd5f]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x2aaaacefa2e1]
/usr/local/bin/ed.2.2.0(+0x7d8a)[0x55555555bd8a]

Full logfile.txt, and scripts used to run the job: https://gist.github.com/julianpistorius/e120e6d573f68f5fea0f4bf3dce2bd1b

The HPC consulting team suggested using more memory for the job (I used 12 GB most recently, and currently have a job running with 120 GB).

xiangtaoxu commented 3 years ago

The error message suggests a floating-point exception caused by division by zero, but it gives no clue about where in the model the error happened. I think the only way forward is to reproduce the error (e.g. restart from the closest -S- file and run with an executable compiled in debugging mode, such as -k C). It will be much easier to identify the problem once we know which code caused the error.

As for memory usage, ED2 is not really memory intensive. A single one-site simulation with <1000 cohorts takes me less than 1 GB of memory.

julianpistorius commented 3 years ago

Thanks @xiangtaoxu. I'll talk to the team.

mpaiao commented 3 years ago

Looking at the log file, though, it seems that the simulation reached the end, at least based on this message:

=== Time integration ends; Total elapsed time=     8012.8  ===
 ------ ED-2.2 execution ends ------

In any case, I agree that it may be good to run the simulation with debugging options to see if the error messages appearing in the end are caused by some floating point exception in ED2 for your settings. Also, I noticed that you are using XML to set parameters, so maybe it is worth checking that all parameters needed to redefine the PFTs are set (some other ED2 folks more familiar with the XML interface may be able to give more up-to-date insights).

julianpistorius commented 3 years ago

Thank you @mpaiao. Hopefully somebody familiar with the XML could shine light on this.

mpaiao commented 3 years ago

@julianpistorius I saw your email on the warning messages in utils_c.c, but I don't see it on the GitHub issue. In any case, I normally wouldn't advise ignoring warnings, but utils_c.c is a legacy set of ancillary functions borrowed from BRAMS (atmospheric model) that no one is really developing in ED2. If the model compilation doesn't give errors, then I would ignore these warnings (it is a different story if the warning messages show up in the Fortran code).

julianpistorius commented 3 years ago

Thank you @mpaiao! Yes I deleted the message, because I realized they were just warnings, and didn't prevent the binary from being created.

I now have a Docker image with the debug version of ED2: https://hub.docker.com/r/jpistorius/model-ed2-2.2.0

I turned that into a Singularity image. Unfortunately, when I tried to run it on one of our HPC systems here at Arizona, it failed:

FATAL: kernel too old
ERROR IN MODEL RUN

The kernel is a bit old:

$ uname -r
2.6.32-754.35.1.el6.x86_64

Now I'm trying to run it on our other HPC system which at least has a 3.x series kernel:

$ uname -r
3.10.0-1160.11.1.el7.x86_64

If that still doesn't work I'm going to use a large OpenStack virtual machine (or bare metal node) with a recent kernel, just to get the debug output.

Will update here with progress.

Update: The Singularity image with the debug binary is working on the other HPC cluster. Running now. Will hopefully soon have more useful error messages.
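A kernel check like the one above could be scripted before submitting a job. The following is a minimal Python sketch that parses a `uname -r` style release string and compares it with a minimum version. The function names and the (3, 2, 0) threshold are assumptions for illustration; the actual minimum depends on the C library built into the container image.

```python
import re


def kernel_tuple(release):
    """Extract (major, minor, patch) from a release string such as
    '3.10.0-1160.11.1.el7.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part or 0) for part in match.groups())


def kernel_at_least(release, minimum=(3, 2, 0)):
    """True if the kernel release is at least `minimum`.

    The default threshold is a hypothetical value, not one stated in
    this thread.
    """
    return kernel_tuple(release) >= minimum


# The two release strings reported in this thread.
for release in ("2.6.32-754.35.1.el6.x86_64", "3.10.0-1160.11.1.el7.x86_64"):
    print(release, kernel_at_least(release))
```

On a live system, `platform.release()` from the standard library returns the same string that `uname -r` prints.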

julianpistorius commented 3 years ago

I'll have to try compiling ED2 again. The output I got is actually less useful than what I had before:


 - Simulating:   04/29/2008 00:00:00 UTC
 === Time integration ends; Total elapsed time=     7576.8  ===
 ------ ED-2.2 execution ends ------
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_OVERFLOW_FLAG IEEE_UNDERFLOW_FLAG IEEE_DENORMAL
free(): invalid next size (fast)

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x2b5c8465fd3a
#1  0x2b5c8465eed5
#2  0x2b5c84ae520f
#3  0x2b5c84ae518b
#4  0x2b5c84ac4858
#5  0x2b5c84b2f3ed
#6  0x2b5c84b3747b
#7  0x2b5c84b38d2b
#8  0x2b5c84d9ab8f
#9  0x2b5c84eb8cb7
#10  0x2b5c84eb9d5e
#11  0x2b5c84eba055
#12  0x2b5c84de9de2
#13  0x2b5c84ea3ad9
#14  0x2b5c84dea854
#15  0x2b5c84eb944d
#16  0x2b5c84cca738
#17  0x2b5c84ae8a26
#18  0x2b5c84ae8bdf
#19  0x2b5c84661ef4
#20  0x55a5aaf8a65b
#21  0x55a5aaf89abe
#22  0x2b5c84ac60b2
#23  0x55a5aaf89aed
#24  0xffffffffffffffff

mpaiao commented 3 years ago

@julianpistorius Double-check that you have the traceback option enabled when compiling the code (-fbacktrace if you are using gfortran, -traceback if using ifort). If you already have this option, then I suspect that the error messages are coming from HDF5 (!), not ED2, at least based on your first post.

julianpistorius commented 3 years ago

@mpaiao I did not have -fbacktrace before. I am building it now, and I'm using mpif90. 🤞

This is my install.sh command: ./install.sh -k A -g -p VM

Here's my /src/ED2/ED/build/make/include.mk.VM file:

#Makefile include include.mk.opt.ubuntu
############################################################################

# Define make (gnu make works best).
MAKE=/usr/bin/make

# libraries.
BASE=$(ED_ROOT)/build/

# HDF 5  Libraries
USE_HDF5=1
HDF5_INCS=-I/usr/include/hdf5/serial
HDF5_LIBS=-L/usr/lib/x86_64-linux-gnu/hdf5/serial -lz -lhdf5_fortran -lhdf5 -lhdf5_hl
#HDF5_INCS=-I/usr/include
#HDF5_LIBS=-lz -lhdf5_fortran -lhdf5 -lhdf5_hl
USE_COLLECTIVE_MPIO=0

# netCDF libraries
USENC=0
NC_LIBS=-L/dev/null

# interface
USE_INTERF=1

# MPI_Wtime
USE_MPIWTIME=1

# gfortran
CMACH=PC_LINUX1
F_COMP=mpif90
F_OPTS=-O0 -ffree-line-length-none -fno-whole-file -fbacktrace -fcheck=all,no-array-temps -Wall
C_COMP=mpicc
C_OPTS=-O0 -g -static
LOADER=mpif90
LOADER_OPTS=${F_OPTS}
C_LOADER=mpicc
LIBS=
MOD_EXT=mod

# using MPI libraries:
MPI_PATH=
PAR_INCS=
PAR_LIBS=
PAR_DEFS=-DRAMS_MPI

# For IBM,HP,SGI,ALPHA,LINUX use these:
ARCHIVE=ar rs

julianpistorius commented 3 years ago

Got an error:

...
 - Simulating:   10/01/2004 00:00:00 UTC
 - Simulating:   10/02/2004 00:00:00 UTC
At line 1114 of file canopy_struct_dynamics.f90
Fortran runtime error: Index '0' of dimension 1 of array 'canstr%lad8' below lower bound of 1

Error termination. Backtrace:
#0  0x7ff7f502ed01 in ???
#1  0x7ff7f502f849 in ???
#2  0x7ff7f502fec6 in ???
#3  0x5607ba66d37b in ???
#4  0x5607ba2f70d6 in ???
#5  0x5607ba324c82 in ???
#6  0x5607b9a80968 in ???
#7  0x5607b9959879 in ???
#8  0x5607b9955270 in ???
#9  0x5607b9955354 in ???
#10  0x7ff7f4cd90b2 in ???
#11  0x5607b9953a4d in ???
#12  0xffffffffffffffff in ???
ERROR IN MODEL RUN

xiangtaoxu commented 3 years ago

@julianpistorius It is interesting to have an out-of-bounds error...

It seems kapartial is zero, likely caused by a zero value in ncanlyr?

kapartial = min(ncanlyr,floor ((hbotcrown * zztop0i8)**ehgti8) + 1)

I do not have time right now to track the calculation of ncanlyr, but my guess is that some inappropriate parameter in the XML cascades into the calculation of ncanlyr. Will check later.
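To see how the quoted expression behaves, here is a minimal Python sketch that mirrors it outside the model. The function name and the sample crown-bottom heights are illustrative assumptions; ncanlyr = 100 is the default mentioned in this thread, and the zztop0i8 and ehgti8 values are the ones printed in the diagnostic output further down. It reproduces the arithmetic only and is not ED2 code.

```python
import math


def kapartial(hbotcrown, zztop0i8, ehgti8, ncanlyr=100):
    """Mirror of the quoted Fortran expression:
    min(ncanlyr, floor((hbotcrown * zztop0i8)**ehgti8) + 1)
    """
    return min(ncanlyr, math.floor((hbotcrown * zztop0i8) ** ehgti8) + 1)


# Values copied from the diagnostic output reported in this thread.
ZZTOP0I8 = 20.0
EHGTI8 = 0.674812

# Illustrative crown-bottom heights (assumed, not taken from the model).
for hbot in (0.0, 0.01, 1.0, 10.0, 1000.0):
    print(f"hbotcrown={hbot:8.2f}  kapartial={kapartial(hbot, ZZTOP0I8, EHGTI8)}")
```

In this sketch the result stays between 1 and ncanlyr for any non-negative crown-bottom height, so a zero index from this particular expression would require ncanlyr < 1 or a negative or non-finite intermediate value.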

mpaiao commented 3 years ago

From the files shared, it doesn't look like ncanlyr was changed in the xml file, so the default value should be 100. I don't see how kapartial could be zero, other than through some very strange rounding error that makes the (hbotcrown * zztop0i8)**ehgti8 term less than zero. If you have access to a debugger, you may be able to inspect and extract all these values when the model crashes.

Otherwise, good old print statements may help. Add this temporary chunk of code right after line 1090 (which defines kzfull):

               !---------------------------------------------------------------------------!
               !     Temporary sanity check.                                               !
               !---------------------------------------------------------------------------!
               if ( kapartial < 1 .or. kapartial > ncanlyr .or.                            &
                    kzpartial < 1 .or. kzpartial > ncanlyr .or.                            &
                    kafull    < 1 .or. kafull    > ncanlyr .or.                            &
                    kzfull    < 1 .or. kzfull    > ncanlyr ) then
                  write(unit=*,fmt='(a)'          ) '---------------------------------'
                  write(unit=*,fmt='(a)'          ) 'Layer indices are out of range!'
                  write(unit=*,fmt='(a)'          ) '---------------------------------'
                  write(unit=*,fmt='(a,1x,i12)'   ) 'NCANLYR   =',ncanlyr
                  write(unit=*,fmt='(a,1x,i12)'   ) 'PFT       =',cpatch%pft (ico)
                  write(unit=*,fmt='(a,1x,es12.5)') 'DBH       =',cpatch%dbh (ico)
                  write(unit=*,fmt='(a,1x,es12.5)') 'HEIGHT    =',cpatch%hite(ico)
                  write(unit=*,fmt='(a,1x,es12.5)') 'LAI       =',cpatch%lai (ico)
                  write(unit=*,fmt='(a,1x,es12.5)') 'WAI       =',cpatch%wai (ico)
                  write(unit=*,fmt='(a,1x,es12.5)') 'HBOTCROWN =',hbotcrown
                  write(unit=*,fmt='(a,1x,es12.5)') 'HTOPCROWN =',htopcrown
                  write(unit=*,fmt='(a,1x,es12.5)') 'ZZTOP0I8  =',zztop0i8
                  write(unit=*,fmt='(a,1x,es12.5)') 'EHGTI8    =',ehgti8
                  write(unit=*,fmt='(a,1x,i12)'   ) 'KAPARTIAL =',kapartial
                  write(unit=*,fmt='(a,1x,i12)'   ) 'KAFULL    =',kafull
                  write(unit=*,fmt='(a,1x,i12)'   ) 'KZPARTIAL =',kzpartial
                  write(unit=*,fmt='(a,1x,i12)'   ) 'KZFULL    =',kzfull
                  write(unit=*,fmt='(a)'          ) '---------------------------------'
                  call fatal_error('Invalid canopy layer indices.','canopy_turbulence8'    &
                                  ,'canopy_struct_dynamics.f90')
               end if
               !---------------------------------------------------------------------------!

and run the code again. This should print all the information needed to understand what is happening.

julianpistorius commented 3 years ago

Great! Thanks @xiangtaoxu & @mpaiao. That helps a lot. I'll let you know what happens.

julianpistorius commented 3 years ago

@xiangtaoxu & @mpaiao - this is what the output was:

 - Simulating:   10/01/2004 00:00:00 UTC
 - Simulating:   10/02/2004 00:00:00 UTC
---------------------------------
Layer indices are out of range!
---------------------------------
NCANLYR   =          100
PFT       =            1
DBH       =  0.00000E+00
HEIGHT    =  0.00000E+00
LAI       =  0.00000E+00
WAI       =  0.00000E+00
HBOTCROWN =  5.00000E-02
HTOPCROWN =  0.00000E+00
ZZTOP0I8  =  2.00000E+01
EHGTI8    =  6.74812E-01
KAPARTIAL =            1
KAFULL    =            2
KAPARTIAL =            1
KZFULL    =            0
---------------------------------
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

--------------------------------------------------------------
                     !!! FATAL ERROR !!!                      
--------------------------------------------------------------
    ---> File:        canopy_struct_dynamics.f90
    ---> Subroutine:  canopy_turbulence8
    ---> Reason:      Invalid canopy layer indices.
--------------------------------------------------------------
 ED execution halts (see previous error message)...
--------------------------------------------------------------
Note: The following floating-point exceptions are signalling: IEEE_DIVIDE_BY_ZERO
STOP fatal_error
ERROR IN MODEL RUN

julianpistorius commented 3 years ago

I'm guessing htopcrown should not be 0.
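One way to triage a diagnostic block like the one above is to parse it and re-apply the range test from the temporary sanity check. The sketch below is a hypothetical Python helper, not part of ED2. The function names are made up for illustration, and the rule that DBH and HEIGHT must be positive is an assumption added here.

```python
def parse_diagnostics(text):
    """Return {NAME: float} for every 'NAME = value' line in the block."""
    values = {}
    for line in text.splitlines():
        name, sep, raw = line.partition("=")
        if not sep:
            continue  # separator lines and messages have no '='
        try:
            values[name.strip()] = float(raw)
        except ValueError:
            continue  # right-hand side is not numeric
    return values


def out_of_range(values, keys=("KAPARTIAL", "KZPARTIAL", "KAFULL", "KZFULL")):
    """Layer indices that fall outside 1..NCANLYR (missing keys are skipped)."""
    ncanlyr = values["NCANLYR"]
    return [k for k in keys if k in values and not 1 <= values[k] <= ncanlyr]


def non_positive(values, keys=("DBH", "HEIGHT")):
    """Cohort properties assumed here to require strictly positive values."""
    return [k for k in keys if k in values and values[k] <= 0.0]


# Excerpt of the diagnostic output posted in this thread.
LOG = """\
NCANLYR   =          100
PFT       =            1
DBH       =  0.00000E+00
HEIGHT    =  0.00000E+00
HBOTCROWN =  5.00000E-02
HTOPCROWN =  0.00000E+00
KAPARTIAL =            1
KAFULL    =            2
KZFULL    =            0
"""

values = parse_diagnostics(LOG)
print("out of range:", out_of_range(values))
print("non-positive:", non_positive(values))
```

On the values logged in this thread, the helper flags KZFULL as out of range and DBH and HEIGHT as non-positive.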

mpaiao commented 3 years ago

Htopcrown is zero because this is a strange singularity. Cohorts should never have zero height or dbh. I don't know how this is happening. My first suggestion would be to "hide" config.xml and run the model with the default parameters, to see if the problem persists.

xiangtaoxu commented 3 years ago

@julianpistorius yes. htopcrown becomes zero, which causes the problem

Since htopcrown = dble(cpatch%hite(ico)), this suggests that the tallest cohort in the patch has a height of zero. To me it reflects a problem with allometry. You might want to go back and check your height allometry parameters in the XML (e.g. some values in the following parameters must be causing problems for PFT 1, which is a grass PFT).

0.1000000015 0.0351999998 0.6940000057 61.7000007629 0.5 1.5 0.1211817637 0.5971379876 10 0.5971379876 0.1576947272 0.1576947272 0.9749494195 0.9749494195 0.7442804575 0.0627227873 0.0647229999 2.4323608875 2.4255735874

I would suggest you plot your height and biomass allometry offline to see what might be wrong.
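An offline check of this kind can start as a simple table before any plotting. The Python sketch below tabulates height against DBH and lists rows where the height is not strictly positive. The functional form h = min(h_max, exp(b1 + b2 ln(dbh))) and the coefficients are placeholders chosen for illustration, not ED2's allometry and not the values from config.xml; substitute the equations from the ED2 source and your own parameters.

```python
import math


def height_from_dbh(dbh_cm, b1, b2, h_max):
    """Placeholder allometry (assumed form, NOT ED2's equation):
    h = min(h_max, exp(b1 + b2 * ln(dbh))), with h = 0 for dbh <= 0."""
    if dbh_cm <= 0.0:
        return 0.0
    return min(h_max, math.exp(b1 + b2 * math.log(dbh_cm)))


def tabulate(b1, b2, h_max, dbh_values):
    """Return [(dbh, height), ...] for the given DBH values."""
    return [(d, height_from_dbh(d, b1, b2, h_max)) for d in dbh_values]


def bad_rows(rows):
    """Rows whose height is zero, negative, or NaN."""
    return [(d, h) for d, h in rows if not h > 0.0]


# Hypothetical coefficients, for illustration only.
rows = tabulate(b1=0.5, b2=0.6, h_max=35.0, dbh_values=[0.0, 0.1, 1.0, 10.0, 100.0])
for dbh, height in rows:
    print(f"dbh={dbh:7.2f} cm  height={height:7.3f} m")
print("suspicious rows:", bad_rows(rows))
```

The same rows can be passed to a plotting library once the table looks plausible; any row reported by `bad_rows` points at a DBH range where the chosen parameters produce a cohort with no height.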

xiangtaoxu commented 3 years ago

@mpaiao wrote: Htopcrown is zero because this is a strange singularity. Cohorts should never have zero height or dbh. I don't know how this is happening. My first suggestion would be to "hide" config.xml and run the model with the default parameters, to see if the problem persists.

Agree with Marcos

julianpistorius commented 3 years ago

Thank you for your suggestions. I'm going to ask one of my team members to try to plot the height and biomass allometry, while I try the following:

My first suggestion would be to "hide" config.xml...

I'm not sure what the minimum required configuration for ED2 is. What's the best way to do this?

Some options I could think of, ranging from most extreme to least extreme:

a. Rename config.xml
b. Make an empty config.xml
c. Make a config.xml with no <pft> tags
d. Make a config.xml with the same <pft> tags, but remove all child tags other than the <num> tags

mpaiao commented 3 years ago

You can move config.xml to a different location so you don't lose it, and run the simulation again. If ED2 doesn't find the xml file, it will use the default parameters.
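Moving the file aside can be done by hand or scripted as part of a job setup step. Here is a minimal Python sketch under the assumption that the file is called config.xml and sits in the run directory; the function name and both directory arguments are hypothetical.

```python
import shutil
from pathlib import Path


def stash_config(run_dir, stash_dir, name="config.xml"):
    """Move `name` out of run_dir into stash_dir so it is kept but not read.

    Returns the new path, or None if there was nothing to move.
    Refuses to overwrite a file that was stashed earlier.
    """
    source = Path(run_dir) / name
    if not source.exists():
        return None
    target_dir = Path(stash_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    if target.exists():
        raise FileExistsError(f"{target} already exists; not overwriting")
    shutil.move(str(source), str(target))
    return target
```

Calling `stash_config("run", "run/stash")` before launching the model leaves the original parameters recoverable, and moving the file back restores the previous setup.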

julianpistorius commented 3 years ago

Update: I moved the config.xml file and reran the job. It has passed the date where it crashed last time (10/02/2004) and has not crashed yet.

Will update here if it finishes successfully, or crashes with some other interesting error.

julianpistorius commented 3 years ago

The run completed successfully with the default parameters. Thank you both very much. I'll work with my colleagues on figuring out what in the config.xml is breaking things and let you know what we figure out.

KristinaRiemer commented 3 years ago

@mpaiao are the default parameter values for ED2 documented somewhere? Maybe in the repo?

mpaiao commented 3 years ago

All the default parameter values in ED2 are assigned in ed_params.f90. I also included most of them in the supporting information of our GMD paper. However, as new features are added in the model, the GMD tables will become less comprehensive over time.

At least with the most up-to-date version, if a variable is assigned through xml then xml has the last word. Otherwise, ED2 will use the default values from ed_params.f90. If I remember correctly, @femeunier changed the code a few months ago to always print an xml file with all the parameters, so if you don't provide an xml, ED2 will generate one, which may be useful for comparing your settings with the defaults.
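The comparison between a custom xml and the generated default one could be sketched in Python as follows. The only structure taken from this thread is that parameters live in <pft> elements identified by a <num> child; the <config> root, the tag names param_a and param_b, and the function names are placeholders for illustration.

```python
import xml.etree.ElementTree as ET


def pft_params(xml_text):
    """Map PFT number -> {tag: text} for every <pft> element with a <num>."""
    root = ET.fromstring(xml_text)
    result = {}
    for pft in root.iter("pft"):
        num = pft.findtext("num")
        if num is None:
            continue
        result[int(num)] = {
            child.tag: (child.text or "").strip()
            for child in pft
            if child.tag != "num"
        }
    return result


def diff_params(user_xml, default_xml):
    """List (pft, tag, default_value, user_value) where the two files differ.

    Values are compared as strings, so '0.5' and '0.50' count as different.
    """
    user, default = pft_params(user_xml), pft_params(default_xml)
    changes = []
    for num, params in sorted(user.items()):
        base = default.get(num, {})
        for tag, value in sorted(params.items()):
            if base.get(tag) != value:
                changes.append((num, tag, base.get(tag), value))
    return changes


# Hypothetical file contents; real tag names come from your own files.
USER = "<config><pft><num>1</num><param_a>0.5</param_a><param_b>2</param_b></pft></config>"
DEFAULT = "<config><pft><num>1</num><param_a>0.4</param_a><param_b>2</param_b></pft></config>"

for change in diff_params(USER, DEFAULT):
    print(change)
```

Reverting the reported differences one at a time is one way to narrow down which parameter makes the run fail.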

dlebauer commented 2 years ago

This issue can be closed