geoschem / GCHP

The "superproject" wrapper repository for GCHP, the high-performance instance of the GEOS-Chem chemical-transport model.
https://gchp.readthedocs.io

[BUG/ISSUE] Trial run with 13.0.0-alpha.9 version crashes after ~1 simulation hour and gives floating divide by zero error. #37

Closed joeylamcy closed 3 years ago

joeylamcy commented 4 years ago

Hi everyone,

I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour and exited with forrtl: error (73): floating divide by zero. The full log files are attached below. 163214_print_out.log 163214_error.log

More information:

I'm not sure how to troubleshoot this issue. I tried to cmake the source code with -DCMAKE_BUILD_TYPE=Debug (with the fix in #35) and rerun the simulation, but it gives a really large error log file so I'm not attaching it here. The first few lines of the error log are:

forrtl: error (63): output conversion error, unit -5, file Internal Formatted Write
Image              PC                Routine            Line        Source
geos               00000000094A364E  Unknown               Unknown  Unknown
geos               00000000094F8D62  Unknown               Unknown  Unknown
geos               00000000094F6232  Unknown               Unknown  Unknown
geos               000000000226CC73  advcore_gridcompm         261  AdvCore_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos               0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos               0000000007F00A0D  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               0000000007F01D4E  Unknown               Unknown  Unknown
geos               0000000007F01A85  Unknown               Unknown  Unknown
geos               0000000007EE1304  Unknown               Unknown  Unknown
geos               0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos               0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos               0000000007F00B39  Unknown               Unknown  Unknown
geos               0000000007F0470B  Unknown               Unknown  Unknown
geos               00000000083BF095  Unknown               Unknown  Unknown
geos               0000000007F0219A  Unknown               Unknown  Unknown
geos               000000000844804D  Unknown               Unknown  Unknown
geos               0000000007EE2A0F  Unknown               Unknown  Unknown
geos               0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos               0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos               0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos               0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos               0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos               00000000004242FF  MAIN__                     29  GCHPctm.F90
geos               000000000042125E  Unknown               Unknown  Unknown
geos               000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002AFBC9F34505  __libc_start_main     Unknown  Unknown
geos               0000000000421169  Unknown               Unknown  Unknown

I also noticed something weird towards the start of the run:

      MAPL: No configure file specified for logging layer.  Using defaults. 
     SHMEM: NumCores per Node = 6
     SHMEM: NumNodes in use   = 1
     SHMEM: Total PEs         = 6
     SHMEM: NumNodes in use  = 1

Previous versions (e.g. 12.8.2) usually show this instead:

 In MAPL_Shmem:
     NumCores per Node =            6
     NumNodes in use   =            1
     Total PEs         =            6

 In MAPL_InitializeShmem (NodeRootsComm):
     NumNodes in use   =            1

but I'm not sure if that matters.

sdeastham commented 4 years ago

I suspect that the error in AdvCore_GridCompMod is misleading, but that is something we should fix. In AdvCore_GridCompMod.F90, ntracers is set to 11 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86). However, this leads to a formatting error, because a later loop for N = 1, ntracers tries to write to a string using a single-digit integer format with N-1 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L260-L269). The fix for that particular error is obvious - just set ntracers to 10 (ntracers doesn't seem to be a particularly important variable, and is only used to define these "test outputs").

I found that setting ntracers=10 does fix this error and allows you to find whatever the REAL error is. @lizziel we should raise this with GMAO and kick a pull request up the chain!

joeylamcy commented 4 years ago

Oh, right. I checked the log for the debug run again, and actually the run ended way earlier than without the debug flag, so I suppose using -DCMAKE_BUILD_TYPE=Debug doesn't help me here.

sdeastham commented 4 years ago

It should help - you'll just need to fix the ntracers issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems) but those won't stop the run and can be safely ignored.

EDIT: By "fix the ntracers issue, I literally mean change the line ntracers = 11 to ntracers = 10 in AdvCore_GridCompMod.F90 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer..

lizziel commented 4 years ago

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

lizziel commented 4 years ago

Regarding the debug flags issue, I created an issue on GEOS-ESM/FVdycoreCubed_GridComp: https://github.com/GEOS-ESM/FVdycoreCubed_GridComp/issues/71.

joeylamcy commented 4 years ago

Hi @joeylamcy, the original error you reported is occurring in MAPL history during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help hone in on the problem.

Yes. If all collections in HISTORY.rc are commented out, the run continues smoothly. But turning on any number of the collections seems to cause the problem, i.e. it is not specific to any one collection.

joeylamcy commented 4 years ago

It should help - you'll just need to fix the ntracers issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems) but those won't stop the run and can be safely ignored.

EDIT: By "fix the ntracers issue, I literally mean change the line ntracers = 11 to ntracers = 10 in AdvCore_GridCompMod.F90 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer..

Actually I tried it, but there are some further issues. The printout is still stuck at

NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE     0: tracer_manager_init : No tracers are available to be registered.
 ncnst=           0  num_prog=           0  pnats=           0  dnats=
           0  num_family=           0

 Grid distance at face edge (km)=   163384.217664128     

and the error log grows at a rate of ~100MB/min for at least 5 minutes, so I just manually stopped the run. The leading error is still in AdvCore_GridCompMod.F90.

forrtl: warning (406): fort: (1): In call to MPI_GROUP_INCL, an array temporary was created for argument #3

Image              PC                Routine            Line        Source
geos.debug         00000000094A5440  Unknown               Unknown  Unknown
geos.debug         0000000005A6D950  mpp_mod_mp_get_pe         109  mpp_util_mpi.inc
geos.debug         0000000005A9EE8F  mpp_mod_mp_mpp_in          55  mpp_comm_mpi.inc
geos.debug         0000000004A0B3A6  fms_mod_mp_fms_in         342  fms.F90
geos.debug         000000000226E3A3  advcore_gridcompm         311  AdvCore_GridCompMod.F90
geos.debug         0000000007F00A0D  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         0000000007F01D4E  Unknown               Unknown  Unknown
geos.debug         0000000007F01A85  Unknown               Unknown  Unknown
geos.debug         0000000007EE1304  Unknown               Unknown  Unknown
geos.debug         0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos.debug         0000000006829035  mapl_genericmod_m        4580  MAPL_Generic.F90
geos.debug         0000000000425200  gchp_gridcompmod_         138  GCHP_GridCompMod.F90
geos.debug         0000000007F00A0D  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         0000000007F01D4E  Unknown               Unknown  Unknown
geos.debug         0000000007F01A85  Unknown               Unknown  Unknown
geos.debug         0000000007EE1304  Unknown               Unknown  Unknown
geos.debug         0000000006827DDA  mapl_genericmod_m        4545  MAPL_Generic.F90
geos.debug         0000000006A52D6C  mapl_capgridcompm         482  MAPL_CapGridComp.F90
geos.debug         0000000007F00B39  Unknown               Unknown  Unknown
geos.debug         0000000007F0470B  Unknown               Unknown  Unknown
geos.debug         00000000083BF095  Unknown               Unknown  Unknown
geos.debug         0000000007F0219A  Unknown               Unknown  Unknown
geos.debug         000000000844804D  Unknown               Unknown  Unknown
geos.debug         0000000007EE2A0F  Unknown               Unknown  Unknown
geos.debug         0000000006A67F42  mapl_capgridcompm         848  MAPL_CapGridComp.F90
geos.debug         0000000006A39B5E  mapl_capmod_mp_ru         321  MAPL_Cap.F90
geos.debug         0000000006A370A7  mapl_capmod_mp_ru         198  MAPL_Cap.F90
geos.debug         0000000006A344ED  mapl_capmod_mp_ru         157  MAPL_Cap.F90
geos.debug         0000000006A32B5F  mapl_capmod_mp_ru         131  MAPL_Cap.F90
geos.debug         00000000004242FF  MAIN__                     29  GCHPctm.F90
geos.debug         000000000042125E  Unknown               Unknown  Unknown
libc-2.17.so       00002B9C8AD6A505  __libc_start_main     Unknown  Unknown
geos.debug         0000000000421169  Unknown               Unknown  Unknown
sdeastham commented 4 years ago

The array temporary warnings are irrelevant - given enough time, the code should still reach the actual error - but I agree that it's not really helpful to have them padding the error log. They also slow the run down considerably, so although the printout appears stuck, it should eventually clear.

@LiamBindle - can you recommend a preferred way to suppress array temporary warnings in FV3 using CMake? I can imagine that one could do this by editing the contents of ESMA_cmake, but that seems non-ideal.

@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).

  1. Can you verify that your copy of MAPL_HistoryGridComp.F90 also has call o_Clients%done_collective_stage() on line 3570? That will give us a thread to tug on with GMAO.
  2. Can you post the output of ifort --version, nc-config --all, and nf-config --all? It seems like something is going amiss deep in NetCDF.
  3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4)?
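For convenience, here are those checks as shell commands (the MAPL path is a guess based on a typical GCHPctm checkout; adjust as needed):

sed -n '3570p' src/MAPL/base/MAPL_HistoryGridComp.F90   # should print: call o_Clients%done_collective_stage()
ifort --version
nc-config --all
nf-config --all
ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4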
lizziel commented 4 years ago

I noticed you are running a c48 standard simulation with 6 cores and 3G per core across 1 node, if the log file prints are to be trusted. It surprises me that the simulation ran without running out of memory. You can try upping your resources and lowering your resolution to c24 to see if that makes a difference at all for the diagnostics.

Also try commenting out individual collections to see if there is a specific history collection consistently causing the problem.

Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.
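As an illustration of the first option (the variable names below are placeholders; use whatever your copy of runConfig.sh actually calls them):

# In runConfig.sh -- hypothetical variable names, HHmmss format
Diag_Frequency="240000"   # write each collection once per day...
Diag_Duration="240000"    # ...into one file per day, so frequency and duration match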

LiamBindle commented 4 years ago

I believe those temporary array warnings can be suppressed with -check,noarg_temp_created.

Unfortunately, as you suspected @sdeastham, I think manually adding -check,noarg_temp_created to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this cleaner, but I'll leave that for another thread.

Let me know if you run into any problems suppressing those temporary array warnings @joeylamcy!

lizziel commented 4 years ago

I am going to put this update into the GCHPctm 13.0.0-alpha.10 pre-release.

LiamBindle commented 4 years ago

@lizziel I think you can do "SHELL:-check noarg_temp_created" to get it to work for ifort 18 and 19, if ifort 19 doesn't like the comma.

lizziel commented 4 years ago

Following up about the original issue, we have another report of a similar divide by zero floating point error while writing diagnostics:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
geos               0000000001FBCA6F  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AD70A4335D0  Unknown               Unknown  Unknown
libnetcdf.so.13.0  00002AD706AE6A14  Unknown               Unknown  Unknown
libnetcdf.so.13.0  00002AD706AE4B4B  NC4_def_var           Unknown  Unknown
libnetcdf.so.13.0  00002AD706A10B5B  nc_def_var            Unknown  Unknown
libnetcdff.so.6.1  00002AD706524DB4  nf_def_var_           Unknown  Unknown
geos               0000000001B0E765  m_netcdf_io_defin         218  m_netcdf_io_define.F90
geos               0000000001B62855  ncdf_mod_mp_nc_va        3866  ncdf_mod.F90
geos               000000000187B19E  history_netcdf_mo         465  history_netcdf_mod.F90
geos               0000000001876EB6  history_mod_mp_hi        2925  history_mod.F90
geos               0000000000412C17  MAIN__                   2076  main.F90
geos               000000000040C4DE  Unknown               Unknown  Unknown
libc-2.17.so       00002AD70A8663D5  __libc_start_main     Unknown  Unknown
geos               000000000040C3E9  Unknown               Unknown  Unknown

This was using an older version of GCHPctm. It was fixed by switching to a different set of libraries, including netcdf. Try honing in on @sdeastham's suggestion:

Can you post the output of ifort --version, nc-config --all, and nf-config --all? It seems like something is going amiss deep in NetCDF.

I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?

joeylamcy commented 4 years ago

@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).

1. Can you verify that your copy of `MAPL_HistoryGridComp.F90` also has `call o_Clients%done_collective_stage()` on line 3570? That will give us a thread to tug on with GMAO.

Yes.

2. Can you post the output of `ifort --version`, `nc-config --all`, and `nf-config --all`? It seems like something is going amiss deep in NetCDF.
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ifort --version
ifort (IFORT) 18.0.2 20180210
Copyright (C) 1985-2018 Intel Corporation.  All rights reserved.

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nc-config --all

This netCDF 4.6.1 has been built with the following features: 

  --cc        -> icc
  --cflags    -> -I/opt/share/netcdf-4.6.1/include 
  --libs      -> -L/opt/share/netcdf-4.6.1/lib -lnetcdf

  --has-c++   -> no
  --cxx       -> 

  --has-c++4  -> no
  --cxx4      -> 

  --has-fortran-> yes
  --fc        -> ifort
  --fflags    -> -I/opt/share/netcdf-4.6.1/include
  --flibs     -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf
  --has-f90   -> no
  --has-f03   -> yes

  --has-dap   -> yes
  --has-dap4  -> yes
  --has-nc2   -> yes
  --has-nc4   -> yes
  --has-hdf5  -> yes
  --has-hdf4  -> no
  --has-logging-> no
  --has-pnetcdf-> no
  --has-szlib -> no
  --has-parallel -> no
  --has-cdf5 -> yes

  --prefix    -> /opt/share/netcdf-4.6.1
  --includedir-> /opt/share/netcdf-4.6.1/include
  --libdir    -> /opt/share/netcdf-4.6.1/lib
  --version   -> netCDF 4.6.1

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nf-config --all

This netCDF-Fortran 4.4.4 has been built with the following features: 

  --cc        -> icc
  --cflags    ->  -I/opt/share/netcdf-4.6.1/include 

  --fc        -> ifort
  --fflags    -> -I/opt/share/netcdf-4.6.1/include
  --flibs     -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf 
  --has-f90   -> no
  --has-f03   -> yes

  --has-nc2   -> yes
  --has-nc4   -> yes

  --prefix    -> /opt/share/netcdf-4.6.1
  --includedir-> /opt/share/netcdf-4.6.1/include
  --version   -> netCDF-Fortran 4.4.4

Side note: during cmake .., there was an error saying that hdf5 is missing, so I manually exported CMAKE_PREFIX_PATH=/opt/share/hdf5-1.10.2. If this matters, feel free to let me know and I can try reproducing that.
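For reference, the workaround was simply:

export CMAKE_PREFIX_PATH=/opt/share/hdf5-1.10.2
cmake ..   # re-run the configure step after setting the path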

3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run `ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4`)?

HDF errors, and the file sizes are obviously not right either. Meanwhile, ncdump-ing the MERRA-2 data works fine.

[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.DryDep.20160701_0030z.nc4 
ncdump: OutputDir/GCHP.DryDep.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4 
ncdump: OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ls -lh OutputDir/
total 12K
-rw-r--r--. 1 s1155064480 AmosTai 23 Aug 27 20:09 FILLER
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep  2 17:18 GCHP.DryDep.20160701_0030z.nc4
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep  2 17:18 GCHP.SpeciesConc.20160701_0030z.nc4

Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.

I changed duration to 1-hr and run start/end/duration to 1-hr as well, but the same floating divide by zero error occurs.

I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?

I didn't try other alpha versions, but we used these libraries when building 12.8.2 of the old GCHP.

joeylamcy commented 4 years ago

I believe those temporary array warnings can be suppressed with -check,noarg_temp_created.

Unfortunately, as you suspected @sdeastham, I think manually adding -check,noarg_temp_created to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this cleaner, but I'll leave that for another thread.

Let me know if you run into any problems suppressing those temporary array warnings @joeylamcy!

Yep, it works and only a few warnings are left before the error messages. Now I get the same errors as @lizziel did in https://github.com/GEOS-ESM/ESMA_cmake/issues/125#issuecomment-685043199

lizziel commented 4 years ago

I made a fix for the errors you are now getting with debug flags on. See https://github.com/GEOS-ESM/FVdycoreCubed_GridComp/issues/71#.

sdeastham commented 4 years ago

I'm very suspicious about the issue with HDF5 during cmake; @LiamBindle , any thoughts?

joeylamcy commented 4 years ago

I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.

I'm still getting those errors (from GetPointer.H and MAPL_Generic.F90) after the fix, though.

lizziel commented 4 years ago

Did you also move the conditional for N <= ntracers? That solved it for me. Regardless, I now get past advection and am getting a new error in History. This is a problem in the GMAO MAPL library. I think it is safe to say that using debug flags in GCHP is not yet fully working. I am working with GMAO to get fixes for the bugs I am finding into their code.

I agree with @sdeastham that the focus for your issue should be on the netcdf/HDF5 library. Could you post your environment file, CMakeCache.txt, CMakeFiles/CMakeError.log, and CMakeFiles/CMakeOutput.log?

We also now have documentation on how to build libraries for GCHPctm on Spack. We are looking for beta users to try it out. Are you interested in trying this out? It may solve the issue.

LiamBindle commented 4 years ago

I'm a bit surprised it didn't pick up HDF5 automatically considering it picked up NetCDF automatically, but @joeylamcy did the correct thing in pointing CMake to the appropriate HDF5 library with CMAKE_PREFIX_PATH.

The fact that it's crashing in a nc_def_var call (deep in HISTORY) after writing 96 bytes suggests, to me, that it's something obscure to do with NetCDF. The fact that the simulation runs okay when output collections are turned off supports that too. It looks like the checkpoint file is being written okay, so the failure isn't consistent across all NetCDF writes. I would agree with the suggestions to

  1. Try a different version of NetCDF/NetCDF-Fortran
  2. Increase the resources (I'm surprised 18 GB is enough for C48)
joeylamcy commented 4 years ago

gchp_13.0.0.env.txt CMakeCache.txt CMakeError.log CMakeOutput.log

netcdf-4.6.1 is sourced upon login. The shell script is as follows:

export NETCDF=/opt/share/netcdf-4.6.1
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include/:$INCLUDE
joeylamcy commented 4 years ago

We also now have documentation on how to build libraries for GCHPctm on Spack. We are looking for beta users to try it out. Are you interested in trying this out? It may solve the issue.

It looks promising. Does spack need root access?

WilliamDowns commented 4 years ago

Spack does not require root access. Those instructions should be fine for getting setup with OpenMPI and GNU compilers; Intel MPI and/or Intel compilers also work but require a bit more setup that we haven't written out yet on the Wiki. You also won't need to manually define as many environment variables when loading NetCDF through Spack / other package managers. I've pasted a working environment file below (change SPACK_ROOT and ESMF_DIR as needed):

spack unload
export SPACK_ROOT=/path/to/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load emacs
#==============================================================================
# %%%%% Load Spackages %%%%%
#==============================================================================
spack load gcc@9.3.0
spack load git%gcc@9.3.0
spack load cmake%gcc@9.3.0
spack load openmpi%gcc@9.3.0
spack load netcdf-fortran%gcc@9.3.0^openmpi

export MPI_ROOT=$(spack location -i openmpi)

# Make all files world-readable by default
umask 022

# Specify compilers
export CC=gcc
export CXX=g++
export FC=gfortran

# For ESMF
export ESMF_COMPILER=gfortran
export ESMF_COMM=openmpi
export ESMF_DIR=/path/to/ESMF
export ESMF_INSTALL_PREFIX=${ESMF_DIR}/INSTALL_openmpi_gfortran93
# For GCHP
export ESMF_ROOT=${ESMF_INSTALL_PREFIX}

#==============================================================================
# Set limits
#==============================================================================

#ulimit -c 0                      # coredumpsize
export OMP_STACKSIZE=500m
ulimit -l unlimited              # memorylocked
ulimit -u 50000                  # maxproc
ulimit -v unlimited              # vmemoryuse
ulimit -s unlimited              # stacksize

#==============================================================================
# Print information
#==============================================================================

#module list
echo ""
echo "Environment:"
echo ""
echo "CC: ${CC}"
echo "CXX: ${CXX}"
echo "FC: ${FC}"
echo "ESMF_COMM: ${ESMF_COMM}"
echo "ESMF_COMPILER: ${ESMF_COMPILER}"
echo "ESMF_DIR: ${ESMF_DIR}"
echo "ESMF_INSTALL_PREFIX: ${ESMF_INSTALL_PREFIX}"
echo "ESMF_ROOT: ${ESMF_ROOT}"
echo "MPI_ROOT: ${MPI_ROOT}"
echo "NetCDF C: $(nc-config --prefix)"
#echo "NetCDF Fortran: $(nf-config --prefix)"
echo ""
echo "Done sourcing ${BASH_SOURCE[0]}"
lizziel commented 4 years ago

I noticed that one of your outputs lists this as your netcdf-fortran version: --version -> netCDF-Fortran 4.4.4

The user who had the same issue as you was actually using GEOS-Chem Classic. But he found this:

Update is that the simulation appears to have successfully finished using Lizzie's new environment file. So I guess the old environment file I used to use with GEOS-Chem classic no longer works with version 12.9.3. My old environment file used netcdf-fortran/4.4.4-fasrc06 and yours uses netcdf-fortran/4.5.2-fasrc01

We definitely would love for you to try spack. Another route, however, is to see if you can get a newer netcdf-fortran version since at least one other person had an issue with 4.4.4 starting with GEOS-Chem 12.9.
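If you do try Spack, something along these lines should get you a recent netcdf-fortran (a sketch only; the compiler and MPI specs are assumptions chosen to match the environment file above):

spack install netcdf-fortran %gcc@9.3.0 ^openmpi   # builds a recent netcdf-fortran plus its netcdf-c dependency
spack find netcdf-fortran                          # confirm the installed version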

joeylamcy commented 4 years ago

I see a newer netcdf-fortran version. Do I also need to rebuild ESMF?

WilliamDowns commented 4 years ago

EDIT: Sorry, I'm not actually sure if you need to rebuild ESMF specifically when changing NetCDF-Fortran libraries. The GCST will likely be away from this thread until Tuesday, so in the meantime, if you run into any more issues, a rebuild of ESMF might help.

LiamBindle commented 4 years ago

@joeylamcy @WilliamDowns Yeah, you'll need to rebuild ESMF if you change NetCDF versions
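A rough sketch of that rebuild, assuming the ESMF_* variables from the environment file above are already set for the new NetCDF stack:

cd $ESMF_DIR
make distclean   # drop objects built against the old NetCDF
make -j          # rebuild ESMF against the newly loaded NetCDF
make install     # installs into $ESMF_INSTALL_PREFIX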

joeylamcy commented 3 years ago

Just want to post an update: I am able to finish a trial run with proper output using Intel compilers 19.0.4, Intel MPI, netcdf-c 4.7.1 and netcdf-fortran 4.5.2. However, I have not succeeded with any multi-node runs. Do you have any tested configurations of core counts and memory usage? Or perhaps any tips on multi-node runs in general?

WilliamDowns commented 3 years ago

So far with Intel MPI I've successfully done a test at c90 with the following settings in a slurm script using mpirun instead of srun (getting errors with srun that need to be sorted out):

#SBATCH -n 360
#SBATCH -N 12
#SBATCH --exclusive
#SBATCH -t 0-03:00
#SBATCH --mem=MaxMemPerNode

A 1 week run takes about 2 hours in this setup. I need to test with other setting configurations including lowering memory allocation. This is also with gfortran 9.3 rather than ifort. What sorts of issues are you running into with your multi-node runs?

WilliamDowns commented 3 years ago

I've now also successfully used srun without crashing (unclear if the run will complete in its allotted time), but you might find it finicky when trying to specify a PMI version that matches your cluster's Slurm setup (I cannot get Intel MPI to tolerate using PMIx, but it works fine with PMI2 on Harvard's Cannon cluster). To use srun, you'll need to set an extra environment variable, I_MPI_PMI_LIBRARY, to point to the PMI library used by Slurm, and then specify the corresponding PMI version in your call to srun. For example, to use PMI2 I set export I_MPI_PMI_LIBRARY=/path/to/libpmi2.so in my environment file and add --mpi=pmi2 to my srun call.
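In other words (the library path is site-specific and shown only as a placeholder):

# in the environment file
export I_MPI_PMI_LIBRARY=/usr/lib64/slurm/libpmi2.so   # placeholder; point this at your cluster's libpmi2.so
# in the job script
srun --mpi=pmi2 ./geos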

joeylamcy commented 3 years ago

So far with Intel MPI I've successfully done a test at c90 with the following settings in a slurm script using mpirun instead of srun (getting errors with srun that need to be sorted out):

#SBATCH -n 360
#SBATCH -N 12
#SBATCH --exclusive
#SBATCH -t 0-03:00
#SBATCH --mem=MaxMemPerNode

A 1 week run takes about 2 hours in this setup. I need to test with other setting configurations including lowering memory allocation. This is also with gfortran 9.3 rather than ifort. What sorts of issues are you running into with your multi-node runs?

Hmm, I don't think I will ever get 360 cores. And how much memory per node is this?

LiamBindle commented 3 years ago

Hi Joey,

Have you tried a C48 simulation on 2 nodes? If not, I'd recommend trying a simulation like that. If you have and it failed, could you share the run log?

A C48 simulation should have pretty similar resource requirements to a 2x2.5 GEOS-Chem Classic simulation. A C48 simulation can run on a single node, but trying it on two is a good way to test/try multinode simulations. Does your cluster run SLURM or is it a different scheduler? If it's SLURM then the example run scripts in the run directory might be useful to look through to see how they work.

If you aren't using SLURM, that's okay--our cluster here at WashU runs LSF, for example. Here's an example of an LSF job for a C48 simulation I ran a few weeks back.

#!/usr/bin/bash
#BSUB -q general
#BSUB -n 60
#BSUB -W 336:00
#BSUB -R "rusage[mem=100000] span[ptile=30] select[mem < 2000000]"
#BSUB -a 'docker(registry.gsc.wustl.edu/sleong/base-engineering-gcc)'
#BSUB -o lsf-run-%J-output.txt

# Source bashrc
. /etc/bashrc

# Set up runtime environment
set -x                           # Print executed commands
set -e                           # Exit immediately if a command fails
ulimit -c 0                      # coredumpsize
ulimit -l unlimited              # memorylocked
ulimit -u 50000                  # maxproc
ulimit -v unlimited              # vmemoryuse
ulimit -s unlimited              # stacksize

# Execute simulation
rm -f cap_restart gcchem*
chmod +x runConfig.sh geos
./runConfig.sh
export TMPDIR="$__LSF_JOB_TMPDIR__"
mpirun -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -np 24 ./geos

Note that this job asks for 60 cores across 2 nodes (ptile=30 means 30 cores per node), and 100 GB of memory per node. According to the post-job stats, the average memory usage for this simulation was ~90 GB (total) and its peak usage was ~125 GB. I'm not sure I'd trust those numbers too precisely, but they might give you a rough idea.

Also note that I'm using OpenMPI with GNU compilers here. When I use Intel MPI and ifort I don't need to set TMPDIR or LD_LIBRARY_PATH. Starting with a C48 simulation on 2 nodes should help you narrow in on any MPI-specific configuration settings that you might need.
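For a SLURM cluster, a rough analogue of a small 2-node C48 job might look like this (the partition, memory, environment file, and time limit are assumptions to adapt, and the total core count must match what you set in runConfig.sh):

#!/usr/bin/bash
#SBATCH -N 2                     # 2 nodes
#SBATCH -n 12                    # 6 cores per node
#SBATCH --mem=50G                # per-node memory (placeholder)
#SBATCH -t 0-06:00

. gchp.env                       # placeholder for your environment file
set -e
ulimit -s unlimited              # stacksize

rm -f cap_restart gcchem*
./runConfig.sh
mpirun -np 12 ./geos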

joeylamcy commented 3 years ago

While every attempt at running a c48 simulation on 2 nodes failed, the failures seem diverse and inconsistent. Firstly, on a single node, I used 6 cores and specified --mem=50G, and the run (JobID: 189464) finished properly. I then tried to use 2 nodes with 6 cores and 50G of memory on each node, but the run (JobID: 189493) failed right at the beginning. Then I tried 2 nodes with 6 cores and maximum memory (192G) per node, and those two runs (JobID: 189495 & 189499) failed during output, with different error messages. 189499_print_out.log 189499_error.log 189495_print_out.log 189495_error.log 189493_print_out.log 189493_error.log 189464_print_out.log

LiamBindle commented 3 years ago

It looks like this is most likely something to do with the MPI configuration--@lizziel @WilliamDowns do you have any ideas?

WilliamDowns commented 3 years ago

Out of curiosity, could you set FI_LOG_LEVEL=debug in your environment setup and post logs from your next run? The PMPI_Win_create error in 189493 is a bug I've run into when running on the Amazon cloud when using their EFA fabric provider and I'm curious if this shows up on other systems with other fabric setups. Try setting MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 in your environment (from this comment) and redoing the run from 189493; this may fix that issue. This might also fix 189495/189499.
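That is, in your environment file (or job script) before launching GCHP:

export FI_LOG_LEVEL=debug               # verbose libfabric logging for the next run
export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0   # work around the PMPI_Win_create failure seen in 189493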

joeylamcy commented 3 years ago

Thank you @WilliamDowns. Using MPIR_CVAR_CH4_OFI_ENABLE_RMA=0 allows me to complete the run using 2 nodes with 6 cores each. Also, here are the log files for the run with FI_LOG_LEVEL=debug. 190308_error.log 190308_print_out.log

Thank you very much for all the generous help everyone offers here!

joeylamcy commented 3 years ago

Sorry, I'm reopening this because I found that the output is still erroneous when I'm using 2 nodes. After some investigation, I realized that for the output nc4 file generated using 2 nodes with 6 cores on each node, the lats and lons arrays on 3 of the 6 faces are wrong. The values appear to be similar to O3 concentration values instead (I didn't check, but the order of magnitude is 10^-8). The nc4 file can be obtained from here. Running the simulation on a single node does not cause the issue.

I was originally outputting SpeciesConc on a 2x2.5 lat-lon grid, and attached are the outputs of ozone concentration at lev=1 (plotted using Panoply). The first one is simulated using 6 cores on a single node, while the second is simulated using 2 nodes with 6 cores on each node. Other run information can be found in runConfig.sh and HISTORY.rc. O3Conc_1core O3Conc_2cores

That led me to output on the original cubed-sphere grid instead, which led me to the conclusion above.

lizziel commented 3 years ago

Hi @joeylamcy, does this issue only happen when outputting to a lat-lon grid?

joeylamcy commented 3 years ago

Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on the Species_Conc collection. Attached are the outputs of ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 for simulations using 1 node and 2 nodes. ncdump_lats_2nodes.txt ncdump_lats_1node.txt

You can see that lines 2222-3661 are different, which doesn't make sense, because these should be the latitudes of the same grid.
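For reference, the comparison itself was simple (run the ncdump in each run directory, then diff the two text files):

ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 > ncdump_lats_2nodes.txt   # likewise for the 1-node run
diff ncdump_lats_1node.txt ncdump_lats_2nodes.txt | head   # the mismatch starts around line 2222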

EDIT: Instead of yes, I actually mean NO. I am using the cubed-sphere grid and the lats array is still wrong.

lizziel commented 3 years ago

Okay, I will see if I can reproduce given your lat/lon grid definition. Our standard testing currently does not include the lat/lon output option of MAPL so this very well may be a bug that went under the radar. I will report back when I have more information.

joeylamcy commented 3 years ago

Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on the Species_Conc collection. Attached are the outputs of ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 for simulations using 1 node and 2 nodes. ncdump_lats_2nodes.txt ncdump_lats_1node.txt

You can see that lines 2222-3661 are different, which doesn't make sense, because these should be the latitudes of the same grid.

I'm sorry, I actually mean NO. The issue exists even when I use the default cubed-sphere grid. Sorry about misreading your question.

LiamBindle commented 3 years ago

To clarify, it appears that even with CS output the lats coordinates have some bad parts (a bunch of near-zero values). Currently I'm seeing if I can reproduce the bad CS lats coordinates.

I'm trying to reproduce it now. I'll report back in a bit.

lizziel commented 3 years ago

Great, thanks @LiamBindle!

LiamBindle commented 3 years ago

@joeylamcy Sorry you're running into this--thank you for your patience. I suspect there's a bug somewhere that's causing this problem.

I've tried a bunch of configurations, and unfortunately I haven't been able to reproduce the problem.

Can you try running GCHP with this HISTORY.rc? Can you try it with a 1-node, 2-node, and 4-node simulation? Could you share the output for these?

Additional question: are you still using 13.0.0-alpha.9?

LiamBindle commented 3 years ago

It appears that the issue is with the coordinates, but the output data is okay. I downloaded the GCHP.SpeciesConc_CS.20160701_0030z.nc4 file which you shared above. Plotting it with its own coordinates looks bad, as you've reported:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds['lons'].isel(nf=nf).values
    y = ds['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()

image

However, if I plot SpeciesConc_O3 from your output but use lats and lons from one of my outputs, it looks okay:

import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr

# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()

ds_good_coords = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')
ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()

# Plot data
for nf in range(6):
    x = ds_good_coords['lons'].isel(nf=nf).values
    y = ds_good_coords['lats'].isel(nf=nf).values
    v = da.isel(nf=nf).values
    plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)

plt.show()

Figure_1

So it appears it's the lats and lons coordinates that are bad. If you could run the simulations I suggested above, that might help us narrow in on the problem.

joeylamcy commented 3 years ago

@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.

Additional question: are you still using 13.0.0-alpha.9?

Yes.

LiamBindle commented 3 years ago

One last thing I should note: the lats and lons coordinates aren't yet well tested. GCPy, gcgridobj, and my own plotting scripts calculate grid-box coordinates externally. This is because if you want to plot CS data you need grid-box corners, but they aren't included in the diagnostics yet (see https://github.com/geoschem/GCHPctm/issues/38). Corner coordinates will be in the diagnostics starting in 13.1, so starting in 13.1 you won't need to calculate grid-box corners in post-processing.

It looks like your 2-node simulation's diagnostics were okay, with the exception of the lats and lons coordinates. If you want to start using GCHP immediately, a temporary workaround would be to recalculate the coordinates post-simulation. That way you could start using GCHP right away, but it would obviously just be a stopgap; we definitely still need to figure out what's causing the bad coordinates. If you want to do this, let me know and I can follow up with some instructions.

@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.

Thanks, I'm looking forward to seeing the results.

joeylamcy commented 3 years ago

You can check the results on: https://mycuhk-my.sharepoint.com/:f:/g/personal/1155064480_link_cuhk_edu_hk/EpwESaXqXDlKuesfj6mhQ0wB0JgVhfh0EB1LSUd5Re_AJQ?e=0KU6Vg

EDIT: link edited.

LiamBindle commented 3 years ago

@joeylamcy I can't seem to open the link. Could you review that and let me know when I can try again?

joeylamcy commented 3 years ago

@LiamBindle My apologies. I have edited the permission settings. Please try again.