I suspect that the error in AdvCore_GridCompMod is misleading, but that is something we should fix. In `AdvCore_GridCompMod.F90`, `ntracers` is set to 11 (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86). However, this leads to a formatting error, because a later loop over `N = 1, ntracers` tries to write to a string using a single-digit integer format with `N-1` (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L260-L269); a single-digit edit descriptor can only represent 0-9, and `N-1` reaches 10 when `ntracers` is 11. The fix for that particular error is obvious - just set `ntracers` to 10 (`ntracers` doesn't seem to be a particularly important variable, and is only used to define these "test outputs").
I found that setting `ntracers=10` does fix this error and allows you to find whatever the REAL error is. @lizziel we should raise this with GMAO and kick a pull request up the chain!
Oh, right. I checked the log for the debug run again and the run actually ended much earlier than without the debug flag, so I suppose using `-DCMAKE_BUILD_TYPE=Debug` didn't help me.
It should help - you'll just need to fix the `ntracers` issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems), but those won't stop the run and can be safely ignored.
EDIT: By "fix the `ntracers` issue", I literally mean change the line `ntracers = 11` to `ntracers = 10` in `AdvCore_GridCompMod.F90` (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer.
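For reference, a minimal shell sketch of applying that one-line change (the file path is a placeholder for wherever the submodule sits in your checkout; locate it with the grep first):

# Find the offending line, then apply the one-line fix.
# (Path below is a placeholder; adjust to your source tree.)
grep -rn "ntracers *= *11" . --include=AdvCore_GridCompMod.F90
sed -i 's/ntracers = 11/ntracers = 10/' path/to/AdvCore_GridCompMod.F90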
Hi @joeylamcy, the original error you reported is occurring in MAPL History during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help home in on the problem.
Regarding the debug flags issue, I created an issue on GEOS-ESM/FVdycoreCubed_GridComp: https://github.com/GEOS-ESM/FVdycoreCubed_GridComp/issues/71.
> Hi @joeylamcy, the original error you reported is occurring in MAPL History during diagnostic write. Are you able to make it go away by turning off diagnostics? This might help home in on the problem.
Yes. If all collections in HISTORY.rc are commented out, the run continues smoothly. But turning on any number of the collections seems to cause the problem, i.e. it is not specific to any one collection.
> It should help - you'll just need to fix the `ntracers` issue first. Once that's dealt with, it should help you to find the actual issue. It will also generate a lot of warnings (many to do with array temporaries, which aren't genuine problems), but those won't stop the run and can be safely ignored.
> EDIT: By "fix the `ntracers` issue", I literally mean change the line `ntracers = 11` to `ntracers = 10` in `AdvCore_GridCompMod.F90` (https://github.com/geoschem/FVdycoreCubed_GridComp/blob/83da4661d62a4d19648a90e11f9ae70b8b38a56d/AdvCore_GridCompMod.F90#L86)! Realized I should have been clearer.
Actually I tried it, but there are some further issues. The printout is still stuck at
NOTE from PE 0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE 0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE 0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE 0: tracer_manager_init : No tracers are available to be registered.
NOTE from PE 0: tracer_manager_init : No tracers are available to be registered.
ncnst= 0 num_prog= 0 pnats= 0 dnats= 0 num_family= 0
Grid distance at face edge (km)= 163384.217664128
and the error log grows at a rate of ~100 MB/min for at least 5 minutes, so I just manually stopped the run. The leading error is still in `AdvCore_GridCompMod.F90`:
forrtl: warning (406): fort: (1): In call to MPI_GROUP_INCL, an array temporary was created for argument #3
Image PC Routine Line Source
geos.debug 00000000094A5440 Unknown Unknown Unknown
geos.debug 0000000005A6D950 mpp_mod_mp_get_pe 109 mpp_util_mpi.inc
geos.debug 0000000005A9EE8F mpp_mod_mp_mpp_in 55 mpp_comm_mpi.inc
geos.debug 0000000004A0B3A6 fms_mod_mp_fms_in 342 fms.F90
geos.debug 000000000226E3A3 advcore_gridcompm 311 AdvCore_GridCompMod.F90
geos.debug 0000000007F00A0D Unknown Unknown Unknown
geos.debug 0000000007F0470B Unknown Unknown Unknown
geos.debug 00000000083BF095 Unknown Unknown Unknown
geos.debug 0000000007F0219A Unknown Unknown Unknown
geos.debug 0000000007F01D4E Unknown Unknown Unknown
geos.debug 0000000007F01A85 Unknown Unknown Unknown
geos.debug 0000000007EE1304 Unknown Unknown Unknown
geos.debug 0000000006827DDA mapl_genericmod_m 4545 MAPL_Generic.F90
geos.debug 0000000006829035 mapl_genericmod_m 4580 MAPL_Generic.F90
geos.debug 0000000000425200 gchp_gridcompmod_ 138 GCHP_GridCompMod.F90
geos.debug 0000000007F00A0D Unknown Unknown Unknown
geos.debug 0000000007F0470B Unknown Unknown Unknown
geos.debug 00000000083BF095 Unknown Unknown Unknown
geos.debug 0000000007F0219A Unknown Unknown Unknown
geos.debug 0000000007F01D4E Unknown Unknown Unknown
geos.debug 0000000007F01A85 Unknown Unknown Unknown
geos.debug 0000000007EE1304 Unknown Unknown Unknown
geos.debug 0000000006827DDA mapl_genericmod_m 4545 MAPL_Generic.F90
geos.debug 0000000006A52D6C mapl_capgridcompm 482 MAPL_CapGridComp.F90
geos.debug 0000000007F00B39 Unknown Unknown Unknown
geos.debug 0000000007F0470B Unknown Unknown Unknown
geos.debug 00000000083BF095 Unknown Unknown Unknown
geos.debug 0000000007F0219A Unknown Unknown Unknown
geos.debug 000000000844804D Unknown Unknown Unknown
geos.debug 0000000007EE2A0F Unknown Unknown Unknown
geos.debug 0000000006A67F42 mapl_capgridcompm 848 MAPL_CapGridComp.F90
geos.debug 0000000006A39B5E mapl_capmod_mp_ru 321 MAPL_Cap.F90
geos.debug 0000000006A370A7 mapl_capmod_mp_ru 198 MAPL_Cap.F90
geos.debug 0000000006A344ED mapl_capmod_mp_ru 157 MAPL_Cap.F90
geos.debug 0000000006A32B5F mapl_capmod_mp_ru 131 MAPL_Cap.F90
geos.debug 00000000004242FF MAIN__ 29 GCHPctm.F90
geos.debug 000000000042125E Unknown Unknown Unknown
libc-2.17.so 00002B9C8AD6A505 __libc_start_main Unknown Unknown
geos.debug 0000000000421169 Unknown Unknown Unknown
The array temporary warnings are irrelevant - given enough time, the code should still reach the actual error - but I agree that it's not helpful to have them padding the error log. They also slow the run down considerably, so although the printout appears stuck, it should eventually clear.
@LiamBindle - can you recommend a preferred way to suppress array temporary warnings in FV3 using CMake? I can imagine that one could do this by editing the contents of `ESMA_cmake`, but that seems non-ideal.
@joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).
1. Can you verify that your copy of `MAPL_HistoryGridComp.F90` also has `call o_Clients%done_collective_stage()` on line 3570? That will give us a thread to tug on with GMAO.
2. Can you post the output of `ifort --version`, `nc-config --all`, and `nf-config --all`? It seems like something is going amiss deep in NetCDF.
3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run `ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4`)?
I noticed you are running a c48 standard simulation with 6 cores and 3G per core across 1 node, if the log file prints are to be trusted. It surprises me that the simulation ran without running out of memory. You can try upping your resources and lowering your resolution to c24 to see if that makes a difference at all for the diagnostics.
Also try commenting out individual collections to see if there is a specific history collection consistently causing the problem.
Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.
I believe those temporary array warnings can be suppressed with `-check,noarg_temp_created`.
Unfortunately, as you suspected @sdeastham, I think manually adding `-check,noarg_temp_created` to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this more cleanly, but I'll leave that for another thread.
Let me know if you run into any problems suppressing those temporary array warnings @joeylamcy!
I am going to put this update into the GCHPctm 13.00-alpha.10 pre-release.
@lizziel I think you can do `"SHELL:-check noarg_temp_created"` to get it to work for ifort 18 and 19, if ifort 19 doesn't like the comma.
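If editing ESMA_cmake is undesirable, one possible configure-time alternative is seeding the flag through the FFLAGS environment variable, which CMake folds into its initial Fortran flags. This is an untested sketch; ESMA_cmake may still override or reorder the flags it composes:

# Untested sketch: seed the extra flag via the environment at
# configure time (CMake reads FFLAGS into CMAKE_Fortran_FLAGS).
export FFLAGS="-check noarg_temp_created"
cmake .. -DCMAKE_BUILD_TYPE=Debug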
Following up about the original issue, we have another report of a similar divide by zero floating point error while writing diagnostics:
forrtl: error (73): floating divide by zero
Image PC Routine Line Source
geos 0000000001FBCA6F Unknown Unknown Unknown
libpthread-2.17.s 00002AD70A4335D0 Unknown Unknown Unknown
libnetcdf.so.13.0 00002AD706AE6A14 Unknown Unknown Unknown
libnetcdf.so.13.0 00002AD706AE4B4B NC4_def_var Unknown Unknown
libnetcdf.so.13.0 00002AD706A10B5B nc_def_var Unknown Unknown
libnetcdff.so.6.1 00002AD706524DB4 nf_def_var_ Unknown Unknown
geos 0000000001B0E765 m_netcdf_io_defin 218 m_netcdf_io_define.F90
geos 0000000001B62855 ncdf_mod_mp_nc_va 3866 ncdf_mod.F90
geos 000000000187B19E history_netcdf_mo 465 history_netcdf_mod.F90
geos 0000000001876EB6 history_mod_mp_hi 2925 history_mod.F90
geos 0000000000412C17 MAIN__ 2076 main.F90
geos 000000000040C4DE Unknown Unknown Unknown
libc-2.17.so 00002AD70A8663D5 __libc_start_main Unknown Unknown
geos 000000000040C3E9 Unknown Unknown Unknown
This was using an older version of GCHPctm. It was fixed by switching to a different set of libraries, including netcdf. Try homing in on @sdeastham's suggestion:
> Can you post the output of `ifort --version`, `nc-config --all`, and `nf-config --all`? It seems like something is going amiss deep in NetCDF.
I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?
> @joeylamcy That all having been said, I took a more detailed look at your earlier error log to see if it can provide any more information. The line with the div-by-zero is.. unexpected (https://github.com/geoschem/MAPL/blob/fca3b3381515e2c0473ae2268f51130fe18909ff/base/MAPL_HistoryGridComp.F90#L3570).
> 1. Can you verify that your copy of `MAPL_HistoryGridComp.F90` also has `call o_Clients%done_collective_stage()` on line 3570? That will give us a thread to tug on with GMAO.
Yes.
> 2. Can you post the output of `ifort --version`, `nc-config --all`, and `nf-config --all`? It seems like something is going amiss deep in NetCDF.
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ifort --version
ifort (IFORT) 18.0.2 20180210
Copyright (C) 1985-2018 Intel Corporation. All rights reserved.
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nc-config --all
This netCDF 4.6.1 has been built with the following features:
--cc -> icc
--cflags -> -I/opt/share/netcdf-4.6.1/include
--libs -> -L/opt/share/netcdf-4.6.1/lib -lnetcdf
--has-c++ -> no
--cxx ->
--has-c++4 -> no
--cxx4 ->
--has-fortran-> yes
--fc -> ifort
--fflags -> -I/opt/share/netcdf-4.6.1/include
--flibs -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf
--has-f90 -> no
--has-f03 -> yes
--has-dap -> yes
--has-dap4 -> yes
--has-nc2 -> yes
--has-nc4 -> yes
--has-hdf5 -> yes
--has-hdf4 -> no
--has-logging-> no
--has-pnetcdf-> no
--has-szlib -> no
--has-parallel -> no
--has-cdf5 -> yes
--prefix -> /opt/share/netcdf-4.6.1
--includedir-> /opt/share/netcdf-4.6.1/include
--libdir -> /opt/share/netcdf-4.6.1/lib
--version -> netCDF 4.6.1
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ nf-config --all
This netCDF-Fortran 4.4.4 has been built with the following features:
--cc -> icc
--cflags -> -I/opt/share/netcdf-4.6.1/include
--fc -> ifort
--fflags -> -I/opt/share/netcdf-4.6.1/include
--flibs -> -L/opt/share/netcdf-4.6.1/lib -lnetcdff -L/opt/share/hdf5-1.10.2/lib -L/opt/share/zlib-1.2.11/lib -L/opt/share/curl-7.59.0/lib -L/opt/share/netcdf-4.6.1/lib -lnetcdf -lnetcdf
--has-f90 -> no
--has-f03 -> yes
--has-nc2 -> yes
--has-nc4 -> yes
--prefix -> /opt/share/netcdf-4.6.1
--includedir-> /opt/share/netcdf-4.6.1/include
--version -> netCDF-Fortran 4.4.4
side note: During `cmake ..`, there is an error saying that hdf5 is missing, so I manually ran `export CMAKE_PREFIX_PATH=/opt/share/hdf5-1.10.2`. If this matters, feel free to let me know and I can try reproducing it.
> 3. Are any output files generated in your OutputDir directory? If so, are they valid (i.e. what happens if you run `ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4`)?
HDF errors, and the file sizes are obviously wrong too. Meanwhile, ncdump-ing the MERRA-2 data works fine.
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.DryDep.20160701_0030z.nc4
ncdump: OutputDir/GCHP.DryDep.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ncdump -h OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4
ncdump: OutputDir/GCHP.SpeciesConc.20160701_0030z.nc4: NetCDF: HDF error
[s1155064480@chpc-login01 gchp_13.0.0_standard_MERRA2]$ ls -lh OutputDir/
total 12K
-rw-r--r--. 1 s1155064480 AmosTai 23 Aug 27 20:09 FILLER
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep 2 17:18 GCHP.DryDep.20160701_0030z.nc4
-rw-r--r--. 1 s1155064480 AmosTai 96 Sep 2 17:18 GCHP.SpeciesConc.20160701_0030z.nc4
> Finally, you are outputting hourly diagnostics daily. I have not tested the case of frequency and duration not being equal in the latest MAPL update. Try setting diagnostic frequency to 24 (or duration to 1-hr and your run start/end/duration to 1-hr as well) in runConfig.sh and see if that changes anything.
I changed duration to 1-hr and run start/end/duration to 1-hr as well, but the same floating divide by zero error occurs.
> I also wonder if you were able to get this set of libraries you are using to work with an older version of GCHPctm, and if yes, which one?
I didn't try other alpha versions. But we used them when building 12.8.2 of the old GCHP.
> I believe those temporary array warnings can be suppressed with `-check,noarg_temp_created`.
> Unfortunately, as you suspected @sdeastham, I think manually adding `-check,noarg_temp_created` to this line is going to be the easiest option. It isn't ideal, but it should work. We could discuss options for doing this more cleanly, but I'll leave that for another thread. Let me know if you run into any problems suppressing those temporary array warnings @joeylamcy!
Yep, it works and only a few warnings are left before the error messages. Now I get the same errors as @lizziel did in https://github.com/GEOS-ESM/ESMA_cmake/issues/125#issuecomment-685043199
I made a fix for the errors you are now getting with debug flags on. See https://github.com/GEOS-ESM/FVdycoreCubed_GridComp/issues/71#.
I'm very suspicious about the issue with HDF5 during cmake; @LiamBindle , any thoughts?
> I made a fix for the errors you are now getting with debug flags on. See GEOS-ESM/FVdycoreCubed_GridComp#71.
I'm getting those errors (from GetPointer.H and MAPL_Generic.F90) after the fix though.
Did you also move the conditional for `N <= ntracers`? That solved it for me. Regardless, I now get past advection and am getting a new error in History. This is a problem in the GMAO MAPL library. I think it is safe to say that using debug flags in GCHP is not yet fully working. I am working with GMAO to get fixes for the bugs I am finding into their code.
I agree with @sdeastham that the focus for your issue should be on the netcdf/HDF5 library. Could you post your environment file, CMakeCache.txt, CMakeFiles/CMakeError.log, and CMakeFiles/CMakeOutput.log?
We also now have documentation on how to build libraries for GCHPctm on Spack. We are looking for beta users to try it out. Are you interested in trying this out? It may solve the issue.
I'm a bit surprised it didn't pick up HDF5 automatically considering it picked up NetCDF automatically, but @joeylamcy did the correct thing in pointing CMake to the appropriate HDF5 library with `CMAKE_PREFIX_PATH`.
The fact it's crashing in an `nc_def_var` call (deep in HISTORY) after writing 96 bytes suggests, to me, that it's something obscure to do with NetCDF. The fact the simulation runs okay when output collections are turned off supports that too. It looks like the checkpoint file is being written okay, so it isn't consistent. I would agree with the suggestions to focus on the netcdf/HDF5 libraries.
gchp_13.0.0.env.txt CMakeCache.txt CMakeError.log CMakeOutput.log
netcdf-4.6.1 is sourced upon login. The shell script is as follows:
export NETCDF=/opt/share/netcdf-4.6.1
export PATH=$NETCDF/bin:$PATH
export LD_LIBRARY_PATH=$NETCDF/lib:$LD_LIBRARY_PATH
export INCLUDE=$NETCDF/include/:$INCLUDE
> We also now have documentation on how to build libraries for GCHPctm on Spack. We are looking for beta users to try it out. Are you interested in trying this out? It may solve the issue.
It looks promising. Does spack need root access?
Spack does not require root access. Those instructions should be fine for getting setup with OpenMPI and GNU compilers; Intel MPI and/or Intel compilers also work but require a bit more setup that we haven't written out yet on the Wiki. You also won't need to manually define as many environment variables when loading NetCDF through Spack / other package managers. I've pasted a working environment file below (change SPACK_ROOT and ESMF_DIR as needed):
spack unload
export SPACK_ROOT=/path/to/spack
. $SPACK_ROOT/share/spack/setup-env.sh
spack load emacs
#==============================================================================
# %%%%% Load Spackages %%%%%
#==============================================================================
spack load gcc@9.3.0
spack load git%gcc@9.3.0
spack load cmake%gcc@9.3.0
spack load openmpi%gcc@9.3.0
spack load netcdf-fortran%gcc@9.3.0^openmpi
export MPI_ROOT=$(spack location -i openmpi)
# Make all files world-readable by default
umask 022
# Specify compilers
export CC=gcc
export CXX=g++
export FC=gfortran
# For ESMF
export ESMF_COMPILER=gfortran
export ESMF_COMM=openmpi
export ESMF_DIR=/path/to/ESMF
export ESMF_INSTALL_PREFIX=${ESMF_DIR}/INSTALL_openmpi_gfortran93
# For GCHP
export ESMF_ROOT=${ESMF_INSTALL_PREFIX}
#==============================================================================
# Set limits
#==============================================================================
#ulimit -c 0 # coredumpsize
export OMP_STACKSIZE=500m
ulimit -l unlimited # memorylocked
ulimit -u 50000 # maxproc
ulimit -v unlimited # vmemoryuse
ulimit -s unlimited # stacksize
#==============================================================================
# Print information
#==============================================================================
#module list
echo ""
echo "Environment:"
echo ""
echo "CC: ${CC}"
echo "CXX: ${CXX}"
echo "FC: ${FC}"
echo "ESMF_COMM: ${ESMF_COMM}"
echo "ESMF_COMPILER: ${ESMF_COMPILER}"
echo "ESMF_DIR: ${ESMF_DIR}"
echo "ESMF_INSTALL_PREFIX: ${ESMF_INSTALL_PREFIX}"
echo "ESMF_ROOT: ${ESMF_ROOT}"
echo "MPI_ROOT: ${MPI_ROOT}"
echo "NetCDF C: $(nc-config --prefix)"
#echo "NetCDF Fortran: $(nf-config --prefix)"
echo ""
echo "Done sourcing ${BASH_SOURCE[0]}"
I noticed in one of your outputs it lists this as your netcdf-fortran:
--version -> netCDF-Fortran 4.4.4
The user who had the same issue as you was actually using GEOS-Chem Classic. But he found this:
> Update is that the simulation appears to have successfully finished using Lizzie's new environment file. So I guess the old environment file I used to use with GEOS-Chem classic no longer works with version 12.9.3. My old environment file used netcdf-fortran/4.4.4-fasrc06 and yours uses netcdf-fortran/4.5.2-fasrc01
We definitely would love for you to try spack. Another route, however, is to see if you can get a newer netcdf-fortran version since at least one other person had an issue with 4.4.4 starting with GEOS-Chem 12.9.
I see there is a newer netcdf-fortran version available. Do I also need to rebuild ESMF?
EDIT: Sorry, I'm not actually sure if you need to rebuild ESMF specifically when changing NetCDF-Fortran libraries. The GCST will likely be away from this thread until Tuesday, so if you run into any more issues, a rebuild of ESMF might help.
@joeylamcy @WilliamDowns Yeah, you'll need to rebuild ESMF if you change NetCDF versions
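For anyone following along, a rough sketch of that rebuild, assuming the ESMF_* variables from an environment file like the one above are already exported:

# Rebuild ESMF against the new NetCDF (sketch; assumes ESMF_DIR,
# ESMF_COMPILER, ESMF_COMM, and ESMF_INSTALL_PREFIX are set as above)
cd $ESMF_DIR
make distclean   # drop objects built against the old libraries
make -j
make install     # installs under $ESMF_INSTALL_PREFIX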
Just want to post an update: I am able to finish a trial run with proper output using Intel compilers 19.0.4, Intel MPI, netcdf-c 4.7.1 and netcdf-fortran 4.5.2. However, I have not succeeded with any multi-node runs. Do you have any tested configurations of core counts and memory usage? Or perhaps any tips on multi-node runs in general?
So far with Intel MPI I've successfully done a test at c90 with the following settings in a slurm script, using `mpirun` instead of `srun` (I'm getting errors with `srun` that need to be sorted out):
#SBATCH -n 360
#SBATCH -N 12
#SBATCH --exclusive
#SBATCH -t 0-03:00
#SBATCH --mem=MaxMemPerNode
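The matching launch line inside the job script would be something like this (illustrative; `geos` is the executable name used in GCHPctm run directories):

# Launch across the allocated nodes with mpirun rather than srun
mpirun -np 360 ./geos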
A 1 week run takes about 2 hours in this setup. I need to test with other setting configurations including lowering memory allocation. This is also with gfortran 9.3 rather than ifort. What sorts of issues are you running into with your multi-node runs?
I've now also successfully used `srun` without crashing (unclear if the run will complete in its allotted time), but you might find it finicky when trying to specify a PMI version that matches your cluster's Slurm setup (I cannot get Intel MPI to tolerate using PMIx, but it works fine with PMI2 on Harvard's Cannon cluster). To use srun, you'll need to set an extra environment variable, `I_MPI_PMI_LIBRARY`, to point to the PMI library used by Slurm, and then specify the corresponding PMI version in your call to `srun`. For example, to use PMI2 I set `export I_MPI_PMI_LIBRARY=/path/to/libpmi2.so` in my environment file and add `--mpi=pmi2` to my `srun` call.
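Putting that together, a minimal sketch (the PMI2 library path is site-specific; `srun --mpi=list` shows which PMI plugins your Slurm supports):

# Intel MPI + Slurm PMI2 setup (library path varies by site)
export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi2.so
srun --mpi=pmi2 -n 360 ./geos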
> So far with Intel MPI I've successfully done a test at c90 with the following settings in a slurm script, using `mpirun` instead of `srun` (getting errors with `srun` that need to be sorted out):
> #SBATCH -n 360
> #SBATCH -N 12
> #SBATCH --exclusive
> #SBATCH -t 0-03:00
> #SBATCH --mem=MaxMemPerNode
> A 1 week run takes about 2 hours in this setup. I need to test with other setting configurations including lowering memory allocation. This is also with gfortran 9.3 rather than ifort. What sorts of issues are you running into with your multi-node runs?
Hmm, I don't think I will ever get 360 cores. And how much memory per node is this?
Hi Joey,
Have you tried a C48 simulation on 2 nodes? If not, I'd recommend trying a simulation like that. If you have and it failed, could you share the run log?
A C48 simulation should have pretty similar resource requirements to a 2x2.5 GEOS-Chem Classic simulation. A C48 simulation can run on a single node, but trying it on two is a good way to test/try multinode simulations. Does your cluster run SLURM or is it a different scheduler? If it's SLURM then the example run scripts in the run directory might be useful to look through to see how they work.
If you aren't using SLURM, that's okay - our cluster here at WashU runs LSF, for example. Here's an example of an LSF job for a C48 simulation I ran a few weeks back.
#!/usr/bin/bash
#BSUB -q general
#BSUB -n 60
#BSUB -W 336:00
#BSUB -R "rusage[mem=100000] span[ptile=30] select[mem < 2000000]"
#BSUB -a 'docker(registry.gsc.wustl.edu/sleong/base-engineering-gcc)'
#BSUB -o lsf-run-%J-output.txt
# Source bashrc
. /etc/bashrc
# Set up runtime environment
set -x # Print executed commands
set -e # Exit immediately if a command fails
ulimit -c 0 # coredumpsize
ulimit -l unlimited # memorylocked
ulimit -u 50000 # maxproc
ulimit -v unlimited # vmemoryuse
ulimit -s unlimited # stacksize
# Execute simulation
rm -f cap_restart gcchem*
chmod +x runConfig.sh geos
./runConfig.sh
export TMPDIR="$__LSF_JOB_TMPDIR__"
mpirun -x LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH -np 24 ./geos
Note that this job asks for 60 cores across 2 nodes (ptile=30 means 30 cores per node), and 100 GB of memory per node. According to the post-job stats, the average memory usage for this simulation was ~90 GB (total) and its peak usage was ~125 GB. I'm not sure I'd trust those numbers too precisely, but that might give you a rough idea.
Also note that I'm using OpenMPI with GNU compilers here. When I use Intel MPI and ifort I don't need to set `TMPDIR` or `LD_LIBRARY_PATH`. Starting with a C48 simulation on 2 nodes should help you narrow in on any MPI-specific configuration settings that you might need.
While every attempt at running a c48 simulation on 2 nodes has failed, the results seem diverse and inconsistent. Firstly, on a single node, I used 6 cores, specified `--mem=50G`, and the run (JobID: 189464) finished properly. I then tried to use 2 nodes with 6 cores and 50G memory on each node, but the run (JobID: 189493) fails from the beginning. Then I tried to use 2 nodes with 6 cores and maximum memory (192G) per node, and the two runs (JobID: 189495 & 189499) fail during output, with different error messages.
189499_print_out.log
189499_error.log
189495_print_out.log
189495_error.log
189493_print_out.log
189493_error.log
189464_print_out.log
It looks like this is most likely something to do with the MPI configuration--@lizziel @WilliamDowns do you have any ideas?
Out of curiosity, could you set `FI_LOG_LEVEL=debug` in your environment setup and post logs from your next run? The `PMPI_Win_create` error in 189493 is a bug I've run into when running on the Amazon cloud using their EFA fabric provider, and I'm curious if this shows up on other systems with other fabric setups. Try setting `MPIR_CVAR_CH4_OFI_ENABLE_RMA=0` in your environment (from this comment) and redoing the run from 189493; this may fix that issue. It might also fix 189495/189499.
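In other words, something like the following in the environment file before relaunching (both variables take effect at runtime, so no rebuild is needed):

export MPIR_CVAR_CH4_OFI_ENABLE_RMA=0   # work around the PMPI_Win_create failure
export FI_LOG_LEVEL=debug               # verbose libfabric logging for diagnosis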
Thank you @WilliamDowns. Using `MPIR_CVAR_CH4_OFI_ENABLE_RMA=0` allows me to complete the run using 2 nodes with 6 cores each. Also, here are the log files for the run with `FI_LOG_LEVEL=debug`.
190308_error.log
190308_print_out.log
Thank you very much for all the generous help everyone offers here!
Sorry, I'm reopening this because I found that the output is still erroneous when I'm using 2 nodes. After some investigation, I realized that for the output nc4 file generated using 2 nodes with 6 cores on each node, the `lats` and `lons` arrays on 3 of the 6 faces are wrong. The values appear to be similar to O3 concentration values instead (I didn't check, but the order of magnitude is 10^-8). The nc4 file can be obtained from here. Using a single node to run the simulation does not cause the issue.
I was originally outputting SpeciesConc on a 2x2.5 lat-lon grid; attached are the outputs of ozone concentration at lev=1 (plotted using Panoply). The first one is simulated using 6 cores on a single node while the second is simulated using 2 nodes with 6 cores on each node. Other run information can be found in runConfig.sh and HISTORY.rc.
That led me to output on the original cubed-sphere grid, which in turn led me to the above conclusion.
Hi @joeylamcy, does this issue only happen when outputting to a lat-lon grid?
Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on the `Species_Conc` collection. Attached are the outputs of `ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4` for simulations using 1 node and 2 nodes.
ncdump_lats_2nodes.txt
ncdump_lats_1node.txt
You can see that lines 2222-3661 are different, but that doesn't make sense, because these should be the latitudes of the same grid.
EDIT: Instead of yes, I actually mean NO. I am using the cubed-sphere grid and the `lats` array is still wrong.
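For reference, a quick way to isolate the differing span between two such dumps (the run directories are hypothetical; the text file names follow the attachments above):

# Dump the lats coordinate from each run and diff the text
ncdump -v lats run_1node/OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 > ncdump_lats_1node.txt
ncdump -v lats run_2nodes/OutputDir/GCHP.Species_Conc.20160701_0030z.nc4 > ncdump_lats_2nodes.txt
diff ncdump_lats_1node.txt ncdump_lats_2nodes.txt | head -n 20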
Okay, I will see if I can reproduce given your lat/lon grid definition. Our standard testing currently does not include the lat/lon output option of MAPL so this very well may be a bug that went under the radar. I will report back when I have more information.
> Yes. I reverted the changes in HISTORY.rc (i.e. using the default output grid in c48 simulations) and only turned on the `Species_Conc` collection. Attached are the outputs of `ncdump -v lats OutputDir/GCHP.Species_Conc.20160701_0030z.nc4` for simulations using 1 node and 2 nodes. ncdump_lats_2nodes.txt ncdump_lats_1node.txt You can see that lines 2222-3661 are different, but it doesn't make sense, because these should be the latitudes of the same grid.
I'm sorry. I actually meant NO. The issue exists even when I used the default cubed-sphere grid. Sorry about misreading your question.
To clarify, it appears that even with CS output the `lats` coordinates have some bad parts (a bunch of near-zeros). Currently I'm seeing if I can reproduce the bad CS `lats` coordinates.
I'm trying a few configurations now; I'll report back in a bit.
Great, thanks @LiamBindle!
@joeylamcy Sorry you're running into this--thank you for your patience. I suspect there's a bug somewhere that's causing this problem.
I've tried a bunch of configurations, and unfortunately I haven't been able to reproduce the problem.
Can you try running GCHP with this HISTORY.rc? Can you try this with a 1 node, 2 node, and 4 node simulation? Could you share the output for these?
Additional question: are you still using 13.0.0-alpha.9?
It appears that the issue is with the coordinates, but the output data is okay. I downloaded `GCHP.SpeciesConc_CS.20160701_0030z.nc4`, which you shared above. Plotting it with its own coordinates is bad, as you've reported:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr
# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()
ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()
# Plot data
for nf in range(6):
x = ds['lons'].isel(nf=nf).values
y = ds['lats'].isel(nf=nf).values
v = da.isel(nf=nf).values
plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)
plt.show()
However, if I plot SpeciesConc_O3 from your output but use the `lats` and `lons` from one of my outputs, it looks okay:
import matplotlib.pyplot as plt
import cartopy.crs as ccrs # cartopy > 0.18
import xarray as xr
# Set up GeoAxes
ax = plt.axes(projection=ccrs.EqualEarth())
ax.set_global()
ax.coastlines()
ds_good_coords = xr.open_dataset('GCHP.MyTestCollectionNative.20160701_0030z.nc4')
ds = xr.open_dataset('GCHP.SpeciesConc_CS.20160701_0030z.nc4')
da = ds['SpeciesConc_O3'].isel(lev=0).squeeze()
vmin = da.quantile(0.02).item()
vmax = da.quantile(0.98).item()
# Plot data
for nf in range(6):
x = ds_good_coords['lons'].isel(nf=nf).values
y = ds_good_coords['lats'].isel(nf=nf).values
v = da.isel(nf=nf).values
plt.pcolormesh(x, y, v, transform=ccrs.PlateCarree(), vmin=vmin, vmax=vmax)
plt.show()
So it appears it's the `lats` and `lons` coordinates that are bad. If you could run the simulations I suggested above, that might help us narrow in on the problem.
@LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.
> Additional question: are you still using 13.0.0-alpha.9?
Yes.
One last thing I should note. The `lats` and `lons` coordinates aren't yet well tested. GCPy, gcgridobj, and my own plotting scripts calculate grid-box coordinates externally. This is because if you want to plot CS data you need grid-box corners, but they aren't included in the diagnostics yet (see https://github.com/geoschem/GCHPctm/issues/38). Corner coordinates will be in the diagnostics starting in 13.1, so from then on you won't need to calculate grid-box corners in post-processing.
It looks like your 2 node simulation's diagnostics were okay, with the exception of the `lats` and `lons` coordinates. If you want to start using GCHP immediately, a temporary workaround would be recalculating the coordinates post-simulation. That way you could start using GCHP right away, but this obviously would just be a temporary workaround. We definitely still need to figure out what's causing the bad coordinates. If you want to do this, let me know and I can follow up with some instructions.
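In the meantime, a minimal sketch of that workaround, assuming NCO's `ncks` is available and that a 1-node run on the same grid produced good coordinates (file names are placeholders):

# Overwrite the bad coordinate variables in the 2-node output with
# known-good ones from a 1-node run (-A appends/overwrites in place)
ncks -A -v lats,lons good_1node_output.nc4 bad_2node_output.nc4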
> @LiamBindle Thank you for your prompt replies. I will post the test results tomorrow if our cluster is not too crowded.
Thanks, I'm looking forward to seeing the results.
You can check the results on: https://mycuhk-my.sharepoint.com/:f:/g/personal/1155064480_link_cuhk_edu_hk/EpwESaXqXDlKuesfj6mhQ0wB0JgVhfh0EB1LSUd5Re_AJQ?e=0KU6Vg
EDIT: link edited.
@joeylamcy I can't seem to open the link. Could you review that and let me know when I can try again?
@LiamBindle My apologies. I have edited the permission settings. Please try again.
Hi everyone,
I'm trying to run a 30-core 1-day trial simulation with the 13.0.0-alpha.9 version, but the run ended after ~1 simulation hour and exited with `forrtl: error (73): floating divide by zero`. The full log files are attached below: 163214_print_out.log 163214_error.log
More information:
ESMF_COMM=intelmpi
I'm not sure how to troubleshoot this issue. I tried to cmake the source code with `-DCMAKE_BUILD_TYPE=Debug` (with the fix in #35) and rerun the simulation, but it gives a really large error log file so I'm not attaching it here. The first few lines of the error log are:
I also noticed something weird towards the start of the run:
Previous versions (12.8.2) usually show this instead:
but I'm not sure if that matters.