Rui forwarded this info:
I built the access-om2 model with OpenMPI v1, v2, v3 and v4 + the Intel 2019 compiler, and they all work fine for the 1deg and 0.25deg examples. However, the two 0.1deg examples, i.e. 01deg_jra55_iaf and 01deg_jra55_ryf, crashed for all builds, including the original one with OpenMPI 1.10.2 + the Intel 2017 compiler.
With Andrew's advice, the above issue has been fixed by changing ice_ocean_timestep from 600 to 300.
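For reference, a minimal sketch of how to confirm that change took effect, assuming the coupling timestep is set in accessom2.nml in the run's control directory (that filename is an assumption about the standard ACCESS-OM2 layout; check your own config):

```bash
# Assumption: the coupling timestep lives in accessom2.nml in the control
# directory. Halving it from 600 s to 300 s is the change described above.
grep -n 'ice_ocean_timestep' accessom2.nml
# expected after the edit:
#   ice_ocean_timestep = 300
```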
Another issue with 01deg_jra55_iaf: it simply stopped after reading zbgc_nml, and the warning message in access-om2.err is "ice: Input nprocs not same as system request".
bash-4.1$ more access-om2.out
YATM_COMMIT_HASH=b6caeab4bdc1dcab88847d421c6e5250c7e70a2c
matmxx: LIBACCESSOM2_COMMIT_HASH=b6caeab4bdc1dcab88847d421c6e5250c7e70a2c
NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 32768.
&MPP_IO_NML
HEADER_BUFFER_VAL = 16384, GLOBAL_FIELD_ON_ROOT_PE = T, IO_CLOCKS_ON = F, SHUFFLE = 1, DEFLATE_LEVEL = 5
/
NOTE from PE 0: MPP_IO_SET_STACK_SIZE: stack size set to 131072.
NOTE from PE 0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to 115200.
======== MODEL BEING DRIVEN BY OCEAN_SOLO_MOD ========
&OCEAN_SOLO_NML
N_MASK = 0, LAYOUT_MASK = 20, MASK_LIST = 40960, RESTART_INTERVAL = 6*0, DEBUG_THIS_MODULE = F, ACCESSOM2_CONFIG_DIR = ../
/
mom5xx: LIBACCESSOM2_COMMIT_HASH=b6caeab4bdc1dcab88847d421c6e5250c7e70a2c
Reading setup_nml
Reading grid_nml
Reading tracer_nml
Reading thermo_nml
Reading dynamics_nml
Reading shortwave_nml
Reading ponds_nml
Reading forcing_nml
NOTE from PE 0: diag_manager_mod::diag_manager_init: prepend_date only supported when diag_manager_init is called with time_init present.
Diagnostic output will be in file ice_diag.d
MPI_ABORT was invoked on rank 5744 in communicator MPI_COMM_WORLD with errorcode 1.
Any advice on fixing this issue? Thanks.
You'll need to change nprocs in ice/cice_in.nml to match ncpus under name: ice in config.yaml.
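For example, a quick way to confirm the two values agree before submitting (a sketch only; paths are relative to the 01deg control directory):

```bash
# The CICE PE count in cice_in.nml must match the ncpus requested for the
# "ice" submodel in config.yaml; print both for a manual comparison.
grep -n 'nprocs' ice/cice_in.nml
grep -n -A 3 'name: ice' config.yaml | grep 'ncpus'
```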
This is set up correctly in https://github.com/COSIMA/01deg_jra55_iaf - are you using something different?
I executed the example from the cosima-om2 repository: https://github.com/OceansAus/access-om2.git
By comparing it with the one from https://github.com/COSIMA/01deg_jra55_iaf I can see some differences:
> queue: normal
6c6
< ncpus: 5180
> ncpus: 5968
30c31
< ncpus: 799
> ncpus: 1392
> , ndtd = 3
50c50
< nprocs = 799
> nprocs = 1600
52c52
< , distribution_type = 'sectrobin'
> , distribution_type = 'roundrobin'
177d176
< , highfreq = .true.
So the example in the cosima-om2 package uses sandybridge nodes and specifies 1392 cores for ice in config.yaml but 1600 in cice_in.nml.
The case from https://github.com/COSIMA/01deg_jra55_iaf uses broadwell nodes, with a consistent 799 cores for cice in both config.yaml and cice_in.nml.
I will try the example from https://github.com/COSIMA/01deg_jra55_iaf; it seems the example linked from https://github.com/OceansAus/access-om2.git needs to be updated.
Yes, I'm in the process of updating everything in https://github.com/OceansAus/access-om2, including these control dirs. If you want to try the very latest (bleeding edge) config, use branch ak-dev on https://github.com/COSIMA/01deg_jra55_iaf.
The example from https://github.com/COSIMA/01deg_jra55_iaf works for all builds using OpenMPI v1, v2, v3 and v4, as the number of CPU cores used for cice is consistent between config.yaml and cice_in.nml. Again, I also needed to change the original value of ice_ocean_timestep from 450 to 300 to avoid the errors that occurred for 01deg_jra55_ryf.
A comment from #127 in December: @marshallward has found that the model runs reliably and efficiently when built with OpenMPI 3.0.3 using the Intel 19 compiler (since the .mod files are not compatible with Intel 18).
I gather we may also need to migrate to a newer netcdf library on gadi. I suppose something in the latest 4.6.x series would be most future-proof. There's some discussion here: https://github.com/COSIMA/libaccessom2/issues/24
@aekiss NetCDF 4.7.0 has been out for a while now (since the start of May this year), so I'd suggest looking into using that one...
Thanks, but 4.6.1 seems to be the newest module on raijin.
If you want the new version, just send an e-mail to the helpdesk and someone will install it for you :-)
Note that modules are loaded in numerous places, which would all need updating:
${ACCESS_OM_DIR}/src/mom/bin/environs.nci
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.360x300
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.1440x1080
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.3600x2700
${ACCESS_OM_DIR}/src/libaccessom2/build_on_raijin.sh
${ACCESS_OM_DIR}/src/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/util/make_dir/config.nci
Have I missed anything?
libcheck.sh is intended to give us an overview of all this.
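For anyone checking by hand in the meantime, here is a rough sketch (not libcheck.sh itself) that lists every hard-coded module load under the source tree, assuming ACCESS_OM_DIR is set as in the paths above:

```bash
# List "module load" / "module swap" lines in the build scripts so that no
# file is missed when bumping compiler, MPI or netCDF versions.
grep -rnE 'module (load|swap)' "${ACCESS_OM_DIR}/src" | sort
```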
https://github.com/COSIMA/access-om2/pull/178 builds with intel-compiler/2019.5.281, netcdf/4.7.1 and openmpi/4.0.1.
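For reference, the matching environment on gadi would look something like this (a sketch based on the module versions quoted above; the actual build scripts may load them differently):

```bash
# Module versions as used by the PR above (gadi module names).
module purge
module load intel-compiler/2019.5.281
module load netcdf/4.7.1
module load openmpi/4.0.1
```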
I guess we can close this issue now?
Gadi now has OpenMPI 4.0.2 installed, which is the latest release: https://www.open-mpi.org/software/ompi/v4.0/. Any objections to using that instead of 4.0.1? 4.0.2 fixes a lot of bugs: https://raw.githubusercontent.com/open-mpi/ompi/v4.0.x/NEWS
~@aidanheerdegen commented on slack that openmpi/4.0.1 throws segfaults, and~ Peter D has moved to openmpi/4.0.2 ~for this reason~. So I think we should also use openmpi/4.0.2. Any objections? ping @penguian, @nichannah, @russfiedler
I've changed these:
${ACCESS_OM_DIR}/src/mom/bin/environs.nci
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.360x300
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.1440x1080
${ACCESS_OM_DIR}/src/cice5/bld/config.nci.auscom.3600x2700
${ACCESS_OM_DIR}/src/libaccessom2/build_on_gadi.sh
${ACCESS_OM_DIR}/src/libaccessom2/build/oasis3-mct-prefix/src/oasis3-mct/util/make_dir/config.gadi
so that the gadi-transition branch now uses OpenMPI 4.0.2 in all executables.
The new gadi builds with OpenMPI 4.0.2 are here:
/g/data4/ik11/inputs/access-om2/bin/yatm_575fb04.exe
/g/data4/ik11/inputs/access-om2/bin/fms_ACCESS-OM_4a2f211_libaccessom2_575fb04.x
/g/data4/ik11/inputs/access-om2/bin/cice_auscom_360x300_24p_365bdc1_libaccessom2_575fb04.exe
/g/data4/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_365bdc1_libaccessom2_575fb04.exe
/g/data4/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_365bdc1_libaccessom2_575fb04.exe
I haven't tested whether they run.
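Short of a full run, one illustrative sanity check is to confirm each executable was actually linked against openmpi/4.0.2:

```bash
# Illustrative check only: show which MPI library each new executable links
# against (binary paths as listed above).
for exe in /g/data4/ik11/inputs/access-om2/bin/*575fb04*; do
    echo "== ${exe}"
    ldd "${exe}" | grep -i libmpi
done
```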
I do think it is worthwhile upgrading to OpenMPI 4.0.2 before pushing this to master, but so it is documented and doesn't disappear into the memory hole: Peter D said in yesterday's MOM meeting that the segfaults were with an older version of OpenMPI (3.x?) and that these were solved by moving to OpenMPI 4.
So this change does not, a priori, mean more stable performance with the tenth-degree config. The 1 and 0.25 degree configs seem to be fine.
Is there a compelling reason to hard-code the OpenMPI version? I'd suggest keeping it up to date with the latest version so that you don't get surprised as new versions are released and existing ones are deprecated or removed.
It is a very complex collection of code, model configuration and build environment. Keeping the build environment as stable as possible takes out one possible culprit when things stop working.
Just noting that intel-compiler/2020.0.166 is now installed on Gadi, whereas we are using intel-compiler/2019.5.281. Presumably there's no reason to switch to the newer compiler?
I generally wouldn't change things unless necessary. Just adds another possible thing to go wrong.
It is pretty trivial, so I'd suggest bedding down the new code/forcing versions and then upgrading that stuff later, when comparisons can easily be made.
A confounding factor is that you may be reluctant to change anything once OMIP-style runs have started, so it's up to you I guess.
Are we ready to close this issue now? AFAIK the gadi transition is now complete.
NCI is installing a new peak HPC, called gadi. The new machine will not support the 1.x series of OpenMPI, and current builds use 1.10.2. We will need to migrate to a new version of OpenMPI, which will also require a new version of the Intel Fortran compiler. This issue is a collection point for information about tests that have been performed, so as to not duplicate effort.