mauricehuguenin opened this issue 3 years ago
hmm, yes the latest libaccessom2 is set up for JRA55-do 1.4, which has separate solid and liquid runoff and is incompatible with JRA55-do 1.3. We may need to set up a JRA55-do 1.3 branch for libaccessom2 and cherry-pick the perturbation code changes.
@nichannah does that sound possible?
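For illustration only, a rough sketch of what that branch-and-cherry-pick workflow could look like (the branch name and commit hashes below are placeholders, not actual libaccessom2 commits):

# start a JRA55-do v1.3-compatible branch from an older libaccessom2 commit (placeholder hash)
git checkout -b jra55-v1.3-perturbations <v1.3-compatible-commit>
# bring across only the perturbation code changes (placeholder commit range)
git cherry-pick <first-perturbation-commit>^..<last-perturbation-commit>
# then rebuild the ACCESS-OM2 executables against this branch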
@mauricehuguenin your executables are really old - they use libaccessom2 1bb8904 from 10 Dec 2019.
There have been a lot of commits since then, so applying the perturbation code changes could be tricky, but that's really a question for @nichannah.
It looks like the most recent commit supporting JRA55-do v1.3 was f6cf437 from 16 Apr 2020, so that might make a better starting point.
JRA55-do 1.4 support was merged into master at 4198e15 but it looks like this branch also included some unrelated commits.
Thanks @aekiss! Yes, the 025deg_jra55_ryf9091_gadi spin-up was started at the end of December 2019, soon after Gadi came online. It would be a pity not to continue to use it given the resources that went into it.
@mauricehuguenin a good starting point might be to try using the f6cf437 libaccessom2 commit to extend the control run. If that works, then we can think about building the more recent perturbations code into that.
I fetched the commit from the 16th of April (https://github.com/COSIMA/025deg_jra55_ryf/commit/2eb6a35c0b20cf9f7751918cdfa9c221e92ad451) that has changes to atmosphere/forcing.json, config.yaml and ice/ice_input.nml. I then changed to the latest a227a61 executables, as those have the additive forcing functions. Extending the spin-up with the 2eb6a35c commit works fine; with the latest executables, however, I get this abort message:
MPI_ABORT was invoked on rank 1550 in communicator MPI_COMM_WORLD
with errorcode 1.
Do the latest .exe files require the licalvf input files? These are currently not in my atmosphere/forcing.json file from the 2eb6a35c commit.
@mauricehuguenin I presume your run is the one at /home/561/mv7494/access-om2/025deg_jra55_ryf_ENSOWind/? If so, the error looks like Invalid restart_format: nc. This seems to be a CICE error associated with the ice restarts (in https://github.com/COSIMA/cice5/blob/master/io_pio/ice_restart.F90). Something to do with the parallel IO changes?
However, in looking around I also noticed that there are many differences between your configs and the ones used for the spin-up (e.g. /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/, or equivalently https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi). E.g. you're using input_236a3011 rather than input_20200530 (although this may not make any difference). To me the best approach would be to start with the configs at https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi and update only what we need to. In this case, that means the changes to atmosphere/forcing.json and ice/ice_input.nml in COSIMA/025deg_jra55_ryf@2eb6a35 (and the executables of course).
I agree that this is the way to go. I made the following changes to Ryan's 025deg_jra55_ryf/ryf9091_gadi spin-up:

In atmosphere/forcing.json:
+ "cname": "runof_ai",
+ "domain": "land"
In config.yaml, the latest executables:
+ exe: /g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
+ exe: /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
+ exe: /g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_a227a61.exe
In ice/input_ice.nml:
+ fields_from_atm = 'swfld_i', 'lwfld_i', 'rain_i', 'snow_i', 'press_i', 'runof_i', 'tair_i', 'qair_i', 'uwnd_i', 'vwnd_i'
+ fields_to_ocn = 'strsu_io', 'strsv_io', 'rain_io', 'snow_io', 'stflx_io', 'htflx_io', 'swflx_io', 'qflux_io', 'shflx_io', 'lwflx_io', 'runof_io', 'press_io', 'aice_io', 'melt_io', 'form_io'
+ fields_from_ocn = 'sst_i', 'sss_i', 'ssu_i', 'ssv_i', 'sslx_i', 'ssly_i', 'pfmice_i'
+/
With these changes I run into the Invalid restart_format: nc abort. @aekiss do you maybe know what might be happening here? Is it something to do with the parallel IO changes mentioned by Ryan above (https://github.com/COSIMA/libaccessom2/issues/72#issuecomment-952390645)?
@rmholmes If you want to keep this spin-up, would an alternate option be to spin off a new control with the updated forcing (just use the ocean temp/salt state as the initial conditions), keep running the control you have for a decade, say, and then compare it to your new control run? You could then check whether you're happy that they're broadly similar, or, if they differ, whether the differences are what you'd expect. Or does this not really work as a strategy?
@aidanheerdegen that is another option, although changing forcing mid-way through a run is not very clean. If the differences between v1.3 and v1.4 are not significant it may not make a big difference.
@nichannah - it would be great to get your opinion on whether minor tweaks to make the code backwards-compatible are feasible.
The default restart format was changed to pio in recent executables. You could try setting restart_format = 'nc' in &setup_nml in ice/cice_in.nml. This will disable parallel IO, but that's less important at 0.25deg.
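That is, something like the following in ice/cice_in.nml (a minimal sketch; only the restart_format line is the change, and the rest of &setup_nml stays as it is):

&setup_nml
    ! ... existing setup options unchanged ...
    restart_format = 'nc'  ! write serial netCDF restarts rather than using parallel IO
/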
Thanks Andrew, this option is already active in ice/cice_in.nml, so it might be something else that is causing it.
Ah ok, that may be the problem - have you tried restart_format = 'pio'?
FYI I'm in the process of updating the model executables. This will include a fix to a bug in libaccessom2 a227a61.
It works! I extended the spin-up by two years and the output is what I expected.
I switched to restart_format = 'pio' in ice/cice_in.nml and also replaced the #Collation and #Misc flags in the config.yaml file with those from the latest commit (https://github.com/COSIMA/025deg_jra55_ryf/commit/2b2be7bb2152229688548fe7d648ef09932f0ae1) to avoid segmentation fault errors.
@mauricehuguenin I've put the latest executables here. It might be good to use these instead as they include a fix to a rounding error bug in libaccessom2. But they are completely untested so I'd be interested to hear if you have any issues with them.
/g/data/ik11/inputs/access-om2/bin/yatm_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM-BGC_6256fdc_libaccessom2_0ab7295.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_360x300_24p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_18x15.3600x2700_1682p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_0ab7295.exe
I can confirm that these latest executables work with no issues. 👍
@mauricehuguenin if this is working - can you close the issue?
Hi - Ryan and I are attempting to run an RYF9091 ACCESS-OM2-01 simulation that supports relative humidity forcing and the perturbations code. We have used the same executables (but for 1/10-deg) posted above by @aekiss and are restarting from /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/restart995.
The model is crashing because of what we believe to be a parallel I/O problem in the CICE outputs. The error logs are spitting out, among other things, the following:
ibhdf5.so          000014827787BD11  H5Dlayout_oh_cr    Unknown  Unknown
libhdf5.so         0000148277870EEF  H5Dcreate          Unknown  Unknown
libhdf5.so.103.1.  000014827787D455  Unknown            Unknown  Unknown
libhdf5.so         0000148277974C3B  H5O_obj_create     Unknown  Unknown
libhdf5.so.103.1.  0000148277938445  Unknown            Unknown  Unknown
libhdf5.so.103.1.  0000148277909FB2  Unknown            Unknown  Unknown
libhdf5.so         000014827790ABA0  H5G_traverse       Unknown  Unknown
libhdf5.so.103.1.  0000148277934D73  Unknown            Unknown  Unknown
libhdf5.so         0000148277939B72  H5L_link_object    Unknown  Unknown
libhdf5.so         000014827786E574  H5Dcreate_named    Unknown  Unknown
libhdf5.so         0000148277849473  H5Dcreate2         Unknown  Unknown
libnetcdf.so.18.0  000014827B26BBBD  Unknown            Unknown  Unknown
libnetcdf.so.18.0  000014827B26D099  Unknown            Unknown  Unknown
libnetcdf.so       000014827B26D854  nc4_rec_write_met  Unknown  Unknown
libnetcdf.so.18.0  000014827B26FADF  Unknown            Unknown  Unknown
libnetcdf.so       000014827B27061D  nc4_enddef_netcdf  Unknown  Unknown
libnetcdf.so.18.0  000014827B270180  Unknown            Unknown  Unknown
libnetcdf.so       000014827B27009D  NC4enddef          Unknown  Unknown
libnetcdf.so       000014827B2193EB  nc_enddef          Unknown  Unknown
cice_auscom_3600x  000000000093A87F  Unknown            Unknown  Unknown
cice_auscom_3600x  00000000006ADFAC  ice_history_write  947      ice_history_write.f90
cice_auscom_3600x  000000000066699F  ice_history_mp_ac  2023     ice_history.f90
cice_auscom_3600x  00000000004165C5  cice_runmod_mp_ci  411      CICE_RunMod.f90
cice_auscom_3600x  0000000000411212  MAIN               70       CICE.f90
cice_auscom_3600x  00000000004111A2  Unknown            Unknown  Unknown
libc-2.28.so       0000148279999493  libc_start_main    Unknown  Unknown
cice_auscom_3600x  00000000004110AE  Unknown            Unknown  Unknown
The model was crashing at the end of the first month when certain icefields_nml fields in cice_in.nml were set to 'm', and crashing at the end of the first day when they were set to 'd', so we are fairly confident the issue is coming from CICE.
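For context, those frequency flags live in &icefields_nml in cice_in.nml and look something like this (the particular fields shown here are only illustrative, not our actual diagnostic list):

&icefields_nml
    f_aice = 'm'  ! monthly sea ice concentration output ('d' = daily, 'x' = off)
    f_hi   = 'm'  ! monthly sea ice thickness output
/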
It would be great if someone could have a look at this to see what is going wrong. My files are at /home/561/mp2135/access-om2/01deg_jra55_ryf_cont/ and all my changes have been pushed here: https://github.com/mpudig/01deg_jra55_ryf/tree/v13_rcpcont.
Thanks!
The stack traces point to different builds; I don't know if that is relevant, but if they're built against different MPI and/or PIO/netCDF/HDF5 libraries it might be problematic:
34 0x0000000000933ade pioc_change_def() /home/156/aek156/github/COSIMA/access-om2-new/src/cice5/ParallelIO/src/clib/pioc_support.c:2985
35 0x00000000006ae0ec ice_history_write_mp_ice_write_hist_.V() /home/156/aek156/github/COSIMA/access-om2/src/cice5/build_auscom_3600x2700_722p/ice_history_write.f90:947
So specifically /home/156/aek156/github/COSIMA/access-om2-new/ and /home/156/aek156/github/COSIMA/access-om2.
Does this configuration work with other executables?
I also note that the status of the various pio_... calls is hardly ever checked before the pio_enddef call that finally fails. Naughty programmers!

Anyway, it's dying in ROMIO.
@aekiss, yes, we originally ran it with the following:
/g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_a227a61.exe
and it crashed at the end of the first month as well.
Are these the versions you need to use, or would something else be more ideal? If so I could try compiling that.
Those are the latest ones we have used (https://github.com/mpudig/01deg_jra55_ryf/blob/v13_rcpcont/config.yaml) and the issue is still occurring with them.
I should maybe add too that these executables worked when running the 1/4-degree configuration (with relative humidity forcing)!
You haven't set the correct switches for using PIO in the mpirun command in config.yaml, e.g.
mpirun: --mca io ompio --mca io_ompio_num_aggregators 1
Also you want to set the UCX_LOG_LEVEL. See, for example:

/g/data/ik11/outputs//access-om2-01/01deg_jra55v140_iaf_cycle4/output830/config.yaml
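For example, a minimal sketch of the relevant config.yaml entries (the env: block syntax and the UCX_LOG_LEVEL value are assumptions; copy the exact settings from the reference config above):

mpirun: --mca io ompio --mca io_ompio_num_aggregators 1
env:
    UCX_LOG_LEVEL: 'error'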
Ah thanks @russfiedler, that indeed looks promising. @mpudig can you try again, including all the options between # Misc and userscripts in the config.yaml that Russ has listed above? Don't add the userscripts, because currently those aren't in your config directory.
Also - I guess it would be best to remove the specification of openmpi/4.0.1, as that could clash with the versions used for compilation?
There could be other things in the cice_in.nml file that might need checking for PIO use.
You're probably right about that openmpi version. I'm not sure why it's there or what its effect is.
Specifying modules like that overrides the automatic discovery using ldd, which is what mpirun does too, I believe. Yes, it is best not to do that, and just let it find the right one to use.
I've compared the cice_in.nml files (see /scratch/e14/rmh561/diff_cice_in.nml). The only differences I see that could be relevant are the history_chunksize ones - do these need to be specified for the parallel I/O?
yes, Nic added these for parallel IO
might be worth comparing your whole config with https://github.com/COSIMA/01deg_jra55_ryf to see if there's anything else amiss
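For reference, a hedged sketch of what those entries look like in cice_in.nml (the namelist group and the chunk sizes shown are assumptions/placeholders; copy the actual values from the reference config):

&setup_nml
    history_chunksize_x = 180  ! placeholder value
    history_chunksize_y = 180  ! placeholder value
/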
Hi, thanks all for your comments the other day. Implementing Russ's suggestion to include mpirun: --mca io ompio --mca io_ompio_num_aggregators 1 in config.yaml and Ryan's to add history_chunksize to cice_in.nml fixed the original issue: the model ran successfully past month 1 and completed a full 3-month simulation.
However, the output has troubled us slightly. Compared with the ik11 run over the same period (/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output996), there seem to be some physical differences in sea ice and some other variables. I'm attaching plots of the global average salt in my run and the ik11 run, as well as the difference in sea ice concentration between my run and the ik11 run. There seems to be systematically more sea ice in my run than in the ik11 run. My run is sitting at /scratch/e14/mp2135/access-om2/archive/01deg_jra55_ryf_cont/.
We can't see any major changes in the ice configs between our run and the ik11 run. However, there are lots of changes in the CICE executable between commits 2572851 and d3e8bdf, which seem mostly to do with parallel I/O and WOMBAT. Do you think the (small) changes we are seeing are plausible given these executable changes, or has something gone awry?
I can see that the run on ik11 uses additional input for mom:
input:
- /g/data/ik11/inputs/access-om2/input_08022019/mom_01deg
- /g/data/x77/amh157/passive/passive4
Matt is running without these passive fields on /g/data/x77. Is this input maybe causing the difference in the global fields? Unfortunately I am not a member of x77 and cannot have a look at the fields.
@mauricehuguenin that's just a passive tracer that Andy had included in the original control run. It won't influence the physics.
Hmm, that seems surprising to me. Have you carefully checked all your .nml files? nmltab can make this easier: https://github.com/aekiss/nmltab
You're using all the same input files, right?
One difference is that we use RYF.r_10.1990_1991.nc instead of RYF.q_10.1990_1991.nc as an atmospheric input field. But since no perturbation has been applied, this shouldn't change things substantially. I think @rmholmes has tested this pretty extensively.
There are a few differences between some .nml files. I assume they're mostly because of various updates since the ik11 simulation was run (but maybe not...?):

In ocean/input.nml:

- The ik11 run has max_axes = 100 under &diag_manager_nml, whereas my run doesn't.

In ice/input_ice.nml:

- My run has fields_from_atm, fields_to_ocn and fields_from_ocn options, whereas the ik11 run doesn't.

In ice/cice_in.nml:

- My run has istep0 = 0, whereas the ik11 run has istep0 = 6454080. (Does this seem strange?!)
- My run has runtype = 'initial', whereas the ik11 run has runtype = 'continue'.
- My run has restart = .false., whereas the ik11 run has restart = .true.
- My run has restart_format = 'pio', whereas the ik11 run has restart_format = 'nc'.
- My run has history_chunksize_x and _y (per Ryan's comment above).
In addition to Matt's comments above, yes we're using the same inputs (input_08022019).
We (Ryan, me) would like to run some perturbation experiments in ACCESS-OM2-025 using the new perturbations code in libaccessom2 via atmosphere/forcing.json. We would like to branch these simulations off the 650-year 025deg_jra55_ryf spin-up at /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi.

However, this spin-up was performed with old executables (see https://github.com/rmholmes/025deg_jra55_ryf/blob/ryf9091_gadi/config.yaml) that do not contain the new libaccessom2 perturbations code. Unfortunately it looks like the new executables (with libaccessom2 hash a227a61) aren't backwards compatible with the config files from the old spin-up. Specifically, we get the error:

assertion failed: accessom2_sync_config incompatible config between atm and ice: num_atm_to_ice_fields

which seems to be linked to changes to ice/input_ice.nml, which now requires the exchanged fields to be specified (e.g. through the fields_from_atm input), so the number of fields no longer matches.

@nichannah @aekiss do you have any suggestions on the best approach to pursue in order to get this working? We would really like to avoid doing another spin-up given the cost and time involved.

One approach might be to create new executables, based on those used for the spin-up, that only include the new libaccessom2 code involving the perturbations. Another might be to update the config files as much as possible (still using JRA55 v1.3), but still use the old restarts, and hope/evaluate that nothing material to the solution has changed. Any suggestions would be really helpful.