COSIMA / libaccessom2

ACCESS-OM2 library

Using the latest executables in an older simulation #72

Open mauricehuguenin opened 3 years ago

mauricehuguenin commented 3 years ago

We (Ryan and I) would like to run some perturbation experiments in ACCESS-OM2-025 using the new perturbations code in libaccessom2, configured via atmosphere/forcing.json. We would like to branch these simulations off the 650-year 025deg_jra55_ryf spin-up at /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi.

However, this spin-up was performed with old executables (see https://github.com/rmholmes/025deg_jra55_ryf/blob/ryf9091_gadi/config.yaml) that do not contain the new libaccessom2 perturbations code. Unfortunately it looks like the new executables (with libaccessom2 hash _a227a61.exe) aren't backwards compatible with the config files from the old spin-up. Specifically, we get the error assertion failed: accessom2_sync_config incompatible config between atm and ice: num_atm_to_ice_fields, which seems to be linked to ice/input_ice.nml now requiring the exchanged fields to be specified explicitly (e.g. through the fields_from_atm input), so that the number of fields no longer matches.

@nichannah @aekiss do you have any suggestions on the best approach to pursue in order to get this working? We would really like to avoid doing another spin-up given the cost and time involved.

One approach might be to create new executables based on those used for the spin-up that only include the new libaccessom2 code involving the perturbations. Another might be to update the config files as much as possible (still using JRA55 v1.3), keep using the old restarts, and hope/check that nothing material to the solution has changed. Any suggestions would be really helpful.

aekiss commented 3 years ago

hmm, yes the latest libaccessom2 is set up for JRA55-do 1.4 which has separate solid and liquid runoff and is incompatible with JRA55-do 1.3. We may need to set up a JRA55-do 1.3 branch for libaccessom2 and cherry-pick the perturbation code changes.

@nichannah does that sound possible?

aekiss commented 3 years ago

@mauricehuguenin your executables are really old - they use libaccessom2 1bb8904 from 10 Dec 2019.

There have been a lot of commits since then, so applying the perturbation code changes could be tricky, but that's really a question for @nichannah.

It looks like the most recent commit supporting JRA55-do v1.3 was f6cf437 from 16 Apr 2020, so that might make a better starting point.

JRA55-do 1.4 support was merged into master at 4198e15 but it looks like this branch also included some unrelated commits.

See https://github.com/COSIMA/libaccessom2/network

rmholmes commented 3 years ago

Thanks @aekiss! Yes the 025deg_jra55_ryf9091_gadi spin-up was started at the end of December 2019, soon after Gadi came online. It would be a pity not to continue to use it given the resources that went into it.

@mauricehuguenin a good starting point might be to try using the f6cf437 libaccessom2 commit to extend the control run. If that works, then we can think about building the more recent perturbations code into that.

mauricehuguenin commented 3 years ago

I fetched the commit from the 16th of April (https://github.com/COSIMA/025deg_jra55_ryf/commit/2eb6a35c0b20cf9f7751918cdfa9c221e92ad451), which has changes to atmosphere/forcing.json, config.yaml and ice/ice_input.nml. I then changed to the latest _a227a61 executables as those have the additive forcing functions.

Extending the spin-up with the 2eb6a35c commit works fine; however, with the latest executables I get this abort message:

MPI_ABORT was invoked on rank 1550 in communicator MPI_COMM_WORLD
with errorcode 1.

Do the latest .exe files require the licalvf input files? These are currently not in my atmosphere/forcing.json file from the 2eb6a35c commit.

rmholmes commented 3 years ago

@mauricehuguenin I presume your run is the one at /home/561/mv7494/access-om2/025deg_jra55_ryf_ENSOWind/? If so, the error looks like Invalid restart_format: nc. This seems to be a CICE error associated with the ice restarts (in https://github.com/COSIMA/cice5/blob/master/io_pio/ice_restart.F90). Something to do with the Parallel IO changes?

However, in looking around I also noticed that there are many differences between your configs and the ones used for the spin-up (e.g. /g/data/ik11/outputs/access-om2-025/025deg_jra55_ryf9091_gadi/, or equivalently https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi). E.g. you're using input_236a3011 rather than input_20200530 (although this may not make any difference). To me the best approach would be to start with the configs at https://github.com/rmholmes/025deg_jra55_ryf/tree/ryf9091_gadi and update only what we need to. In this case: the changes to atmosphere/forcing.json and ice/ice_input.nml in COSIMA/025deg_jra55_ryf@2eb6a35 (and the executables, of course).

mauricehuguenin commented 3 years ago

I agree that this is the way to go. With the following changes to Ryan's 025deg_jra55_ryf/ryf9091_gadi spin-up:

In atmosphere/forcing.json:

+      "cname": "runof_ai",
+       "domain": "land"

In config.yaml, the latest executables:

+      exe: /g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
+      exe: /g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
+      exe: /g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_a227a61.exe

In /ice/input_ice.nml:

+    fields_from_atm = 'swfld_i', 'lwfld_i', 'rain_i', 'snow_i', 'press_i', 'runof_i', 'tair_i', 'qair_i', 'uwnd_i', 'vwnd_i'
+    fields_to_ocn = 'strsu_io', 'strsv_io', 'rain_io', 'snow_io', 'stflx_io', 'htflx_io', 'swflx_io', 'qflux_io', 'shflx_io', 'lwflx_io', 'runof_io', 'press_io', 'aice_io', 'melt_io', 'form_io'
+    fields_from_ocn = 'sst_i', 'sss_i', 'ssu_i', 'ssv_i', 'sslx_i', 'ssly_i', 'pfmice_i'
+/

I run into the Invalid restart_format: nc abort. @aekiss do you maybe know what might be going wrong here? Is it something to do with the parallel IO changes Ryan mentioned above (https://github.com/COSIMA/libaccessom2/issues/72#issuecomment-952390645)?

aidanheerdegen commented 3 years ago

@rmholmes If you want to keep this spin-up, would an alternative option be to spin off a new control with the updated forcing (just use the ocean temp/salt state as the initial conditions), keep running the control you have for, say, a decade, and compare the two? Then you could check whether you're happy that they're broadly similar, or that any differences are what you'd expect. Or does this not really work as a strategy?

rmholmes commented 3 years ago

@aidanheerdegen that is another option, although changing forcing mid-way through a run is not very clean. If the differences between v1.3 and v1.4 are not significant it may not make a big difference.

@nichannah - it would be great to get your opinion on whether minor tweaks to make the code backwards compatible are feasible.

aekiss commented 3 years ago

The default restart format was changed to pio in recent executables. You could try setting restart_format = 'nc' in &setup_nml in ice/cice_in.nml. This will disable parallel IO but that's less important at 0.25deg.
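
For reference, the relevant lines in ice/cice_in.nml would look something like this (a minimal sketch; the other &setup_nml settings are omitted, and 'pio' is the value that selects parallel restart IO):

&setup_nml
    restart_format = 'nc'
/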

mauricehuguenin commented 3 years ago

Thanks Andrew, this option is already active in ice/cice_in.nml so it might be something else that is causing it.

aekiss commented 3 years ago

Ah ok, that may be the problem - have you tried restart_format = 'pio'?

aekiss commented 3 years ago

FYI I'm in the process of updating the model executables. This will include a fix to a bug in libaccessom2 a227a61.

mauricehuguenin commented 3 years ago

It works! I extended the spin-up by two years and the output is what I expected.

I switched to restart_format = 'pio' in ice/cice_in.nml and also replaced the #Collation and #Misc flags in the config.yaml file with those of the latest https://github.com/COSIMA/025deg_jra55_ryf/commit/2b2be7bb2152229688548fe7d648ef09932f0ae1 commit to avoid segmentation fault errors.

aekiss commented 3 years ago

@mauricehuguenin I've put the latest executables below. It might be good to use these instead, as they include a fix to a rounding-error bug in libaccessom2. But they are completely untested, so I'd be interested to hear if you have any issues with them:

/g/data/ik11/inputs/access-om2/bin/yatm_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM-BGC_6256fdc_libaccessom2_0ab7295.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_360x300_24p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_18x15.3600x2700_1682p_2572851_libaccessom2_0ab7295.exe
/g/data/ik11/inputs/access-om2/bin/cice_auscom_1440x1080_480p_2572851_libaccessom2_0ab7295.exe

mauricehuguenin commented 2 years ago

I can confirm that these latest executables work with no issues. 👍

rmholmes commented 2 years ago

@mauricehuguenin if this is working - can you close the issue?

mpudig commented 2 years ago

Hi - Ryan and I are attempting to run an RYF9091 ACCESS-OM2-01 simulation that supports relative humidity forcing and the perturbations code. We have used the same executables (but for 1/10-deg) posted above by @aekiss and are restarting from /g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/restart995.

The model is crashing because of what we believe to be a parallel I/O problem in the CICE outputs. The error logs are spitting out, among other things, the following:

ibhdf5.so         000014827787BD11  H5Dlayout_oh_cr    Unknown  Unknown
libhdf5.so        0000148277870EEF  H5Dcreate          Unknown  Unknown
libhdf5.so.103.1. 000014827787D455  Unknown            Unknown  Unknown
libhdf5.so        0000148277974C3B  H5O_obj_create     Unknown  Unknown
libhdf5.so.103.1. 0000148277938445  Unknown            Unknown  Unknown
libhdf5.so.103.1. 0000148277909FB2  Unknown            Unknown  Unknown
libhdf5.so        000014827790ABA0  H5G_traverse       Unknown  Unknown
libhdf5.so.103.1. 0000148277934D73  Unknown            Unknown  Unknown
libhdf5.so        0000148277939B72  H5L_link_object    Unknown  Unknown
libhdf5.so        000014827786E574  H5Dcreate_named    Unknown  Unknown
libhdf5.so        0000148277849473  H5Dcreate2         Unknown  Unknown
libnetcdf.so.18.0 000014827B26BBBD  Unknown            Unknown  Unknown
libnetcdf.so.18.0 000014827B26D099  Unknown            Unknown  Unknown
libnetcdf.so      000014827B26D854  nc4_rec_write_met  Unknown  Unknown
libnetcdf.so.18.0 000014827B26FADF  Unknown            Unknown  Unknown
libnetcdf.so      000014827B27061D  nc4_enddef_netcdf  Unknown  Unknown
libnetcdf.so.18.0 000014827B270180  Unknown            Unknown  Unknown
libnetcdf.so      000014827B27009D  NC4enddef          Unknown  Unknown
libnetcdf.so      000014827B2193EB  nc_enddef          Unknown  Unknown
cice_auscom_3600x 000000000093A87F  Unknown            Unknown  Unknown
cice_auscom_3600x 00000000006ADFAC  ice_history_write  947      ice_history_write.f90
cice_auscom_3600x 000000000066699F  ice_history_mp_ac  2023     ice_history.f90
cice_auscom_3600x 00000000004165C5  cice_runmod_mp_ci  411      CICE_RunMod.f90
cice_auscom_3600x 0000000000411212  MAIN               70       CICE.f90
cice_auscom_3600x 00000000004111A2  Unknown            Unknown  Unknown
libc-2.28.so      0000148279999493  libc_start_main    Unknown  Unknown
cice_auscom_3600x 00000000004110AE  Unknown            Unknown  Unknown

The model was crashing at the end of the first month when certain icefields_nml fields in cice_in.nml were set to 'm', and crashing at the end of the first day when set to 'd', so we are fairly confident the issue is coming from CICE.
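
For reference, these per-field frequency flags in ice/cice_in.nml take roughly this form (f_aice and f_hi are just two examples of the many output fields; 'd' requests daily and 'm' monthly output):

&icefields_nml
    f_aice = 'd'
    f_hi = 'd'
/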

It would be great if someone could have a look at this to see what is going wrong. My files are at /home/561/mp2135/access-om2/01deg_jra55_ryf_cont/ and all my changes have been pushed here: https://github.com/mpudig/01deg_jra55_ryf/tree/v13_rcpcont.

Thanks!

aidanheerdegen commented 2 years ago

The stack traces point to different builds; I don't know if that is relevant, but if they're built against different MPI and/or PIO/netCDF/HDF5 libraries it might be problematic:

34 0x0000000000933ade pioc_change_def()  /home/156/aek156/github/COSIMA/access-om2-new/src/cice5/ParallelIO/src/clib/pioc_support.c:2985
35 0x00000000006ae0ec ice_history_write_mp_ice_write_hist_.V()  /home/156/aek156/github/COSIMA/access-om2/src/cice5/build_auscom_3600x2700_722p/ice_history_write.f90:947

So specifically /home/156/aek156/github/COSIMA/access-om2-new/ and /home/156/aek156/github/COSIMA/access-om2

aekiss commented 2 years ago

Does this configuration work with other executables?

russfiedler commented 2 years ago

I also note that the status of the various pio_... calls is hardly ever checked before the pio_enddef call that finally fails is called. Naughty programmers! Anyway, it's dying in ROMIO.

mpudig commented 2 years ago

@aekiss, yes, we ran it with

/g/data/ik11/inputs/access-om2/bin/yatm_a227a61.exe
/g/data/ik11/inputs/access-om2/bin/fms_ACCESS-OM_af3a94d_libaccessom2_a227a61.x
/g/data/ik11/inputs/access-om2/bin/cice_auscom_3600x2700_722p_2572851_libaccessom2_a227a61.exe

originally and it crashed at the end of the first month as well.

aekiss commented 2 years ago

Are these the versions you need to use, or would something else be better? If so, I could try compiling that.

mpudig commented 2 years ago

Those are the latest ones we have used (https://github.com/mpudig/01deg_jra55_ryf/blob/v13_rcpcont/config.yaml) and the issue is still occurring with them.

I should maybe add too that these executables worked when running the 1/4-degree configuration (with relative humidity forcing)!

russfiedler commented 2 years ago

You haven't set the correct switches for using PIO in the mpirun command in config.yaml.

e.g. mpirun: --mca io ompio --mca io_ompio_num_aggregators 1. You also want to set UCX_LOG_LEVEL. See, for example:

/g/data/ik11/outputs//access-om2-01/01deg_jra55v140_iaf_cycle4/output830/config.yaml
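
For illustration, the relevant part of config.yaml would look something like this (a sketch only; the env entry is one possible place to set UCX_LOG_LEVEL and its value is a placeholder, so take the exact settings from the file above):

mpirun: --mca io ompio --mca io_ompio_num_aggregators 1
env:
    UCX_LOG_LEVEL: 'error'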

rmholmes commented 2 years ago

Ah thanks @russfiedler, that indeed looks promising. @mpudig can you try again, including all the options between # Misc and userscripts in the config.yaml that Russ linked above? Don't add the userscripts, because those aren't currently in your config directory.

rmholmes commented 2 years ago

Also - I guess it would be best to remove the specification of openmpi/4.0.1 as that could clash with the versions used for compilation?

russfiedler commented 2 years ago

There could be other things in the cice_in.nml file that might need checking for PIO use. You're probably right about that openmpi version. I'm not sure why it's there or what its effect is.

aidanheerdegen commented 2 years ago

Specifying modules like that overrides the automatic discovery via ldd (which is what mpirun does too, I believe). Yes, it is best not to do that and just let it find the right one to use.

rmholmes commented 2 years ago

I've compared the cice_in.nml files (see /scratch/e14/rmh561/diff_cice_in.nml). The only differences I see that could be relevant are the history_chunksize ones - do these need to be specified for the parallel I/O?

aekiss commented 2 years ago

Yes, Nic added these for parallel IO.
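
For illustration, the additions to ice/cice_in.nml take roughly this form (placeholders only; copy the actual values and namelist group placement from the reference config):

+    history_chunksize_x = <value from reference cice_in.nml>
+    history_chunksize_y = <value from reference cice_in.nml>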

aekiss commented 2 years ago

It might be worth comparing your whole config with https://github.com/COSIMA/01deg_jra55_ryf to see if there's anything else amiss.

mpudig commented 2 years ago

Hi, thanks all for your comments the other day. Implementing Russ's comments on including mpirun: --mca io ompio --mca io_ompio_num_aggregators 1 in config.yaml and Ryan's on adding history_chunksize to cice_in.nml fixed the original issue: the model ran successfully past month 1 and completed a full 3-month simulation.

However, the output has troubled us slightly. Compared to the ik11 run over the same period (/g/data/ik11/outputs/access-om2-01/01deg_jra55v13_ryf9091/output996), there seem to be some physical differences in sea ice and some other variables. I'm attaching plots of the global average salt in my run and the ik11 run, as well as the difference in sea ice concentration between my run and the ik11 run. There seems to be systematically more sea ice in my run than in the ik11 run. My run is sitting at /scratch/e14/mp2135/access-om2/archive/01deg_jra55_ryf_cont/.

[Attached plots: compare_sea_ice_conc, comparing_salt]

We can't see any major changes in the ice configs between our run and the ik11 run. However, there are lots of changes in the CICE executable between commits 2572851 and d3e8bdf, which seem mostly to do with parallel I/O and WOMBAT. Do you think the (small) differences we are seeing are plausible given these executable changes, or has something gone awry?

mauricehuguenin commented 2 years ago

I can see that the run on ik11 uses additional input for mom:

input:
          - /g/data/ik11/inputs/access-om2/input_08022019/mom_01deg
          - /g/data/x77/amh157/passive/passive4

Matt is running without these passive fields from /g/data/x77. Could this input be causing the difference in the global fields? Unfortunately I am not a member of x77, so I cannot have a look at the fields.

rmholmes commented 2 years ago

@mauricehuguenin that's just a passive tracer that Andy had included in the original control run. It won't influence the physics.

aekiss commented 2 years ago

Hmm, that seems surprising to me. Have you carefully checked all your .nml files? nmltab can make this easier: https://github.com/aekiss/nmltab

aekiss commented 2 years ago

You're using all the same input files, right?

mpudig commented 2 years ago

One difference is that we use RYF.r_10.1990_1991.nc instead of RYF.q_10.1990_1991.nc as an atmospheric input field. But since no perturbation has been applied this shouldn't change things substantially. I think @rmholmes has tested this pretty extensively.

There are a few differences between some .nml files. I assume they're mostly because of various updates since the ik11 simulation was run (but maybe not...?):

In ocean/input.nml:

  • The ik11 run has max_axes = 100 under &diag_manager_nml, whereas my run doesn't.

In ice/input_ice.nml

  • My run has fields_from_atm, fields_to_ocn and fields_from_ocn options, whereas the ik11 run doesn't.

In ice/cice_in.nml

  • My run has istep0 = 0, whereas the ik11 run has istep0 = 6454080. (Does this seem strange?!)
  • My run has runtype = 'initial', whereas the ik11 run has runtype = 'continue'.
  • My run has restart = .false., whereas the ik11 run has restart = .true..
  • My run has restart_format = 'pio', whereas the ik11 run has restart_format = 'nc'.
  • My run has history_chunksize_x and _y (per Ryan's comment above).

rmholmes commented 2 years ago

In addition to Matt's comments above, yes we're using the same inputs (input_08022019).