Issue closed by aekiss 1 year ago
For strict bit-for-bit reproducibility, `srcTermProcessing=1` and `termOrder=srcseq` are required in `nuopc.runseq`. See details here and here.
_[edit: setting this in `nuopc.runseq` actually isn't necessary for reproducibility]_
These runs have identical initial conditions (cold start), identical inputs, parameters and executables:
$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/output000/manifests/ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/output000/manifests/
$
but the resulting restarts for cice, coupler and mom6 differ (whereas the datm and drof restarts are identical):
$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ
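As an aside on tooling: a binary `diff` only says the files differ. The stdlib-only sketch below locates the byte offset of the first difference, which at least distinguishes a header difference from differences deep in the variable data. (`first_diff_offset` is a hypothetical helper, not part of any tool mentioned here; a NetCDF-aware comparison such as `nccmp` would be more informative.)

```python
# Locate the byte offset of the first difference between two files.
# Hypothetical stdlib-only helper for narrowing down where restarts diverge.
def first_diff_offset(path_a, path_b, chunk=1 << 20):
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        offset = 0
        while True:
            a = fa.read(chunk)
            b = fb.read(chunk)
            if a != b:
                n = min(len(a), len(b))
                for i in range(n):
                    if a[i] != b[i]:
                        return offset + i
                return offset + n  # one file is a prefix of the other
            if not a:
                return None  # files are identical
            offset += chunk
```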
`repro_test_3` and `repro_test_4` confirm the lack of reproducibility with the latest debug build `/g/data/ik11/inputs/access-om3/bin/access-om3-MOM6-CICE6-Debug-ce8d88e`:
$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ
`repro_test_5` and `repro_test_6` also don't reproduce, despite using `srcTermProcessing=1:termOrder=srcseq` in `nuopc.runseq` as described here, which is supposed to provide the strictest bit-for-bit reproducibility in remapping:
$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ
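For context on why `termOrder` could matter at all: each destination value in a remap is a weighted sum of source terms, and floating-point addition is not associative, so summing identical terms in a different order can change the low-order bits. A minimal illustration with contrived values (not taken from ESMF):

```python
# Floating-point addition is not associative: summing the same terms in a
# different order can give a different result. Contrived example values.
def left_to_right(xs):
    s = 0.0
    for x in xs:
        s += x
    return s

assert left_to_right([1e16, 1.0, -1e16]) == 0.0  # the 1.0 is absorbed by 1e16
assert left_to_right([1e16, -1e16, 1.0]) == 1.0  # the large terms cancel first
```

Fixing the summation order (which is what `termOrder=srcseq` does) removes this source of run-to-run variation within the remap itself.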
I'm not sure what to try next.
Bit-for-bit reproducibility is discussed in the ESMF docs and is relevant to many subroutines - search the ESMF documentation for "bit-for-bit".
Couldn't run for 1 timestep with these settings in `nuopc.runconfig`, due to a segmentation fault:
restart_n = 1
restart_option = nsteps
...
stop_n = 1
stop_option = nsteps
[gadi-cpu-clx-0432:475851:0:475851] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6630)
==== backtrace (tid: 475877) ====
0 0x0000000000012cf0 __funlockfile() :0
1 0x0000000001bb12ee mom_mp_mom_state_is_synchronized_() /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/src/core/MOM.F90:3830
2 0x0000000001a7965d mom_ocean_model_nuopc_mp_ocean_model_restart_() /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/config_src/drivers/nuopc_cap/mom_ocean_model_nuopc.F90:723
3 0x0000000001a0b784 mom_cap_mod_mp_modeladvance_() /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/config_src/drivers/nuopc_cap/mom_cap.F90:1690
4 0x0000000000fd0938 ESMCI::MethodElement::execute() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
5 0x0000000000fd089a ESMCI::MethodTable::execute() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
6 0x0000000000fcf462 c_esmc_methodtableexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
7 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
8 0x00000000069db2cd nuopc_modelbase_mp_routine_run_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2220
9 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
10 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
11 0x0000000000d565aa ESMCI::VMK::enter() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
12 0x0000000001117c72 ESMCI::VM::enter() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
13 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
14 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
15 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
16 0x0000000000695ea7 nuopc_driver_mp_routine_executegridcomp_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
17 0x00000000006956fc nuopc_driver_mp_executerunsequence_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
18 0x0000000000fd0938 ESMCI::MethodElement::execute() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
19 0x0000000000fd089a ESMCI::MethodTable::execute() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
20 0x0000000000fcf462 c_esmc_methodtableexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
21 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
22 0x0000000000692052 nuopc_driver_mp_routine_run_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
23 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
24 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
25 0x0000000000d565aa ESMCI::VMK::enter() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
26 0x0000000001117c72 ESMCI::VM::enter() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
27 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
28 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
29 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
30 0x0000000000695ea7 nuopc_driver_mp_routine_executegridcomp_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
31 0x00000000006956fc nuopc_driver_mp_executerunsequence_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
32 0x0000000000fd0938 ESMCI::MethodElement::execute() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
33 0x0000000000fd089a ESMCI::MethodTable::execute() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
34 0x0000000000fcf462 c_esmc_methodtableexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
35 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
36 0x0000000000692052 nuopc_driver_mp_routine_run_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
37 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
38 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
39 0x0000000000d565aa ESMCI::VMK::enter() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
40 0x0000000001117c72 ESMCI::VM::enter() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
41 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
42 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
43 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_() /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
44 0x0000000000431bb4 MAIN__() /g/data/v45/aek156/access-om3-build/access-om3/CMEPS/CMEPS/cesm/driver/esmApp.F90:141
45 0x0000000000430d62 main() ???:0
46 0x000000000003ad85 __libc_start_main() ???:0
47 0x0000000000430c6e _start() ???:0
=================================
I was able to do a 2-timestep run. The same 3 restarts still differ:
$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.cice.r.0001-01-01-07200.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.cice.r.0001-01-01-07200.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.cpl.r.0001-01-01-07200.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.cpl.r.0001-01-01-07200.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.mom6.r.0001-01-01-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.mom6.r.0001-01-01-00000.nc differ
@aekiss I haven't checked my AMIP runs but I'd be very surprised if they were reproducible! Can you share your nuopc.runseq and nuopc.runconfig?
I think the `nuopc.runseq` flags only affect the transfer of data between the components and the mediator. These just copy data (no regridding involved), so they should be bit-reproducible anyway. All the actual regridding happens in CMEPS, so we'd have to convince CMEPS to do bit-reproducible regridding. Not sure if this is possible or not - I'll have a bit of a dig around later today.
Thanks @kieranricardo, I hadn't realised that about the `nuopc.runseq` flags. Bit reproducibility is really important, e.g. so we can re-run sections of an experiment with different outputs, or do regression testing.
Although MOM6 and CICE6 have the same grid dimensions (lon x lat = 320x384), regridding is needed because MOM6 is C-grid and we are using B-grid CICE6 (at present). The JRA55 data stream has different grid dimensions (640x320) so is also regridded.
When we switch to C-grid CICE6 there will be no need to regrid to couple with MOM6, so any reproducibility issues there should disappear.
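As a toy sketch of the kind of transfer involved (not the actual CMEPS or cap code, and with made-up numbers): B-grid velocities live at cell corners, so a simple (non-conservative) move to face or centre points averages adjacent corner values:

```python
# Toy illustration of a B-grid -> C-grid velocity transfer along one row:
# B-grid u sits at cell corners; averaging adjacent corners gives a value
# at the face between them. Hypothetical corner values, not model data.
ub = [0.0, 1.0, 3.0, 6.0]                         # corner values
uc = [0.5 * (a + b) for a, b in zip(ub, ub[1:])]  # face values between corners

assert uc == [0.5, 2.0, 4.5]
```

Each such averaging step is another floating-point operation whose result depends on how it is implemented, which is why extra grid hops are worth avoiding.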
The mediator log `med.log` shows the regridding method used for each field:
$ grep '^ mapping' /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/output000/log/med.log
mapping atm->ocn Sa_u via patch_uv3d with one normalization
mapping atm->ocn Sa_v via patch_uv3d with one normalization
mapping atm->ocn Sa_z via bilnr with one normalization
mapping atm->ocn Sa_tbot via bilnr with one normalization
mapping atm->ocn Sa_pbot via bilnr with one normalization
mapping atm->ocn Sa_shum via bilnr with one normalization
mapping atm->ocn Sa_ptem via bilnr with one normalization
mapping atm->ocn Sa_dens via bilnr with one normalization
mapping atm->ocn Faxa_swnet via consf with one normalization
mapping atm->ocn Faxa_rainc via consf with one normalization
mapping atm->ocn Faxa_rainl via consf with one normalization
mapping atm->ocn Faxa_snowc via consf with one normalization
mapping atm->ocn Faxa_snowl via consf with one normalization
mapping atm->ocn Faxa_lwdn via consf with one normalization
mapping atm->ocn Faxa_swndr via consf with one normalization
mapping atm->ocn Faxa_swvdr via consf with one normalization
mapping atm->ocn Faxa_swndf via consf with one normalization
mapping atm->ocn Faxa_swvdf via consf with one normalization
mapping atm->ocn Sa_pslv via bilnr with one normalization
mapping atm->ice Sa_u via patch_uv3d with one normalization
mapping atm->ice Sa_v via patch_uv3d with one normalization
mapping atm->ice Sa_z via bilnr with one normalization
mapping atm->ice Sa_tbot via bilnr with one normalization
mapping atm->ice Sa_pbot via bilnr with one normalization
mapping atm->ice Sa_shum via bilnr with one normalization
mapping atm->ice Sa_ptem via bilnr with one normalization
mapping atm->ice Sa_dens via bilnr with one normalization
mapping atm->ice Faxa_swnet via consf with one normalization
mapping atm->ice Faxa_rainc via consf with one normalization
mapping atm->ice Faxa_rainl via consf with one normalization
mapping atm->ice Faxa_snowc via consf with one normalization
mapping atm->ice Faxa_snowl via consf with one normalization
mapping atm->ice Faxa_lwdn via consf with one normalization
mapping atm->ice Faxa_swndr via consf with one normalization
mapping atm->ice Faxa_swvdr via consf with one normalization
mapping atm->ice Faxa_swndf via consf with one normalization
mapping atm->ice Faxa_swvdf via consf with one normalization
mapping atm->ice Faxa_bcph via consf with one normalization
mapping atm->ice Faxa_dstwet via consf with one normalization
mapping atm->ice Faxa_dstdry via consf with one normalization
mapping atm->ice Sa_pslv via bilnr with one normalization
mapping ocn->ice So_omask via fcopy
mapping ocn->ice So_t via fcopy
mapping ocn->ice So_s via fcopy
mapping ocn->ice So_u via fcopy
mapping ocn->ice So_v via fcopy
mapping ocn->ice So_dhdx via fcopy
mapping ocn->ice So_dhdy via fcopy
mapping ocn->ice Fioo_q via fcopy
mapping ice->ocn Faii_swnet via fcopy
mapping ice->ocn Si_ifrac via fcopy
mapping ice->ocn Fioi_swpen via fcopy
mapping ice->ocn Fioi_swpen_vdr via fcopy
mapping ice->ocn Fioi_swpen_vdf via fcopy
mapping ice->ocn Fioi_swpen_idr via fcopy
mapping ice->ocn Fioi_swpen_idf via fcopy
mapping ice->ocn Fioi_melth via fcopy
mapping ice->ocn Fioi_taux via fcopy
mapping ice->ocn Fioi_tauy via fcopy
mapping ice->ocn Fioi_meltw via fcopy
mapping ice->ocn Fioi_salt via fcopy
mapping rof->ocn Forr_rofl via rof2ocn_liq with none normalization
mapping rof->ocn Forr_rofi via rof2ocn_ice with none normalization
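If it helps to audit these mappings programmatically, the `mapping` lines have a regular structure that is easy to parse (`parse_mapping` is a hypothetical helper, based only on the log excerpt above):

```python
import re

# Parse a CMEPS med.log "mapping" line into (src, dst, field, method).
# Hypothetical helper based on the log excerpt above; the trailing
# "with ... normalization" suffix is ignored.
MAPPING_RE = re.compile(r"\s*mapping\s+(\w+)->(\w+)\s+(\w+)\s+via\s+(\w+)")

def parse_mapping(line):
    m = MAPPING_RE.match(line)
    return m.groups() if m else None

assert parse_mapping(" mapping atm->ocn Faxa_swnet via consf with one normalization") == \
    ("atm", "ocn", "Faxa_swnet", "consf")
assert parse_mapping(" mapping ocn->ice So_t via fcopy") == ("ocn", "ice", "So_t", "fcopy")
```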
This seems odd - I expected some regridding from the C grid to the B grid:
mapping ocn->ice So_u via fcopy
mapping ocn->ice So_v via fcopy
@aekiss annoyingly, CMEPS only supports one grid/mesh per component. For the UM cap we only export fields on the density/pressure points, and the mapping from the velocity points to the density points happens inside the cap. Obviously this is a little suboptimal, with some fields going UM v points -> UM p points -> MOM p points -> MOM v points. I think the MOM and CICE caps must be doing the same thing, although I haven't found this in the code.
Thanks for clarifying. That is consistent with what I understood, that the MOM6-CICE6 coupling takes place on the A grid. I had the impression that work is underway to support direct MOM6-CICE6 coupling on the C grid, but that would involve supporting more than one grid per component.
Info on bitwise reproducibility in MOM6: https://github.com/NOAA-GFDL/MOM6-examples/wiki/Developers-guide#debugging
The CMEPS driver `CMEPS/CMEPS/cesm/driver/esm.F90` apparently supports reproducible summation, implemented in `share/CESM_share/src/shr_reprosum_mod.F90`.
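The idea behind reproducible sums is to make a global sum independent of the domain decomposition and of the order in which partial contributions arrive. `shr_reprosum_mod` itself uses a fixed-point integer-vector method; the sketch below shows only the simpler fixed-order idea, with made-up numbers:

```python
# Naive accumulation depends on term order; summing contributions in a
# canonical global-index order removes that dependence. Made-up data.
def naive_sum(values):
    s = 0.0
    for v in values:
        s += v
    return s

def fixed_order_sum(contribs):
    # contribs: {global index: value}, possibly assembled from different
    # ranks in different orders on different runs
    return naive_sum(contribs[i] for i in sorted(contribs))

run1 = {0: 1e16, 1: 1.0, 2: -1e16}   # one assembly order
run2 = {0: 1e16, 2: -1e16, 1: 1.0}   # same data, different arrival order

assert naive_sum(run1.values()) != naive_sum(run2.values())   # order-sensitive
assert fixed_order_sum(run1) == fixed_order_sum(run2)          # reproducible
```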
In CICE6 we use `kvp=1` (EVP rheology), which is the only dynamics option that's bit-for-bit reproducible.
CICE6 supports reproducible sums depending on the setting of `bfbflag`, but this only affects global diagnostics written to the CICE log file, not the prognostic variables, which are bit-for-bit identical with any `bfbflag`. We use `bfbflag = "off"`, so global diagnostics won't be reproducible.
@kieranricardo following up on your earlier comment, my `nuopc.runseq` and `nuopc.runconfig` are in the `access_exe` branch here.
@kieranricardo well I'm stumped. It's not reproducible with 1 CPU for all components. I expect I'm missing something obvious...
Is there a NUOPC option to save all the coupling fields to a file (preferably on both sides of the interpolation)?
Dave added this to CM2 and it's been very useful.
@aekiss what?! That's bizarre.... Could MOM or CICE be using threads anywhere? I'll have a closer look and see if CMEPS or CDEPS is.
@aekiss can you run for less than one coupling time step (not 100% sure if this possible) just to verify that it's the coupling causing the issues? Hopefully that'll be reproducible.
It might also be worth logging the number of OMP threads, if something is setting them > 1 then CICE at least will be parallel which might make the cap non-reproducible.
@kieranricardo only 1 thread is being used
$ grep OMP_ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_9/output000/env.yaml
OMP_NUM_THREADS: '1'
not sure if we can run for less than 1 coupling timestep, but I'll look into it
I tried running with this `nuopc.runseq` in the hope that it would be an uncoupled run, but it aborted after apparently initialising all components:
runSeq::
@3600
ICE
ROF
OCN
ATM
@
::
Thanks for the suggestion @MartinDix - I enabled writing some ATM->MED coupler output every timestep with this commit.
The resulting files
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3-26da3bf/output000/GMOM_JRA.cpl.hx.atm.1step.avrg.0001-01-01-03600.nc
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3-26da3bf/output000/GMOM_JRA.cpl.hx.atm.1step.inst.0001-01-01-03600.nc
span the full single-precision range and look like nonsense (though not random), e.g.
(I've restricted the range to [-1, 1] for plotting purposes)
Is this some sort of type conversion error?
Or uninitialised arrays? Though it doesn't look random enough (similar patterns appear for all variables and time steps), and I'm using the debug executable `/g/data/ik11/inputs/access-om3/bin/access-om3-MOM6-CICE6-Debug-ce8d88e`, which I'd hope would prevent access to uninitialised memory.
In any case, these files are identical when I re-run, so they aren't related to the non-reproducibility.
It does seem to be some sort of 64/32 bit mixup. I get some plausible values by treating pairs of 32 bit values as 64 bit.
atmImp_lon[0,0,:]
[0. , 3.8515625, 0. , 1.984375 , 0. , 2.21875 ,
0. , 2.3515625, 0. , 2.46875 , 0. , 2.5429688,
.....
0. , 3.5905762, 0. , 3.5942383, 0. , 3.5979004],
After conversion
[360. , 1.875, 3.75 , 5.625, 7.5 , 9.375, 11.25 ,
...
354.375, 356.25 , 358.125])
Fields also look reasonable, though there's now only half the grid. This is plausibly some sort of surface SW radiation.
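Martin's pairs-of-32-bit-values observation can be reproduced directly by reinterpreting bytes with the stdlib `struct` module. The 360.0 below matches the first recovered longitude value above, but this is a toy demonstration of the byte-level relationship, not the actual file contents:

```python
import struct

# The same 8 bytes, read as one little-endian float64 or as two float32s.
# 360.0 as a double reinterprets to the pair (0.0, 3.8515625) - exactly the
# kind of pattern seen in the garbled output above.
raw = struct.pack("<d", 360.0)
lo, hi = struct.unpack("<2f", raw)
assert (lo, hi) == (0.0, 3.8515625)

# Re-packing the two float32s recovers the original double, i.e. treating
# pairs of 32-bit values as 64-bit undoes the mixup:
assert struct.unpack("<d", struct.pack("<2f", lo, hi))[0] == 360.0
```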
Thanks @MartinDix, that looks a whole lot better! It's `swvdr`, which sounds (and looks) like downwelling shortwave.
I split this off as a separate issue https://github.com/COSIMA/access-om3/issues/44
@micaeljtoliveira this notebook compares the restarts from these 1-CPU, 2-timestep runs and plots the differing fields at the surface at the final time - see renders here; the difference scales are 10% of the full range.
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_9
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_10
The CICE differences seem confined to the Arctic, e.g.
whereas the MOM differences are in a bunch of scattered patches, e.g.
Coupler field differences look similar to the source model component (CICE or MOM).
The script also provides a table of max and min differences and relative differences for each field. Many of the relative differences are very small, but some are very large, especially considering it's only 2 timesteps (e.g. a factor of over 100 for `stressm_2`, but maybe that's due to ice being present in one case and absent in the other, as the figure of `stressm_2` above generally doesn't look that dramatic).
At least the plots show that the gross data corruption in the mediator diagnostics is diagnostic-only and apparently not affecting the run itself.
One-day runs with Kieran's um-mom-cice configuration are reproducible.
Ah, interesting! Thanks for checking. Does that narrow the problem down to DATM, DROF and/or their coupling to the mediator?
It's possibly relevant that these are the only components of our MOM6-CICE6 config that involve regridding to a different resolution (though you're doing this with the UM too).
@micaeljtoliveira oops, correction - it was the mediator diagnostics I saved every timestep, but I actually only saved restarts at the end of the 2nd timestep. To save them every timestep, set `restart_n = 1` in `nuopc.runconfig`.
Suggestions from today's TWG
A warm start also yields runs that are not bitwise reproducible.
Setting the `-qno-opt-dynamic-align -fp-model=precise` flags explicitly does not solve the problem.
I think @penguian mentioned a third flag we should try?
I was able to get identical restarts from two 2-time-step runs using a new executable built entirely with CIME. I'm not sure how I could muck this up, but it's always possible with me, so it would be good if someone could try to replicate this. The new executable should be usable by anyone in `tm70`. The config is here:
https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36
and the restarts are here:
diff -r /scratch/tm70/ds0092/access-om3/archive/MOM6-CICE6-0/restart000 /scratch/tm70/ds0092/access-om3/archive/MOM6-CICE6-1/restart000
which returns nothing
@dougiesquire That is actually very good news, as it means the issue is very likely in ESMF. I'll try using the same ESMF build as you and see what happens.
(I tried building the executable to `/g/data/ik11/inputs/cime/bin/MOM6-CICE6` but I don't have permission)
@dougiesquire you should have write access to `/g/data/ik11/inputs/cime` now.
While we're at it, do you want write access to all of `/g/data/ik11/`?
> While we're at it, do you want write access to all of `/g/data/ik11/`?
Sure, that might be helpful in the near future.
I've rebuilt the CIME executable to /g/data/ik11/inputs/cime/bin/MOM6-CICE6/2023-07-13
and updated the path in https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36 accordingly (UPDATE use this commit if trying to do the repro test: https://github.com/dougiesquire/MOM6-CICE6/tree/db12aefdd9dfaac283abb1d0cf3c9cf517005ae5)
OK - you'll need to apply for the `ik11_w` subgroup of `ik11` on mancini.
A quick inspection of the differences between the Spack-built ESMF and the ESMF built by Martin shows that the latter was built in debug mode, while the former was built with optimizations. In practice, the latter sets `-g`, which implies `-O0`, while the former sets `-O` (which is equivalent to `-O2`). This is the main difference. Other differences include the use of internal vs external LAPACK and internal vs external PIO.
I'm struggling to use the same ESMF as the CIME build, as the netCDF version used there differs from the one used by the other dependencies built with Spack, so instead I'm recompiling ESMF with Spack in debug mode.
Okay, so that's confirmed: compiling ESMF in debug mode leads to reproducible runs.
I'm not sure how critical ESMF is for performance, but it might be worth finding out which optimization level can be safely used to compile it.
I was also able to get reproducible runs using 48 cores
The production executable generated with CMake is also bit-wise reproducible :tada:
Note, I was also able to get reproducible runs without these set (with debug ESMF).
> I was also able to get reproducible runs using 48 cores
I can confirm both.
The current MOM6-CICE6 config (and presumably others) is not reproducible - compare these