COSIMA / access-om3

ACCESS-OM3 global ocean-sea ice-wave coupled model

not bitwise reproducible #40

Closed aekiss closed 1 year ago

aekiss commented 1 year ago

The current MOM6-CICE6 config (and presumably others) is not reproducible - compare these

/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2
aekiss commented 1 year ago

For strict bit-for-bit reproducibility srcTermProcessing=1 and termOrder=srcseq are required in nuopc.runseq. See details here and here.

_[edit: setting this in nuopc.runseq actually isn't necessary for reproducibility]_

aekiss commented 1 year ago

These runs have identical initial conditions (cold start), identical inputs, parameters and executables:

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/output000/manifests/ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/output000/manifests/
$

but the resulting restarts for cice, coupler and mom6 differ (whereas the datm and drof restarts are identical):

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_1/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_2/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ
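
For anyone wanting to script the `diff -r` check above (e.g. for regression testing), here is a minimal sketch in Python using only the standard library. The function names are made up for illustration. It only tests whether files are byte-identical: that is sufficient to show two runs reproduce, but files could in principle differ in metadata while holding the same data, so a mismatch still needs a per-variable look.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def differing_files(dir_a, dir_b):
    """List relative paths that differ in content or exist in only one tree."""
    dir_a, dir_b = Path(dir_a), Path(dir_b)
    files_a = {p.relative_to(dir_a) for p in dir_a.rglob("*") if p.is_file()}
    files_b = {p.relative_to(dir_b) for p in dir_b.rglob("*") if p.is_file()}
    changed = {
        rel for rel in files_a & files_b
        if file_digest(dir_a / rel) != file_digest(dir_b / rel)
    }
    only_in_one = files_a ^ files_b
    return sorted(str(p) for p in changed | only_in_one)
```

An empty list from `differing_files(restart_dir_1, restart_dir_2)` corresponds to `diff -r` printing nothing.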
aekiss commented 1 year ago

details on which individual variables differ are here

aekiss commented 1 year ago

repro_test_3 and repro_test_4 confirm lack of reproducibility with latest debug build /g/data/ik11/inputs/access-om3/bin/access-om3-MOM6-CICE6-Debug-ce8d88e

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_3/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_4/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ
aekiss commented 1 year ago

repro_test_5 and repro_test_6 also don't reproduce, despite using srcTermProcessing=1:termOrder=srcseq in nuopc.runseq as described here, which is supposed to provide the strictest bit-for-bit reproducibility in remapping.

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000 /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.cice.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.cpl.r.0001-01-02-00000.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_5/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/restart000/GMOM_JRA.mom6.r.0001-01-02-00000.nc differ

I'm not sure what to try next.

aekiss commented 1 year ago

Bit-for-bit reproducibility is discussed in ESMF docs:

and is relevant to many subroutines, e.g.

etc etc - search for "bit-for-bit"

aekiss commented 1 year ago

Couldn't run for 1 timestep with these settings in nuopc.runconfig

     restart_n = 1
     restart_option = nsteps
...
     stop_n = 1
     stop_option = nsteps

due to a segmentation fault.

[gadi-cpu-clx-0432:475851:0:475851] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6630)
==== backtrace (tid: 475877) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000001bb12ee mom_mp_mom_state_is_synchronized_()  /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/src/core/MOM.F90:3830
 2 0x0000000001a7965d mom_ocean_model_nuopc_mp_ocean_model_restart_()  /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/config_src/drivers/nuopc_cap/mom_ocean_model_nuopc.F90:723
 3 0x0000000001a0b784 mom_cap_mod_mp_modeladvance_()  /g/data/v45/aek156/access-om3-build/access-om3/MOM6/MOM6/config_src/drivers/nuopc_cap/mom_cap.F90:1690
 4 0x0000000000fd0938 ESMCI::MethodElement::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
 5 0x0000000000fd089a ESMCI::MethodTable::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
 6 0x0000000000fcf462 c_esmc_methodtableexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
 7 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
 8 0x00000000069db2cd nuopc_modelbase_mp_routine_run_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:2220
 9 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
10 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
11 0x0000000000d565aa ESMCI::VMK::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
12 0x0000000001117c72 ESMCI::VM::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
13 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
14 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
15 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
16 0x0000000000695ea7 nuopc_driver_mp_routine_executegridcomp_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
17 0x00000000006956fc nuopc_driver_mp_executerunsequence_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
18 0x0000000000fd0938 ESMCI::MethodElement::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
19 0x0000000000fd089a ESMCI::MethodTable::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
20 0x0000000000fcf462 c_esmc_methodtableexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
21 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
22 0x0000000000692052 nuopc_driver_mp_routine_run_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
23 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
24 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
25 0x0000000000d565aa ESMCI::VMK::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
26 0x0000000001117c72 ESMCI::VM::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
27 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
28 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
29 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
30 0x0000000000695ea7 nuopc_driver_mp_routine_executegridcomp_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3329
31 0x00000000006956fc nuopc_driver_mp_executerunsequence_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3622
32 0x0000000000fd0938 ESMCI::MethodElement::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
33 0x0000000000fd089a ESMCI::MethodTable::execute()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
34 0x0000000000fcf462 c_esmc_methodtableexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:317
35 0x00000000007be7e2 esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1287
36 0x0000000000692052 nuopc_driver_mp_routine_run_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:3250
37 0x00000000007ccd66 ESMCI::FTable::callVFuncPtr()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
38 0x00000000007d0e6f ESMCI_FTableCallEntryPointVMHop()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
39 0x0000000000d565aa ESMCI::VMK::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2318
40 0x0000000001117c72 ESMCI::VM::enter()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
41 0x00000000007ce1ea c_esmc_ftablecallentrypointvm_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
42 0x000000000070d81d esmf_compmod_mp_esmf_compexecute_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1222
43 0x00000000009e2e71 esmf_gridcompmod_mp_esmf_gridcomprun_()  /jobfs/84887491.gadi-pbs/mo1833/spack-stage/spack-stage-esmf-8.3.1-ptrivi22nfb7u2yrt6xynvqtnnve42yf/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1891
44 0x0000000000431bb4 MAIN__()  /g/data/v45/aek156/access-om3-build/access-om3/CMEPS/CMEPS/cesm/driver/esmApp.F90:141
45 0x0000000000430d62 main()  ???:0
46 0x000000000003ad85 __libc_start_main()  ???:0
47 0x0000000000430c6e _start()  ???:0
=================================
aekiss commented 1 year ago

I was able to do a 2-timestep run. The same 3 restarts still differ:

$ diff -r /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.cice.r.0001-01-01-07200.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.cice.r.0001-01-01-07200.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.cpl.r.0001-01-01-07200.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.cpl.r.0001-01-01-07200.nc differ
Binary files /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_7/restart000/GMOM_JRA.mom6.r.0001-01-01-00000.nc and /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_8/restart000/GMOM_JRA.mom6.r.0001-01-01-00000.nc differ
kieranricardo commented 1 year ago

@aekiss I haven't checked my AMIP runs but I'd be very surprised if they were reproducible! Can you share your nuopc.runseq and nuopc.runconfig?

I think the nuopc.runseq flags only affect the transfer of data between the components and the mediator. These are just copying data (no regridding involved) so should be bit reproducible anyway. All the actual regridding is happening in CMEPS so we'd have to convince CMEPS to do bit reproducible regridding. Not sure if this is possible or not, I'll have a bit of a dig around later today.

aekiss commented 1 year ago

Thanks @kieranricardo, I hadn't realised that about the nuopc.runseq flags. Having bit reproducibility is really important, e.g. so we can re-run sections of an experiment with different outputs or do regression testing.

aekiss commented 1 year ago

Although MOM6 and CICE6 have the same grid dimensions (lon x lat = 320x384), regridding is needed because MOM6 is C-grid and we are using B-grid CICE6 (at present). The JRA55 data stream has different grid dimensions (640x320) so is also regridded.

When we switch to C-grid CICE6 there will be no need to regrid to couple with MOM6, so any reproducibility issues there should disappear.

aekiss commented 1 year ago

The mediator log med.log shows the regridding method used for each field:

$ grep '^ mapping' /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_6/output000/log/med.log
 mapping atm->ocn Sa_u via patch_uv3d with one normalization
 mapping atm->ocn Sa_v via patch_uv3d with one normalization
 mapping atm->ocn Sa_z via bilnr with one normalization
 mapping atm->ocn Sa_tbot via bilnr with one normalization
 mapping atm->ocn Sa_pbot via bilnr with one normalization
 mapping atm->ocn Sa_shum via bilnr with one normalization
 mapping atm->ocn Sa_ptem via bilnr with one normalization
 mapping atm->ocn Sa_dens via bilnr with one normalization
 mapping atm->ocn Faxa_swnet via consf with one normalization
 mapping atm->ocn Faxa_rainc via consf with one normalization
 mapping atm->ocn Faxa_rainl via consf with one normalization
 mapping atm->ocn Faxa_snowc via consf with one normalization
 mapping atm->ocn Faxa_snowl via consf with one normalization
 mapping atm->ocn Faxa_lwdn via consf with one normalization
 mapping atm->ocn Faxa_swndr via consf with one normalization
 mapping atm->ocn Faxa_swvdr via consf with one normalization
 mapping atm->ocn Faxa_swndf via consf with one normalization
 mapping atm->ocn Faxa_swvdf via consf with one normalization
 mapping atm->ocn Sa_pslv via bilnr with one normalization
 mapping atm->ice Sa_u via patch_uv3d with one normalization
 mapping atm->ice Sa_v via patch_uv3d with one normalization
 mapping atm->ice Sa_z via bilnr with one normalization
 mapping atm->ice Sa_tbot via bilnr with one normalization
 mapping atm->ice Sa_pbot via bilnr with one normalization
 mapping atm->ice Sa_shum via bilnr with one normalization
 mapping atm->ice Sa_ptem via bilnr with one normalization
 mapping atm->ice Sa_dens via bilnr with one normalization
 mapping atm->ice Faxa_swnet via consf with one normalization
 mapping atm->ice Faxa_rainc via consf with one normalization
 mapping atm->ice Faxa_rainl via consf with one normalization
 mapping atm->ice Faxa_snowc via consf with one normalization
 mapping atm->ice Faxa_snowl via consf with one normalization
 mapping atm->ice Faxa_lwdn via consf with one normalization
 mapping atm->ice Faxa_swndr via consf with one normalization
 mapping atm->ice Faxa_swvdr via consf with one normalization
 mapping atm->ice Faxa_swndf via consf with one normalization
 mapping atm->ice Faxa_swvdf via consf with one normalization
 mapping atm->ice Faxa_bcph via consf with one normalization
 mapping atm->ice Faxa_dstwet via consf with one normalization
 mapping atm->ice Faxa_dstdry via consf with one normalization
 mapping atm->ice Sa_pslv via bilnr with one normalization
 mapping ocn->ice So_omask via fcopy
 mapping ocn->ice So_t via fcopy
 mapping ocn->ice So_s via fcopy
 mapping ocn->ice So_u via fcopy
 mapping ocn->ice So_v via fcopy
 mapping ocn->ice So_dhdx via fcopy
 mapping ocn->ice So_dhdy via fcopy
 mapping ocn->ice Fioo_q via fcopy
 mapping ice->ocn Faii_swnet via fcopy
 mapping ice->ocn Si_ifrac via fcopy
 mapping ice->ocn Fioi_swpen via fcopy
 mapping ice->ocn Fioi_swpen_vdr via fcopy
 mapping ice->ocn Fioi_swpen_vdf via fcopy
 mapping ice->ocn Fioi_swpen_idr via fcopy
 mapping ice->ocn Fioi_swpen_idf via fcopy
 mapping ice->ocn Fioi_melth via fcopy
 mapping ice->ocn Fioi_taux via fcopy
 mapping ice->ocn Fioi_tauy via fcopy
 mapping ice->ocn Fioi_meltw via fcopy
 mapping ice->ocn Fioi_salt via fcopy
 mapping rof->ocn Forr_rofl via rof2ocn_liq with none normalization
 mapping rof->ocn Forr_rofi via rof2ocn_ice with none normalization

this seems odd - I expected some regridding from C to B grid

 mapping ocn->ice So_u via fcopy
 mapping ocn->ice So_v via fcopy
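
A quick way to summarise the mapping lines above is to tally the method per component pair. This is just a sketch: the regular expression is my guess at the line format based on the `med.log` excerpt, and `tally_mappings` is a hypothetical helper, not anything from CMEPS.

```python
import re
from collections import Counter

# Matches lines such as " mapping atm->ocn Sa_u via patch_uv3d with one normalization"
MAPPING_RE = re.compile(r"^\s*mapping\s+(\w+)->(\w+)\s+(\S+)\s+via\s+(\S+)")


def tally_mappings(lines):
    """Count fields per (source, destination, mapping method)."""
    counts = Counter()
    for line in lines:
        m = MAPPING_RE.match(line)
        if m:
            src, dst, _field, method = m.groups()
            counts[(src, dst, method)] += 1
    return counts
```

Feeding it the lines of `med.log` (e.g. `tally_mappings(open("med.log"))`) should make it obvious that every ocn->ice and ice->ocn field goes via `fcopy`, while the atm fields use `bilnr`, `consf` or `patch_uv3d`.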
kieranricardo commented 1 year ago

@aekiss annoyingly CMEPS only supports one grid/mesh per component. For the UM cap we only export fields on the density/pressure points, and the mapping from the velocity points to the density points happens inside the cap. Obviously this is a little suboptimal with some fields going UM v points -> UM p points -> MOM p points -> MOM v points. I think the MOM and CICE caps must be doing the same thing although I haven't found this in the code.

aekiss commented 1 year ago

Thanks for clarifying. That is consistent with what I understood, that the MOM6-CICE6 coupling takes place on the A grid. I had the impression that work is underway to support direct MOM6-CICE6 coupling on the C grid, but that would involve supporting more than one grid per component.

aekiss commented 1 year ago

Info on bitwise reproducibility in MOM6: https://github.com/NOAA-GFDL/MOM6-examples/wiki/Developers-guide#debugging

aekiss commented 1 year ago

The CMEPS driver CMEPS/CMEPS/cesm/driver/esm.F90 apparently supports reproducible summation implemented in share/CESM_share/src/shr_reprosum_mod.F90.
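
For context on why summation needs special treatment at all: floating-point addition is not associative, so the order in which terms are accumulated (which can change with domain decomposition or compiler optimisation) changes the last bits of the result. A minimal Python illustration follows. Note that `math.fsum` is only a stand-in for an order-independent sum here; it is not the algorithm used in `shr_reprosum_mod.F90`.

```python
import math


def naive_sum(values):
    """Accumulate left to right, rounding after every addition."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = naive_sum(values)
backward = naive_sum(reversed(values))
print(forward == backward)  # False: the rounding differs with the order

# math.fsum tracks exact partial sums, so the result does not depend on order.
print(math.fsum(values) == math.fsum(reversed(values)))  # True
```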

aekiss commented 1 year ago

In CICE6 we use kvp=1 (EVP rheology), which is the only dynamics option that's bit-for-bit reproducible.

CICE6 supports reproducible sums depending on the setting of bfbflag, but this only affects global diagnostics written to the CICE log file, not the prognostic variables which are bit-for-bit identical with any bfbflag. We use bfbflag = "off", so global diagnostics won't be reproducible.

aekiss commented 1 year ago

@kieranricardo following up on your earlier comment, my nuopc.runseq and nuopc.runconfig are in the access_exe branch here.

aekiss commented 1 year ago

@kieranricardo well I'm stumped. It's not reproducible with 1 CPU for all components. I expect I'm missing something obvious...

MartinDix commented 1 year ago

Is there a NUOPC option to save all the coupling fields to a file (preferably both sides of the interpolation)?

Dave added this to CM2 and it's been very useful.

kieranricardo commented 1 year ago

@aekiss what?! That's bizarre.... Could MOM or CICE be using threads anywhere? I'll have a closer look and see if CMEPS or CDEPS is.

kieranricardo commented 1 year ago

@aekiss can you run for less than one coupling time step (not 100% sure if this is possible) just to verify that it's the coupling causing the issues? Hopefully that'll be reproducible.

It might also be worth logging the number of OMP threads, if something is setting them > 1 then CICE at least will be parallel which might make the cap non-reproducible.

aekiss commented 1 year ago

@kieranricardo only 1 thread is being used

$ grep OMP_ /scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_9/output000/env.yaml
OMP_NUM_THREADS: '1'

not sure if we can run for less than 1 coupling timestep, but I'll look into it

aekiss commented 1 year ago

I tried running with this nuopc.runseq in the hope that it would be an uncoupled run, but it aborted after apparently initialising all components

runSeq::
@3600
  ICE
  ROF
  OCN
  ATM
@
::
aekiss commented 1 year ago

Thanks for the suggestion @MartinDix - I enabled writing some ATM->MED coupler output every timestep with this commit.

The resulting files

/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3-26da3bf/output000/GMOM_JRA.cpl.hx.atm.1step.avrg.0001-01-01-03600.nc
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3-26da3bf/output000/GMOM_JRA.cpl.hx.atm.1step.inst.0001-01-01-03600.nc

span the full single-precision range and look like nonsense (though not random), e.g.

[screenshot] (I've restricted the range to [-1, 1] for plotting purposes)

Is this some sort of type conversion error?

Or uninitialised arrays? Though it doesn't look random enough (similar patterns appear for all variables and time steps) and I'm using the debug executable /g/data/ik11/inputs/access-om3/bin/access-om3-MOM6-CICE6-Debug-ce8d88e which I'd hope would prevent access to uninitialised memory.

aekiss commented 1 year ago

In any case, these files are identical when I re-run, so they aren't related to the non-reproducibility.

MartinDix commented 1 year ago

It does seem to be some sort of 64/32 bit mixup. I get some plausible values by treating pairs of 32 bit values as 64 bit.

atmImp_lon[0,0,:]
[0.       , 3.8515625, 0.       , 1.984375 , 0.       , 2.21875  ,
 0.       , 2.3515625, 0.       , 2.46875  , 0.       , 2.5429688,
.....
0.       , 3.5905762, 0.       , 3.5942383, 0.       , 3.5979004],

After conversion

[360.   ,   1.875,   3.75 ,   5.625,   7.5  ,   9.375,  11.25 ,
...
 354.375, 356.25 , 358.125])

Fields also look reasonable, though there's now only half the grid. This is plausibly some sort of surface SW radiation.

atmImp_Faxa_swndr
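
The 64/32-bit reinterpretation described above can be reproduced with the standard library alone. This is a sketch assuming little-endian byte order; the helper names are made up. It shows that 64-bit longitudes read as 32-bit floats give exactly the `0., 3.8515625, 0., 1.984375, ...` pattern, and why only half as many values remain after conversion.

```python
import struct


def doubles_as_float_pairs(doubles):
    """Reinterpret the bytes of 64-bit floats as twice as many 32-bit floats."""
    raw = struct.pack(f"<{len(doubles)}d", *doubles)
    return list(struct.unpack(f"<{2 * len(doubles)}f", raw))


def float_pairs_as_doubles(floats):
    """Inverse: reinterpret consecutive pairs of 32-bit floats as 64-bit floats."""
    raw = struct.pack(f"<{len(floats)}f", *floats)
    return list(struct.unpack(f"<{len(floats) // 2}d", raw))


garbled = doubles_as_float_pairs([360.0, 1.875, 3.75])
print(garbled)                          # [0.0, 3.8515625, 0.0, 1.984375, 0.0, 2.21875]
print(float_pairs_as_doubles(garbled))  # [360.0, 1.875, 3.75]
```

With NumPy the same conversion on a whole array would normally be done with a dtype view rather than `struct`, but the byte-level idea is the same.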

aekiss commented 1 year ago

Thanks @MartinDix, that looks a whole lot better! It's swvdr which sounds (and looks) like downwelling shortwave

aekiss commented 1 year ago

I split this off as a separate issue https://github.com/COSIMA/access-om3/issues/44

aekiss commented 1 year ago

@micaeljtoliveira this notebook compares the restarts from these 1-CPU, 2-timestep runs and plots the differing fields at the surface at the final time - see renders here; the difference scales are 10% of the full range.

/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_9
/scratch/v45/aek156/access-om3/archive/MOM6-CICE6_ACCESS-OM3_repro_test_10

The CICE differences seem confined to the Arctic, e.g. [figures: download-1, download-2]

whereas the MOM differences are in a bunch of scattered patches, e.g. [figure: download]

Coupler field differences look similar to the source model component (CICE or MOM).

The script also provides a table of max and min differences and relative differences for each field. Many of the relative differences are very small, but some are very large, especially considering it's only 2 timesteps (e.g. a factor of over 100 for stressm_2, but maybe that's due to ice being present in one case and absent in the other, as the fig of stressm_2 above generally doesn't look that dramatic).
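
To illustrate how a relative difference can blow up when a field is near zero in one run, here is a rough sketch of such a comparison. The definition used (difference relative to the first run's value) is an assumption for illustration and may not match what the notebook actually computes.

```python
def diff_stats(a, b):
    """Return (max absolute difference, max difference relative to a)."""
    max_abs = 0.0
    max_rel = 0.0
    for x, y in zip(a, b):
        d = abs(x - y)
        max_abs = max(max_abs, d)
        if x != 0.0:
            # A tiny |x| makes this ratio large even when d itself is modest.
            max_rel = max(max_rel, d / abs(x))
    return max_abs, max_rel
```

For example, `diff_stats([1.0, 1e-3], [1.0, 0.2])` gives a maximum absolute difference of about 0.2 but a relative difference of roughly 200, which is the kind of situation you'd get if ice is nearly absent in one run and present in the other.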

aekiss commented 1 year ago

At least the plots show that the gross data corruption in the mediator diagnostics is diagnostic-only and apparently not affecting the run itself.

MartinDix commented 1 year ago

One day runs with Kieran's um-mom-cice configuration are reproducible.

aekiss commented 1 year ago

Ah, interesting! Thanks for checking. Does that narrow the problem down to DATM, DROF and/or their coupling to the mediator?

It's possibly relevant that these are the only components of our MOM6-CICE6 config that involve regridding to a different resolution (though you're doing this with the UM too).

aekiss commented 1 year ago

@micaeljtoliveira oops, correction - it was the mediator diagnostics I saved every timestep, but I actually only saved restarts at the end of the 2nd timestep. To save them every timestep, set restart_n = 1 in nuopc.runconfig.

aekiss commented 1 year ago

Suggestions from today's TWG

micaeljtoliveira commented 1 year ago

Warm start also yields non-bitwise reproducible runs.

micaeljtoliveira commented 1 year ago

Setting explicitly the -qno-opt-dynamic-align -fp-model=precise flags does not solve the problem.

aekiss commented 1 year ago

I think @penguian mentioned a third flag we should try?

dougiesquire commented 1 year ago

I was able to get identical restarts from two 2-time-step runs using a new executable built entirely with CIME. I'm not sure how I could muck this up but it's always possible with me so it would be good if someone could try to replicate. The new executable should be usable by anyone on tm70. The config is here:

https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36

and the restarts are here:

diff -r /scratch/tm70/ds0092/access-om3/archive/MOM6-CICE6-0/restart000 /scratch/tm70/ds0092/access-om3/archive/MOM6-CICE6-1/restart000

which returns nothing

micaeljtoliveira commented 1 year ago

@dougiesquire That is actually very good news, as this means the issue is very likely in ESMF. I'll try using the same ESMF build as you and see what happens.

dougiesquire commented 1 year ago

(I tried building the executable to /g/data/ik11/inputs/cime/bin/MOM6-CICE6 but I don't have permission)

aekiss commented 1 year ago

@dougiesquire you should have write access to /g/data/ik11/inputs/cime now. While we're at it, do you want write access to all of /g/data/ik11/?

dougiesquire commented 1 year ago

> While we're at it, do you want write access to all of /g/data/ik11/?

Sure, that might be helpful in the near future.

I've rebuilt the CIME executable to /g/data/ik11/inputs/cime/bin/MOM6-CICE6/2023-07-13 and updated the path in https://github.com/dougiesquire/MOM6-CICE6/tree/om2_grid_iss36 accordingly (UPDATE use this commit if trying to do the repro test: https://github.com/dougiesquire/MOM6-CICE6/tree/db12aefdd9dfaac283abb1d0cf3c9cf517005ae5)

aekiss commented 1 year ago

ok - you'll need to apply for the ik11_w subgroup of ik11 on mancini

micaeljtoliveira commented 1 year ago

A quick inspection of the differences between the Spack-built ESMF and the ESMF built by Martin shows that the latter was built in debug mode, while the former was built with optimizations. In practice, the latter sets -g, which implies -O0, while the other one sets -O (which is equivalent to -O2). This is the main difference. Other differences include the use of internal vs external LAPACK and internal vs external PIO.

I'm struggling to use the same ESMF as the cime build, as the netCDF version used there is not the same as the one of the other dependencies built with Spack, so instead I'm recompiling ESMF with Spack in debug mode.

micaeljtoliveira commented 1 year ago

Okay, so that's confirmed: compiling ESMF in debug mode leads to reproducible runs.

I'm not sure how critical ESMF is for performance, but it might be worth finding out which optimization level can be safely used to compile it.

dougiesquire commented 1 year ago

> For strict bit-for-bit reproducibility srcTermProcessing=1 and termOrder=srcseq are required in nuopc.runseq. See details here and here.

Note, I was able to also get reproducible runs without these set (with debug ESMF)

dougiesquire commented 1 year ago

I was also able to get reproducible runs using 48 cores

micaeljtoliveira commented 1 year ago

The production executable generated with CMake is also bit-wise reproducible :tada:

micaeljtoliveira commented 1 year ago

> Note, I was able to also get reproducible runs without these set (with debug ESMF)

> I was also able to get reproducible runs using 48 cores

I can confirm both.