JMMP-Group / GO_coordination

Repository to coordinate the joint development of GOSI configurations

GOSI10 porting to NEMO5 #22

Open isabellaascione opened 5 months ago

isabellaascione commented 5 months ago
1. Testing GOSI10 with NEMO5 beta
2. Testing GOSI10 with NEMO5 release candidate
isabellaascione commented 5 months ago

Changes identified with NEMO5 beta release:

NEMO-ocean parameters

| parameter | status at NEMO5 beta | GOSI10 at NEMO5 beta | GOSI10 trial at NEMO 4.2.2 (for comparison with GOSI10 at NEMO5 beta) | GOSI10 at NEMO 4.2.2 (standard) |
| --- | --- | --- | --- | --- |
| **namrun** | | | | |
| ln_top | $${\color{green}new}$$ | TRUE (but used only with key_top) | not available | not available |
| ln_rst_eos | $${\color{gray}N/A}$$ (GOSI10 only) | TRUE (ported from GOSI10 package) | TRUE | TRUE |
| **namdom** | | | | |
| ln_linssh | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| ln_shuman | $${\color{green}new}$$ | FALSE | not available | not available |
| **namlbc** | | | | |
| ln_shlat2d | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | FALSE | TRUE |
| rn_shlat | | 0 | 0 | -9999 |
| **namsbc** | | | | |
| ln_dm2dc | $${\color{blue}preexisting}$$ | FALSE | FALSE | TRUE |
| **namsbc_blk** | | | | |
| ln_MFS | $${\color{green}new}$$ | FALSE | not available | not available |
| ln_prec_met | $${\color{green}new}$$ | FALSE | not available | not available |
| rn_Cd_ia | $${\color{orange}renamed}$$ | rn_Cd_ia=1.4e-3 | rn_Cd_i=1.4e-3 | rn_Cd_i=1.4e-3 |
| rn_Ce_ia | $${\color{orange}renamed}$$ | rn_Ce_ia=1.4e-3 | rn_Ce_i=1.4e-3 | rn_Ce_i=1.4e-3 |
| rn_Ch_ia | $${\color{orange}renamed}$$ | rn_Ch_ia=1.4e-3 | rn_Ch_i=1.4e-3 | rn_Ch_i=1.4e-3 |
| ln_Cx_ice_frm | $${\color{green}new}$$ | FALSE | not available | not available |
| nn_frm | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cs_io | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cs_ia | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cr_ia | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cr_io | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cf_ia | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cf_io | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| **namsbc_abl** | | | | |
| ln_pga_abl | $${\color{green}new}$$ | FALSE (only for ln_abl = T) | not available | not available |
| rn_vfac | $${\color{green}new}$$ | (only for ln_abl = T) | not available | not available |
| **namsbc_blk** | | | | |
| ln_crt_fbk | $${\color{blue}preexisting}$$ | TRUE | TRUE | FALSE |
| rn_stau_a | $${\color{blue}preexisting}$$ | -2.90E-03 | -2.90E-03 | not used |
| rn_stau_b | $${\color{blue}preexisting}$$ | 8.00E-03 | 8.00E-03 | not used |
| rn_vfac | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | 0 | 1 |
| **namtra_qsr** | | | | |
| nn_chldta | $${\color{blue}preexisting}$$ | 1 | 1 | 0 |
| rn_chl_conc | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | not used | 0.1 |
| **nameos** | | | | |
| rn_T0 | $${\color{green}new}$$ | 10 | hardcoded to 10 | hardcoded to 10 |
| rn_S0 | $${\color{green}new}$$ | 35 | hardcoded to 35 | hardcoded to 35 |
| **namtra_adv** | | | | |
| nn_fct_imp | $${\color{green}new}$$ | should be 2, but that does not work; set to 1 | not available | not available |
| **namtra_eiv** | | | | |
| ln_eke_equ | | | | |
| namldf_eke | $${\color{green}new}$$ | namelist with new parameters (not listed in this table); unused because nn_aei_ijk_t=21 | not available | not available |
| **namvvl** | | | | |
| nn_vvl_interp | $${\color{blue}preexisting}$$ | 2 | 2 | 0 |
| **namdyn_adv** | | | | |
| ln_dynadv_ubs | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| **namdyn_vor** | | | | |
| ln_dynvor_eeT | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| **namdyn_hpg** | | | | |
| ln_hpg_zps | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| **namdyn_spg** | | | | |
| ln_bt_av | $${\color{red}removed}$$ (now hardcoded to TRUE: ll_bt_av in dynspg_ts.F90) | N/A | TRUE | TRUE |
| rn_htau_scaling | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | ported | 1 | 1 |
| **nammpp** | | | | |
| ln_mppdelay | $${\color{green}new}$$ | FALSE | not available | not available |
| nn_hls | $${\color{blue}preexisting}$$ | 2 | 1 (should be 2, but does not have much impact) | 1 |
| **namzdf** | | | | |
| ln_zdfiwm | $${\color{blue}preexisting}$$ | TRUE | TRUE | FALSE |
| ln_zdftmx | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | FALSE | TRUE |
| rn_avm0 | $${\color{blue}preexisting}$$ | 1.40E-06 | 1.40E-06 | 1.20E-04 |
| rn_avt0 | $${\color{blue}preexisting}$$ | 1.00E-10 | 1.00E-10 | 1.20E-05 |

Sea-Ice parameters

| parameter | status in NEMO5 beta | GOSI10 at NEMO5 beta | GOSI10 trial at NEMO 4.2.2 (for comparison with GOSI10 at NEMO5 beta) | GOSI10 at NEMO 4.2.2 (standard) |
| --- | --- | --- | --- | --- |
| rn_cio | $${\color{orange}renamed}$$ | rn_Cd_io=1.0e-2 | rn_cio=1.0e-2 | rn_cio=1.0e-2 |
| ln_icedS | $${\color{red}removed}$$ | N/A | ln_icedS=.true. | ln_icedS=.true. |
| ln_flushing | $${\color{green}new}$$ | TRUE | not available as a namelist option, but included in the code | not available as a namelist option, but included in the code |
| ln_drainage | $${\color{green}new}$$ | TRUE | not available as a namelist option, but included in the code | not available as a namelist option, but included in the code |
| rn_sinew | $${\color{green}new}$$ | used when nn_icesal=2; initially set to 0.3 as per the recommendation for NEMO5 beta, then set to 0.25 as per the recommendation for NEMO5 RC. rn_sinew*sss_1d is used instead of rn_simax, so results will differ from 4.2.2 | N/A | N/A |
| rn_simax | $${\color{red}removed}$$ | N/A | 20 | 20 |
| rn_sal_himin | $${\color{green}new}$$ | 0 | hardcoded to 0 | hardcoded to 0 |
| ln_pnd_rain | $${\color{green}new}$$ | FALSE | N/A | N/A |
| rn_pnd_hl_min | $${\color{green}new}$$ | 0.005 | not available as a parameter, but hardcoded to 0.005 | not available as a parameter, but hardcoded to 0.005 |
| rn_pnd_hl_max | $${\color{green}new}$$ | 0.015 | not available as a parameter, but hardcoded to 0.015 | not available as a parameter, but hardcoded to 0.015 |
| rn_alb_lpnd | $${\color{green}new}$$ | set to 0.36; should behave the same as 4.2.2 | N/A | N/A |
| nn_liquidus | $${\color{green}new}$$ | 1 | N/A | N/A |
| nn_pnd_brsal | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A, use nn_liquidus instead | 0 | 0 |
isabellaascione commented 4 months ago

BUG

The parameter nn_fct_imp has been introduced with the 5-beta release. The default value is 1. To replicate the behaviour of 4.2.2, it should be set to 2. However, this option seems to be broken. I let Daley know about this, and an issue should be opened on the main NEMO repository.

update: Daley proposed a fix: https://forge.nemo-ocean.eu/nemo/nemo/-/issues/499
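
For context, nn_fct_imp lives in &namtra_adv. A minimal sketch of the setting used here (other entries omitted; the ln_traadv_fct line is only illustrative of the FCT scheme being active, values as described above):

```fortran
&namtra_adv        !   tracer advection (sketch only, other entries omitted)
   ln_traadv_fct = .true.   ! FCT advection scheme
   nn_fct_imp    = 1        ! 1 = default; 2 should replicate 4.2.2 behaviour but is currently broken
/
```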

dcalve commented 4 months ago

The r5282_nemo5rc_changes_and_fix_intel_xios MOCI branch should be used to run GOSI10 after it has been rebased onto the NEMO 5 release candidate (currently it is a branch of the beta).

This is required because the NEMO 5 release candidate has removed lib_cray.f90, necessitating an update to the fcm-make configurations in MOCI. The above branch contains these changes as well as those from r4684_fix_intel_xios, which we currently use.

Suites using GOSI10 based on the NEMO 5 release candidate will need the following update in app/fcm_make_ocean/rose-app.conf:

config_root_path=fcm:MOCI.xm/branches/dev/daleycalvert/r5282_nemo5rc_changes_and_fix_intel_xios
nemo_version_file=nemo5.0-version.cfg
isabellaascione commented 4 months ago

The NEMO 5 RC has been released.

Moving from NEMO5 beta to NEMO5 RC, the new reference namelist recommends using rn_sinew=0.25 (instead of 0.3, as in NEMO5 beta) when nn_icesal=2 is used.
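
For reference, a minimal sketch of the corresponding sea-ice salinity settings (namelist group name assumed from the SI3 reference namelist; other entries omitted):

```fortran
&namthd_sal        !   ice salinity (sketch only, other entries omitted)
   nn_icesal = 2        ! prognostic salinity scheme
   rn_sinew  = 0.25     ! NEMO5 RC recommendation (0.3 was recommended for the NEMO5 beta)
/
```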

Additionally, the following changes happened to the namelists:

- namhsb: namelist removed
  - ln_diahsb removed
- namdom
  - ln_crs removed
- namcfg
  - ln_use_jattr removed
- namcrs: namelist removed
  - nn_factx removed
  - nn_facty removed
  - nn_binref removed
  - ln_msh_crs removed
  - nn_crs_kz removed
  - ln_crs_wn removed
- namc1d: now includes the parameters from the namelists namc1d_dyndmp and namc1d_uvd, which were present in 5-beta; namc1d_dyndmp and namc1d_uvd no longer exist as namelists
- nam_tide
  - ln_tide_dia removed
- namtra_eiv
  - ln_ldfeiv_dia removed
- namflo: namelist removed
  - ln_floats removed
  - jpnfl removed
  - jpnnewflo removed
  - ln_rstflo removed
  - nn_writefl removed
  - nn_stockfl removed
  - ln_argo removed
  - ln_flork4 removed
  - ln_ariane removed
  - ln_flo_ascii removed
- nam_dia25h: namelist removed
  - ln_dia25h removed
- namnc4
  - ln_nc4zip default is .false.
isabellaascione commented 3 months ago

I tested GOSI10 p2.0 with ORCA025. In the run with NEMO 5 beta I was using rn_sinew=0.3 with nn_icesal=2, as recommended for the NEMO5 beta release. The run crashed after about 25 years, in July 2001, with errors when writing the variable sfxice.

Looking into the code, the nn_icesal=2 option is no longer coded as it was in NEMO 4.2.2. I did a 1-day test run and compared the variable sfxice with the reference at GOSI10 p2.0. The results (below, left picture) show that nn_icesal=2 with rn_sinew=0.3 produces different results from GOSI10 p2.0 (middle picture).

I backported the formulas used in NEMO 4.2.2 into the branch (https://forge.nemo-ocean.eu/devs/ukmo/gosi/gosi/-/commit/b5e4620c40b9fc064630b0007a22d58c5effa9b8). With this change, the nn_icesal=2 option no longer uses the parameterisation based on rn_sinew.

The results in the picture below show that the sfxice variable (right picture) is consistent with GOSI10 p2.0 (middle picture) when using the formulation from NEMO 4.2.2.

image

With NEMO5 RC the recommendation is now to use rn_sinew=0.25 with nn_icesal=2.

Emma also advised to try nn_icesal=4, which uses a default of rn_sinew=0.75 in the reference namelist.

Following the above, I set up the following 3 long runs:

| suite | description | outcome |
| --- | --- | --- |
| u-di347 | GOSI10pX.0 (based on NEMO5 RC), nn_icesal=2 and rn_sinew=0.25 | crashed shortly after 20 years, in April 1997, with error "NetCDF: Numeric conversion not representable Unable to write data given the location id: 65536 and the variable whose id: 58 and name: sfxice" |
| u-di581 | GOSI10pX.0 adapted to NEMO5 RC, using nn_icesal=4 and rn_sinew=0.75 | crashes on 19890601 with the same error as above |
| u-di568 | GOSI10pX.0 adapted to NEMO5 RC, but backporting the formulations used in NEMO 4.2.2; using nn_icesal=2 and re-introducing rn_simax=20.0 (rn_sinew is not used in this run) | crashes on 19890601 with the same error as above |
isabellaascione commented 3 months ago

@emmafiedler

@edblockley

@alex.west49

isabellaascione commented 2 months ago

NEMO system meeting on 04/09/2024

There was a discussion about the validation effort. The full minutes of the meeting can be found at (login required): https://forge.nemo-ocean.eu/nemo/nst/-/wikis/meetings/2024_09_4th

Notes for the validation efforts (from the minutes):

CNRS:

Ran ORCA1 for 120 years for v4.2 (MLF) and v5.0 (MLF and RK3). Found an issue with sea-ice quantities when using RK3: under investigation (eORCA1 monitoring here).

No particular issue found in coupled ocean-atmosphere global and regional configurations with NEMO v5.0 (but MLF vs RK3 was not checked).

Used XIOS3 following Andrew's instructions in the draft user guide (here), and it worked really well.

NOC:

ORCA025: no problems noticed in the first 5 years.

CMCC:

Did not notice any issues for the Black Sea & Mediterranean Sea regional configs during the first year of simulation; wave coupling will be validated soon.

UKMET:

isabellaascione commented 2 months ago

NEMO system meeting on 25/09/2024

The full meeting minutes are available at (login required):

https://forge.nemo-ocean.eu/nemo/nst/-/wikis/meetings/2024_09_25th

Notes related to validation efforts:

dcalve commented 1 month ago

Investigation into the stability issue

This is a summary of my investigation into the issue described in Isabella's comment above.

I was able to replicate the issue in an experiment without the GOSI10 code (so just using the NEMO 5.0 release candidate). The following findings are based on this configuration.

The issue

The following error:

NetCDF: Numeric conversion not representable Unable to write data given the location id: 65536 
and the variable whose id: 58 and name: sfxice

occurs when XIOS is unable to convert data to the precision required by the netCDF file when writing to it. This usually means that the data contains NaNs, infinite values, or very large/small values.

In this case, the reported diagnostic (sfxice) uses data from the SI3 salt flux sfx, which did indeed have very large values at some point during the cycle that crashed.

Source of the issue

The NEMO domain and north fold

The NEMO grid is split into two parts: the internal domain, where the model calculations are carried out, and the surrounding halo points, which hold copies of data from neighbouring subdomains or from across the model boundaries.

In a global ORCA configuration, the east and west boundaries are periodic, the south boundary is closed and the north boundary is periodic with a north fold. The following figure shows a schematic of the north boundary for the U grid on a T-point north fold:

image

In this figure, the internal domain is the area below the thick black line between rows jpj-1 and jpj-2. The halo points are copies of this data (compare cells with the same colours), mirrored around the central columns (jpiglo/2 and jpiglo/2+1) and across the top row of the internal domain (jpj-2) so that the sign is changed; this is the north fold.

This particular north fold type (mirrored along T grid points) means that the right half of the top row of the inner domain contains duplicate data from the left half; this is shown by the light grey shading. I will refer to this specific part of the north fold as the "internal north fold" below.

The internal north fold is not normally updated by lbc_lnk, i.e. the left half of row jpj-2 is not copied onto the right half, as it is part of the internal domain. This assumes that the results of the calculation will be identical on both halves of the row, but this isn't necessarily true. In fact, this point is acknowledged within the NEMO code itself:

! In theory we should exchange only nn_hls lines.
!
! However, some other points are duplicated in the north pole folding:
!  - c_NFtype='T', grid=T : half of the last line (jpiglo/2+2:jpiglo-nn_hls)
!  - c_NFtype='T', grid=U : half of the last line (jpiglo/2+1:jpiglo-nn_hls)
!  - c_NFtype='T', grid=V : all the last line nn_hls+1 and (nn_hls+2:jpiglo-nn_hls)
!  - c_NFtype='T', grid=F : all the last line (nn_hls+1:jpiglo-nn_hls)
!  - c_NFtype='F', grid=U : no points are duplicated
!  - c_NFtype='F', grid=V : half of the last line (jpiglo/2+1:jpiglo-nn_hls)
!  - c_NFtype='F', grid=F : half of the last line (jpiglo/2+1:jpiglo-nn_hls-1)
! The order of the calculations may differ for these duplicated points (as, for example jj+1 becomes jj-1)
! This explain why these duplicated points may have different values even if they are at the exact same location.
! In consequence, we may want to force the folding on these points by setting l_full_nf_update = .TRUE.
! This is slightly slower but necessary to avoid different values on identical grid points!!

This characteristic of the internal north fold is relevant in the next section, which describes the source of the issue.

Instabilities in sea ice velocity on the north fold

The large sfx values were tracked down to large sea ice velocities (~45 m/s) calculated by the SI3 EVP rheology scheme (icedyn_rhg_evp.F90) that consistently appeared on the internal north fold. By comparison, the corresponding points on the other side of the north fold had much more realistic velocities. These differences manifest quickly (within 120 solver substeps on the first timestep), and the velocities on the internal north fold grow over subsequent timesteps.

These differences can be eliminated by adding the ldfull=.TRUE. argument to the lbc_lnk calls for u_ice and v_ice in icedyn_rhg_evp.F90. This argument causes data on the internal north fold to be updated from data on the other side of the north fold. Testing shows that this is sufficient to prevent the growth of instabilities in the calculation of u_ice and v_ice (using the global maximum Courant number as a measure).
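
For reference, a sketch of what such a call could look like (based on the usual multi-field lbc_lnk interface; the exact argument list in icedyn_rhg_evp.F90 may differ):

```fortran
! Sketch only: exchange the u_ice/v_ice halos and force the full north-fold update,
! so that the duplicated points on the internal north fold are overwritten too.
CALL lbc_lnk( 'icedyn_rhg_evp', u_ice, 'U', -1._wp, v_ice, 'V', -1._wp, ldfull=.TRUE. )
```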

In summary, the calculations underpinning u_ice and v_ice give very slightly different results on the internal north fold, compared to those on the other side of the north fold. The next section outlines a theory as to why.

It isn't clear to me why these differences would amplify over time; perhaps someone more familiar with the numerical aspects of SI3 would have an idea?

Underlying cause

Floating point arithmetic consistency and Fused Multiply Add operations (FMAs)

Floating point arithmetic is not exact: a+b is in practice equal to a+b+E(a,b), where E is a rounding error. Crucially, this means that floating point arithmetic is usually neither associative nor distributive. For example, a+b+c requires two operations to calculate, which could be represented by the compiler as either (a+b)+c or a+(b+c). Because the rounding error in each case is different, the results will not be exactly the same. For this reason, brackets may need to be inserted to ensure consistent results by fixing the order of operations.

Fused Multiply-Add operations (FMAs) are an optimisation that allows arithmetic of the form a*b+c to be calculated using a single floating point operation. This is faster and more accurate than using two operations, i.e. (a*b)+c, but it is no less prone to giving inconsistent results.

For example, c = a**2 - b**2 could be calculated by the compiler as a series of 3 operations without FMAs: c = (a*a) - (b*b). With FMAs this is reduced to 2 operations, one of which is an FMA, but there is ambiguity in the order of operations:

d  = a*a
c1 = d - b*b        ! FMA: b*b is fused with the subtraction
d  = b*b
c2 = a*a - d        ! FMA: a*a is fused with the subtraction

c1 and c2 will not be exactly the same, again due to differing rounding errors in the floating point calculation. In fact, c1 or c2 may be negative even if a == b.
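
The non-associativity itself is easy to demonstrate outside NEMO. A standalone sketch (purely illustrative, not part of the NEMO code):

```fortran
program fp_assoc
   ! Illustration only: the same three-term sum, bracketed two ways, gives
   ! results that differ in the last bit because the intermediate rounding differs.
   implicit none
   integer, parameter :: wp = selected_real_kind(12, 307)   ! double precision
   real(wp) :: a, b, c, s1, s2

   a = 1.0_wp
   b = 1.0e-16_wp
   c = 1.0e-16_wp

   s1 = (a + b) + c      ! a+b rounds back to 1.0, so b and c are both lost
   s2 = a + (b + c)      ! b+c = 2.0e-16 is large enough to survive the final add

   print *, 's1      = ', s1
   print *, 's2      = ', s2
   print *, 's2 - s1 = ', s2 - s1   ! non-zero: the two bracketings disagree
end program fp_assoc
```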

Role of FMAs in the instability issue

The calculations of u_ice and v_ice on either side of the internal north fold give consistent results (and prevent the instability) when one of the following Intel compiler flags is used:

- -no-fma
- -fp-model consistent

Each of these flags disables FMAs, implying that FMAs are responsible for the differing results on either side of the north fold. Why? My theory is that this is due to a differing order of operations, similar to an issue we encountered when moving from 1 row to 2 rows of halo points in NEMO 4.2.

Take the following arithmetic:

c(ji,jj) = a(ji,jj-1) + a(ji,jj) + a(ji,jj+1)

This may be calculated by the compiler as:

c(ji,jj) = a(ji,jj-1) + ( a(ji,jj) + a(ji,jj+1) )

On the other side of the north fold the calculation will be mirrored, so that a(ji,jj-1) and a(ji,jj+1) values are swapped. The calculation then effectively becomes:

c(ji,jj) = a(ji,jj+1) + ( a(ji,jj) + a(ji,jj-1) )

This gives a slightly different result, because the order of operations has changed. Brackets must therefore be inserted into the expression to fix the order of operations and ensure consistent results on either side of the north fold.
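
One way to see the bracketing fix (a sketch of the idea only, not the actual change made in the code) is to combine the two mirrored neighbours first, so the expression is unchanged when jj-1 and jj+1 are swapped:

```fortran
! Sketch only: the bracketed pair contains the same two operands on both sides of
! the north fold, so the rounding error is identical and the duplicated points match.
c(ji,jj) = ( a(ji,jj-1) + a(ji,jj+1) ) + a(ji,jj)
```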

lbc_lnk copies data from the internal domain on one side of the north fold onto the halo points on the other side. This eliminates any difference in results on either side of the north fold. When moving to 2-point haloes, many lbc_lnk calls were replaced by direct calculation on the halo points. This led to results changing and required brackets to be inserted at various points in the code to recover the previous results.

I believe the same thing is happening with FMAs in icedyn_rhg_evp.F90: they are giving different results on either side of the north fold because the order of operations is different.

Solutions

As discussed previously, brackets could be inserted to ensure a consistent order of operations, in this case with the aim of forcing the compiler to consistently implement FMAs in an arithmetic expression. However, this is not feasible for several reasons:

This issue is likely to be exclusive to our (the Met Office's) compiler and platform, meaning that it probably won't follow us onto the Gen 1 HPC (which does not have the Intel compiler). Therefore, a better solution is to use one of the Intel compiler flags that disable FMAs. This may slow the model down, but I think this would be an acceptable compromise for the short time we will be on the current HPC.

My suggestion is to use -fp-model consistent (replacing -fp-model precise) rather than -no-fma, since the latter flag is set by the former, and the former flag is used by other groups in the NEMO consortium.

Tests in progress and outstanding issues with sea ice conservation

Several experiments based on my NEMO 5.0rc "GOSI10-like" suite (u-di687) are currently running for 30 years, each of which is trialling a different fix. These experiments were all run from the same suite, but are being archived under different experiment IDs:

All experiments failed at some point with an ice grid-point conservation issue. These conservation diagnostics are enabled with ln_icediachk=.true. and check that spurious changes in ice mass, salt and heat do not exceed a given tolerance. Globally-integrated changes and changes within a grid cell have different tolerances, with the latter being more strict and raising an error while the former only prints a warning. Examples of these warnings are:

iceupdate : violation mass cons. [kg] =    32780928625.5002     
iceupdate : violation heat cons. [J]  =  -4.232530907118238E+015
iceupdate : violation mass cons. [kg] =    25777135366.6723     
iceupdate : violation heat cons. [J]  =  -4.261563321882954E+015

We do not normally enable these checks (I'm not sure why). I enabled them while looking into the instability issue since they served as a useful diagnostic. I kept them enabled in the above 30-year tests in the hope that the conservation warnings would be resolved by the fixes being tested. However, while these warnings have been significantly reduced in frequency (they now appear for a couple of months every 2-6 years rather than most months every year), they are still occurring. Eventually, the grid-point conservation checks stopped all 3 experiments.

I believe these conservation issues are another characteristic of the Intel compiler. I have been running GOSI9p8.1 (based on NEMO 4.0.4) with the Cray (u-dk107) and Intel (u-dk108) compilers. The Intel compiler outputs similar warnings from the sea ice conservation diagnostics (it reports negative enthalpies rather than spurious mass/heat), but the Cray compiler (the standard for GOSI9 suites) outputs no warnings at all.

Given that we will be moving to Gen 1 and away from the Intel compiler, it's likely that the conservation warnings will disappear. So I mention this as something to keep an eye on when we start running on Gen 1, but I suspect it will no longer be an issue.

I have continued u-di687 and u-dj999 without the sea ice conservation diagnostics for the sake of completing 30 years. These should tell us if there is any systematic difference between these two fixes in terms of scientific results, as well as whether -fp-model consistent will significantly affect run times.

A related issue- restartability

In the course of investigating the above issue, I also noticed that NEMO 5.0rc was not restartable, i.e. a 30-day run did not give the same results as 3 consecutive 10-day runs. This issue has also been traced back to the use of FMAs, and the same compiler flags resolve it, although I don't know whether the issue is specific to the Intel compiler.

As for the other issue, it's likely that this one won't follow us onto Gen 1, so I again recommend using -fp-model consistent instead of -fp-model precise in the compiler flags.

isabellaascione commented 2 weeks ago

Preliminary results running GOSI10 based on NEMO5 RC: ValidationJMMP_update_Nov.pptx

Full validation notes:

ocean: https://gws-access.jasmin.ac.uk/public/jmmp/valid_ocean/u-dk027_10y_vs_u-dh157_10y/assess.html

sea-ice: https://gws-access.jasmin.ac.uk/public/jmmp/valid_ice/u-dk027_10y_si3_vs_u-dh157_10y_si3/assess.html