JMMP-Group / GO_coordination

Repository to coordinate the joint development of GOSI configurations

GOSI10 porting to NEMO5 #22

Open isabellaascione opened 5 months ago

isabellaascione commented 5 months ago
1. Testing GOSI10 with NEMO5 beta
2. Testing GOSI10 with NEMO5 release candidate
isabellaascione commented 5 months ago

Changes identified with NEMO5 beta release:

NEMO-ocean parameters

| parameter | status at NEMO5 beta | GOSI10 at NEMO5 beta | GOSI10 trial at NEMO 4.2.2 (for comparison with GOSI10 at NEMO5 beta) | GOSI10 at NEMO 4.2.2 (standard) |
| --- | --- | --- | --- | --- |
| **namrun** | | | | |
| ln_top | $${\color{green}new}$$ | TRUE (but used only with key_top) | not available | not available |
| ln_rst_eos | $${\color{gray}N/A}$$ (GOSI10 only) | TRUE (ported from GOSI10 package) | TRUE | TRUE |
| **namdom** | | | | |
| ln_linssh | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| ln_shuman | $${\color{green}new}$$ | FALSE | not available | not available |
| **namlbc** | | | | |
| ln_shlat2d | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | FALSE | TRUE |
| rn_shlat | | 0 | 0 | -9999 |
| **namsbc** | | | | |
| ln_dm2dc | $${\color{blue}preexisting}$$ | FALSE | FALSE | TRUE |
| **namsbc_blk** | | | | |
| ln_MFS | $${\color{green}new}$$ | FALSE | not available | not available |
| ln_prec_met | $${\color{green}new}$$ | FALSE | not available | not available |
| rn_Cd_ia | $${\color{orange}renamed}$$ | rn_Cd_ia=1.4e-3 | rn_Cd_i=1.4e-3 | rn_Cd_i=1.4e-3 |
| rn_Ce_ia | $${\color{orange}renamed}$$ | rn_Ce_ia=1.4e-3 | rn_Ce_i=1.4e-3 | rn_Ce_i=1.4e-3 |
| rn_Ch_ia | $${\color{orange}renamed}$$ | rn_Ch_ia=1.4e-3 | rn_Ch_i=1.4e-3 | rn_Ch_i=1.4e-3 |
| ln_Cx_ice_frm | $${\color{green}new}$$ | FALSE | not available | not available |
| nn_frm | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cs_io | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cs_ia | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cr_ia | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cr_io | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cf_ia | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| rn_Cf_io | $${\color{green}new}$$ | set to default as per reference namelist, but not used because ln_Cx_ice_frm=.false. | not available | not available |
| **namsbc_abl** | | | | |
| ln_pga_abl | $${\color{green}new}$$ | FALSE (only for ln_abl = T) | not available | not available |
| rn_vfac | $${\color{green}new}$$ | (only for ln_abl = T) | not available | not available |
| **namsbc_blk** | | | | |
| ln_crt_fbk | $${\color{blue}preexisting}$$ | TRUE | TRUE | FALSE |
| rn_stau_a | $${\color{blue}preexisting}$$ | -2.90E-03 | -2.90E-03 | not used |
| rn_stau_b | $${\color{blue}preexisting}$$ | 8.00E-03 | 8.00E-03 | not used |
| rn_vfac | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | 0 | 1 |
| **namtra_qsr** | | | | |
| nn_chldta | $${\color{blue}preexisting}$$ | 1 | 1 | 0 |
| rn_chl_conc | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | not used | 0.1 |
| **nameos** | | | | |
| rn_T0 | $${\color{green}new}$$ | 10 | hardcoded to 10 | hardcoded to 10 |
| rn_S0 | $${\color{green}new}$$ | 35 | hardcoded to 35 | hardcoded to 35 |
| **namtra_adv** | | | | |
| nn_fct_imp | $${\color{green}new}$$ | should be 2, but that does not work; set to 1 | not available | not available |
| **namtra_eiv** | | | | |
| ln_eke_equ | | | | |
| namldf_eke | $${\color{green}new}$$ | namelist with new parameters (not listed in this table); unused because nn_aei_ijk_t=21 | not available | not available |
| **namvvl** | | | | |
| nn_vvl_interp | $${\color{blue}preexisting}$$ | 2 | 2 | 0 |
| **namdyn_adv** | | | | |
| ln_dynadv_ubs | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| **namdyn_vor** | | | | |
| ln_dynvor_eeT | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| **namdyn_hpg** | | | | |
| ln_hpg_zps | $${\color{red}removed}$$ | N/A | FALSE | FALSE |
| **namdyn_spg** | | | | |
| ln_bt_av | $${\color{red}removed}$$ (now hardcoded to TRUE: ll_bt_av in dynspg_ts.F90) | N/A | TRUE | TRUE |
| rn_htau_scaling | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | ported | 1 | 1 |
| **nammpp** | | | | |
| ln_mppdelay | $${\color{green}new}$$ | FALSE | not available | not available |
| nn_hls | $${\color{blue}preexisting}$$ | 2 | 1 (should be 2, but does not have much impact) | 1 |
| **namzdf** | | | | |
| ln_zdfiwm | $${\color{blue}preexisting}$$ | TRUE | TRUE | FALSE |
| ln_zdftmx | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A (not ported) | FALSE | TRUE |
| rn_avm0 | $${\color{blue}preexisting}$$ | 1.40E-06 | 1.40E-06 | 1.20E-04 |
| rn_avt0 | $${\color{blue}preexisting}$$ | 1.00E-10 | 1.00E-10 | 1.20E-05 |

Sea-Ice parameters

| parameter | status in NEMO5 beta | GOSI10 at NEMO5 beta | GOSI10 trial at NEMO 4.2.2 (for comparison with GOSI10 at NEMO5 beta) | GOSI10 at NEMO 4.2.2 (standard) |
| --- | --- | --- | --- | --- |
| rn_cio | $${\color{orange}renamed}$$ | rn_Cd_io=1.0e-2 | rn_cio=1.0e-2 | rn_cio=1.0e-2 |
| ln_icedS | $${\color{red}removed}$$ | N/A | ln_icedS=.true. | ln_icedS=.true. |
| ln_flushing | $${\color{green}new}$$ | TRUE | not available as a namelist option, but included in the code | not available as a namelist option, but included in the code |
| ln_drainage | $${\color{green}new}$$ | TRUE | not available as a namelist option, but included in the code | not available as a namelist option, but included in the code |
| rn_sinew | $${\color{green}new}$$ | used when nn_icesal=2; initially set to 0.3 as per the recommendation for NEMO5 beta, then set to 0.25 as per the recommendation for NEMO5 RC. rn_sinew*sss_1d is used instead of rn_simax, so results will differ from 4.2.2 | N/A | N/A |
| rn_simax | $${\color{red}removed}$$ | N/A | 20 | 20 |
| rn_sal_himin | $${\color{green}new}$$ | 0 | hardcoded to 0 | hardcoded to 0 |
| ln_pnd_rain | $${\color{green}new}$$ | FALSE | N/A | N/A |
| rn_pnd_hl_min | $${\color{green}new}$$ | 0.005 | not available as a parameter, but hardcoded to 0.005 | not available as a parameter, but hardcoded to 0.005 |
| rn_pnd_hl_max | $${\color{green}new}$$ | 0.015 | not available as a parameter, but hardcoded to 0.015 | not available as a parameter, but hardcoded to 0.015 |
| rn_alb_lpnd | $${\color{green}new}$$ | set to 0.36; should behave the same as 4.2.2 | N/A | N/A |
| nn_liquidus | $${\color{green}new}$$ | 1 | N/A | N/A |
| nn_pnd_brsal | $${\color{gray}N/A}$$ (GOSI10 4.2.2 only) | N/A, use nn_liquidus instead | 0 | 0 |
isabellaascione commented 4 months ago

BUG

The parameter nn_fct_imp has been introduced with the 5-beta release. The default value is 1. To replicate the behaviour of 4.2.2, it should be set to 2. However, this option seems to be broken. I let Daley know about this, and an issue should be opened on the main NEMO repository.

update: Daley proposed a fix: https://forge.nemo-ocean.eu/nemo/nemo/-/issues/499
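
For context, nn_fct_imp lives in &namtra_adv. A minimal sketch of the setting used here (other entries omitted; the ln_traadv_fct line is only illustrative of the FCT scheme being active, values as described above):

```fortran
&namtra_adv        !   tracer advection (sketch only, other entries omitted)
   ln_traadv_fct = .true.   ! FCT advection scheme
   nn_fct_imp    = 1        ! 1 = default; 2 should replicate 4.2.2 behaviour but is currently broken
/
```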

dcalve commented 4 months ago

The r5282_nemo5rc_changes_and_fix_intel_xios MOCI branch should be used to run GOSI10 after it has been rebased onto the NEMO 5 release candidate (currently it is a branch of the beta).

This is required because the NEMO 5 release candidate has removed lib_cray.f90, necessitating an update to the fcm-make configurations in MOCI. The above branch contains these changes as well as those from r4684_fix_intel_xios, which we currently use.

Suites using GOSI10 based on the NEMO 5 release candidate will need the following update in app/fcm_make_ocean/rose-app.conf:

config_root_path=fcm:MOCI.xm/branches/dev/daleycalvert/r5282_nemo5rc_changes_and_fix_intel_xios
nemo_version_file=nemo5.0-version.cfg
isabellaascione commented 4 months ago

The NEMO 5 RC has been released.

Moving from NEMO5 beta to NEMO5 RC, the new reference namelist recommends using rn_sinew=0.25 (instead of 0.3, as in NEMO5 beta) when nn_icesal=2 is used.
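
For reference, a minimal sketch of the corresponding sea-ice salinity settings (namelist group name assumed from the SI3 reference namelist; other entries omitted):

```fortran
&namthd_sal        !   ice salinity (sketch only, other entries omitted)
   nn_icesal = 2        ! prognostic salinity scheme
   rn_sinew  = 0.25     ! NEMO5 RC recommendation (0.3 was recommended for the NEMO5 beta)
/
```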

Additionally, the following changes happened to the namelists:

- namhsb: namelist removed
  - ln_diahsb removed
- namdom
  - ln_crs removed
- namcfg
  - ln_use_jattr removed
- namcrs: namelist removed
  - nn_factx removed
  - nn_facty removed
  - nn_binref removed
  - ln_msh_crs removed
  - nn_crs_kz removed
  - ln_crs_wn removed
- namc1d: now includes the parameters from the namelists namc1d_dyndmp and namc1d_uvd, which were present in 5-beta; namc1d_dyndmp and namc1d_uvd no longer exist as namelists
- nam_tide
  - ln_tide_dia removed
- namtra_eiv
  - ln_ldfeiv_dia removed
- namflo: namelist removed
  - ln_floats removed
  - jpnfl removed
  - jpnnewflo removed
  - ln_rstflo removed
  - nn_writefl removed
  - nn_stockfl removed
  - ln_argo removed
  - ln_flork4 removed
  - ln_ariane removed
  - ln_flo_ascii removed
- nam_dia25h: namelist removed
  - ln_dia25h removed
- namnc4
  - ln_nc4zip default is .false.
isabellaascione commented 3 months ago

I tested GOSI10 p2.0 with ORCA025. In the run with NEMO 5 beta I was using rn_sinew=0.3 with nn_icesal=2, as recommended for the NEMO5 beta release. The run crashed after about 25 years, in July 2001, with errors when writing the variable sfxice.

Looking into the code, the nn_icesal=2 option is no longer coded as it was in NEMO 4.2.2. I did a 1-day test run and compared the variable sfxice with the reference at GOSI10 p2.0. The results (below, left picture) show that nn_icesal=2 with rn_sinew=0.3 produces different results from GOSI10 p2.0 (middle picture).

I backported the formulas used in NEMO 4.2.2 into the branch (https://forge.nemo-ocean.eu/devs/ukmo/gosi/gosi/-/commit/b5e4620c40b9fc064630b0007a22d58c5effa9b8). With this change, the nn_icesal=2 option no longer uses the parameterisation based on rn_sinew.

The results in the picture below show that the sfxice variable (right picture) is consistent with GOSI10 p2.0 (middle picture) when using the formulation from NEMO 4.2.2.

image

With NEMO5 RC the recommendation is now to use rn_sinew=0.25 with nn_icesal=2.

Emma also advised to try nn_icesal=4, which uses a default of rn_sinew=0.75 in the reference namelist.

Following the above, I set up the following 3 long runs:

| suite | description | outcome |
| --- | --- | --- |
| u-di347 | GOSI10pX.0 (based on NEMO5 RC), nn_icesal=2 and rn_sinew=0.25 | crashed shortly after 20 years, in April 1997, with error "NetCDF: Numeric conversion not representable Unable to write data given the location id: 65536 and the variable whose id: 58 and name: sfxice" |
| u-di581 | GOSI10pX.0 adapted to NEMO5 RC, using nn_icesal=4 and rn_sinew=0.75 | crashes on 19890601 with the same error as above |
| u-di568 | GOSI10pX.0 adapted to NEMO5 RC, but backporting the formulations used in NEMO 4.2.2; using nn_icesal=2 and re-introducing rn_simax=20.0 (rn_sinew is not used in this run) | crashes on 19890601 with the same error as above |
isabellaascione commented 3 months ago

@emmafiedler

@edblockley

@alex.west49

isabellaascione commented 2 months ago

NEMO system meeting on 04/09/2024

There was a discussion about the validation effort. The full minutes of the meeting can be found at (login required): https://forge.nemo-ocean.eu/nemo/nst/-/wikis/meetings/2024_09_4th

Notes for the validation efforts (from the minutes):

CNRS:

Ran ORCA1 for 120 years for v4.2 (MLF) and v5.0 (MLF and RK3). Found an issue with sea-ice quantities when using RK3: under investigation (eORCA1 monitoring here).

No particular issue found in coupled ocean-atmosphere global and regional configurations with NEMO v5.0 (but MLF vs RK3 was not checked).

Used XIOS3 following Andrew's instructions in the draft user guide (here), and it worked really well.

NOC:

ORCA025: no problems noticed in the first 5 years.

CMCC:

Did not notice any issues for the Black Sea & Mediterranean Sea regional configs during the first year of simulation; wave coupling will be validated soon.

UKMET:

isabellaascione commented 2 months ago

NEMO system meeting on 25/09/2024

The full meeting minutes are available at (login required):

https://forge.nemo-ocean.eu/nemo/nst/-/wikis/meetings/2024_09_25th

Notes related to validation efforts:

dcalve commented 1 month ago

Investigation into the stability issue

This is a summary of my investigation into the issue described in Isabella's comment above.

I was able to replicate the issue in an experiment without the GOSI10 code (so just using the NEMO 5.0 release candidate). The following findings are based on this configuration.

The issue

The following error:

NetCDF: Numeric conversion not representable Unable to write data given the location id: 65536 
and the variable whose id: 58 and name: sfxice

occurs when XIOS is unable to convert data to the precision required by the netCDF file when writing to it. This usually means that the data contains NaNs, infinite values, or very large/small values.

In this case, the reported diagnostic (sfxice) uses data from the SI3 salt flux sfx, which did indeed have very large values at some point during the cycle that crashed.

Source of the issue

The NEMO domain and north fold

The NEMO grid is split into two parts: the internal domain, where the model calculations are carried out, and the surrounding halo points, which hold copies of data from neighbouring subdomains or from across the model boundaries.

In a global ORCA configuration, the east and west boundaries are periodic, the south boundary is closed and the north boundary is periodic with a north fold. The following figure shows a schematic of the north boundary for the U grid on a T-point north fold:

image

In this figure, the internal domain is the area below the thick black line between rows jpj-1 and jpj-2. The halo points are copies of this data (compare cells with the same colours), mirrored around the central columns (jpiglo/2 and jpiglo/2+1) and across the top row of the internal domain (jpj-2) so that the sign is changed; this is the north fold.

This particular north fold type (mirrored along T grid points) means that the right half of the top row of the inner domain contains duplicate data from the left half; this is shown by the light grey shading. I will refer to this specific part of the north fold as the "internal north fold" below.

The internal north fold is not normally updated by lbc_lnk, i.e. the left half of row jpj-2 is not copied onto the right half, as it is part of the internal domain. This assumes that the results of the calculation will be identical on both halves of the row, but this isn't necessarily true. In fact, this point is acknowledged within the NEMO code itself:

! In theory we should exchange only nn_hls lines.
!
! However, some other points are duplicated in the north pole folding:
!  - c_NFtype='T', grid=T : half of the last line (jpiglo/2+2:jpiglo-nn_hls)
!  - c_NFtype='T', grid=U : half of the last line (jpiglo/2+1:jpiglo-nn_hls)
!  - c_NFtype='T', grid=V : all the last line nn_hls+1 and (nn_hls+2:jpiglo-nn_hls)
!  - c_NFtype='T', grid=F : all the last line (nn_hls+1:jpiglo-nn_hls)
!  - c_NFtype='F', grid=U : no points are duplicated
!  - c_NFtype='F', grid=V : half of the last line (jpiglo/2+1:jpiglo-nn_hls)
!  - c_NFtype='F', grid=F : half of the last line (jpiglo/2+1:jpiglo-nn_hls-1)
! The order of the calculations may differ for these duplicated points (as, for example jj+1 becomes jj-1)
! This explain why these duplicated points may have different values even if they are at the exact same location.
! In consequence, we may want to force the folding on these points by setting l_full_nf_update = .TRUE.
! This is slightly slower but necessary to avoid different values on identical grid points!!

This characteristic of the internal north fold is relevant in the next section, which describes the source of the issue.

Instabilities in sea ice velocity on the north fold

The large sfx values were tracked down to large sea ice velocities (~45 m/s) calculated by the SI3 EVP rheology scheme (icedyn_rhg_evp.F90) that consistently appeared on the internal north fold. By comparison, the corresponding points on the other side of the north fold had much more realistic velocities. These differences manifest quickly (within 120 solver substeps on the first timestep), and the velocities on the internal north fold grow over subsequent timesteps.

These differences can be eliminated by adding the ldfull=.TRUE. argument to the lbc_lnk calls for u_ice and v_ice in icedyn_rhg_evp.F90. This argument causes data on the internal north fold to be updated from data on the other side of the north fold. Testing shows that this is sufficient to prevent the growth of instabilities in the calculation of u_ice and v_ice (using the global maximum Courant number as a measure).
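
For reference, a sketch of what such a call could look like (based on the usual multi-field lbc_lnk interface; the exact argument list in icedyn_rhg_evp.F90 may differ):

```fortran
! Sketch only: exchange the u_ice/v_ice halos and force the full north-fold update,
! so that the duplicated points on the internal north fold are overwritten too.
CALL lbc_lnk( 'icedyn_rhg_evp', u_ice, 'U', -1._wp, v_ice, 'V', -1._wp, ldfull=.TRUE. )
```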

In summary, the calculations underpinning u_ice and v_ice give very slightly different results on the internal north fold, compared to those on the other side of the north fold. The next section outlines a theory as to why.

It isn't clear to me why these differences would amplify over time; perhaps someone more familiar with the numerical aspects of SI3 would have an idea?

Underlying cause

Floating point arithmetic consistency and Fused Multiply Add operations (FMAs)

Floating point arithmetic is not exact: a+b is in practice equal to a+b+E(a,b), where E is a rounding error. Crucially, this means that floating point arithmetic is usually neither associative nor distributive. For example, a+b+c requires two operations to calculate, which could be represented by the compiler as either (a+b)+c or a+(b+c). Because the rounding error in each case is different, the results will not be exactly the same. For this reason, brackets may need to be inserted to ensure consistent results by fixing the order of operations.

Fused Multiply-Add operations (FMAs) are an optimisation that allows arithmetic of the form a*b+c to be calculated using a single floating point operation. This is faster and more accurate than using two operations, i.e. (a*b)+c, but it is no less prone to giving inconsistent results.

For example, c = a**2 - b**2 could be calculated by the compiler as a series of 3 operations without FMAs: c = (a*a) - (b*b). With FMAs this is reduced to 2 operations, one of which is an FMA, but there is ambiguity in the order of operations:

d  = a*a
c1 = d - b*b        ! FMA: b*b is fused with the subtraction
d  = b*b
c2 = a*a - d        ! FMA: a*a is fused with the subtraction

c1 and c2 will not be exactly the same, again due to differing rounding errors in the floating point calculation. In fact, c1 or c2 may be negative even if a == b.
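
The non-associativity itself is easy to demonstrate outside NEMO. A standalone sketch (purely illustrative, not part of the NEMO code):

```fortran
program fp_assoc
   ! Illustration only: the same three-term sum, bracketed two ways, gives
   ! results that differ in the last bit because the intermediate rounding differs.
   implicit none
   integer, parameter :: wp = selected_real_kind(12, 307)   ! double precision
   real(wp) :: a, b, c, s1, s2

   a = 1.0_wp
   b = 1.0e-16_wp
   c = 1.0e-16_wp

   s1 = (a + b) + c      ! a+b rounds back to 1.0, so b and c are both lost
   s2 = a + (b + c)      ! b+c = 2.0e-16 is large enough to survive the final add

   print *, 's1      = ', s1
   print *, 's2      = ', s2
   print *, 's2 - s1 = ', s2 - s1   ! non-zero: the two bracketings disagree
end program fp_assoc
```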

Role of FMAs in the instability issue

The calculations of u_ice and v_ice on either side of the internal north fold give consistent results (and prevent the instability) when one of the following Intel compiler flags is used:

- -no-fma
- -fp-model consistent

Each of these flags disables FMAs, implying that FMAs are responsible for the differing results on either side of the north fold. Why? My theory is that this is due to a differing order of operations, similar to an issue we encountered when moving from 1 row to 2 rows of halo points in NEMO 4.2.

Take the following arithmetic:

c(ji,jj) = a(ji,jj-1) + a(ji,jj) + a(ji,jj+1)

This may be calculated by the compiler as:

c(ji,jj) = a(ji,jj-1) + ( a(ji,jj) + a(ji,jj+1) )

On the other side of the north fold the calculation will be mirrored, so that a(ji,jj-1) and a(ji,jj+1) values are swapped. The calculation then effectively becomes:

c(ji,jj) = a(ji,jj+1) + ( a(ji,jj) + a(ji,jj-1) )

This gives a slightly different result, because the order of operations has changed. Brackets must therefore be inserted into the expression to fix the order of operations and ensure consistent results on either side of the north fold.
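
One way to see the bracketing fix (a sketch of the idea only, not the actual change made in the code) is to combine the two mirrored neighbours first, so the expression is unchanged when jj-1 and jj+1 are swapped:

```fortran
! Sketch only: the bracketed pair contains the same two operands on both sides of
! the north fold, so the rounding error is identical and the duplicated points match.
c(ji,jj) = ( a(ji,jj-1) + a(ji,jj+1) ) + a(ji,jj)
```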

lbc_lnk copies data from the internal domain on one side of the north fold onto the halo points on the other side. This eliminates any difference in results on either side of the north fold. When moving to 2-point haloes, many lbc_lnk calls were replaced by direct calculation on the halo points. This led to results changing and required brackets to be inserted at various points in the code to recover the previous results.

I believe the same thing is happening with FMAs in icedyn_rhg_evp.F90: they are giving different results on either side of the north fold because the order of operations is different.

Solutions

As discussed previously, brackets could be inserted to ensure a consistent order of operations, in this case with the aim of forcing the compiler to consistently implement FMAs in an arithmetic expression. However, this is not feasible for several reasons:

This issue is likely to be exclusive to our (the Met Office's) compiler and platform, meaning that it probably won't follow us onto the Gen 1 HPC (which does not have the Intel compiler). Therefore, a better solution is to use one of the Intel compiler flags that disable FMAs. This may slow the model down, but I think this would be an acceptable compromise for the short time we will be on the current HPC.

My suggestion is to use -fp-model consistent (replacing -fp-model precise) rather than -no-fma, since the latter flag is set by the former, and the former flag is used by other groups in the NEMO consortium.

Tests in progress and outstanding issues with sea ice conservation

Several experiments based on my NEMO 5.0rc "GOSI10-like" suite (u-di687) are currently running for 30 years, each of which is trialling a different fix. These experiments were all run from the same suite, but are being archived under different experiment IDs:

All experiments failed at some point with an ice grid-point conservation issue. These conservation diagnostics are enabled with ln_icediachk=.true. and check that spurious changes in ice mass, salt and heat do not exceed a given tolerance. Globally-integrated changes and changes within a grid cell have different tolerances, with the latter being more strict and raising an error while the former only prints a warning. Examples of these warnings are:

iceupdate : violation mass cons. [kg] =    32780928625.5002     
iceupdate : violation heat cons. [J]  =  -4.232530907118238E+015
iceupdate : violation mass cons. [kg] =    25777135366.6723     
iceupdate : violation heat cons. [J]  =  -4.261563321882954E+015

We do not normally enable these checks (I'm not sure why). I enabled them while looking into the instability issue since they served as a useful diagnostic. I kept them enabled in the above 30-year tests in the hope that the conservation warnings would be resolved by the fixes being tested. However, while these warnings have been significantly reduced in frequency (they now appear for a couple of months every 2-6 years rather than most months every year), they are still occurring. Eventually, the grid-point conservation checks stopped all 3 experiments.

I believe these conservation issues are another characteristic of the Intel compiler. I have been running GOSI9p8.1 (based on NEMO 4.0.4) with the Cray (u-dk107) and Intel (u-dk108) compilers. The Intel compiler outputs similar warnings from the sea ice conservation diagnostics (it reports negative enthalpies rather than spurious mass/heat), but the Cray compiler (the standard for GOSI9 suites) outputs no warnings at all.

Given that we will be moving to Gen 1 and away from the Intel compiler, it's likely that the conservation warnings will disappear. So I mention this as something to keep an eye on when we start running on Gen 1, but I suspect it will no longer be an issue.

I have continued u-di687 and u-dj999 without the sea ice conservation diagnostics for the sake of completing 30 years. These should tell us if there is any systematic difference between these two fixes in terms of scientific results, as well as whether -fp-model consistent will significantly affect run times.

A related issue- restartability

In the course of investigating the above issue, I also noticed that NEMO 5.0rc was not restartable, i.e. a 30-day run did not give the same results as 3 consecutive 10-day runs. This issue has also been traced back to the use of FMAs, and the same compiler flags resolve it, although I don't know whether the issue is specific to the Intel compiler.

As for the other issue, it's likely that this one won't follow us onto Gen 1, so I again recommend using -fp-model consistent instead of -fp-model precise in the compiler flags.

isabellaascione commented 2 weeks ago

Preliminary results running GOSI10 based on NEMO5 RC: ValidationJMMP_update_Nov.pptx

Full validation notes:

ocean: https://gws-access.jasmin.ac.uk/public/jmmp/valid_ocean/u-dk027_10y_vs_u-dh157_10y/assess.html

sea-ice: https://gws-access.jasmin.ac.uk/public/jmmp/valid_ice/u-dk027_10y_si3_vs_u-dh157_10y_si3/assess.html