aekiss commented 6 years ago

ACCESS-OM2-01 crashed on 12 Aug in both year 3 and 4 when running with dt=300s. The error reported in access-om2.err in year 3 is:

Warning: Departure points out of bounds in remap
my_task, i, j =        1179          17          62
dt, uvel, vvel =    300.000000000000       0.569499526003574     
  3.71828483289453     
dpx, dpy =  -170.849857801072       -1115.48544986836     
HTN(i,j), HTN(i+1,j) =   4208.11316624741        4214.00348097445     
HTE(i,j), HTE(i,j+1) =   1030.78296224141        1030.90528998208     
istep1, my_task, iblk =      274579        1179           1
Global block:        1180
Global i and j:        1726        2671
remap transport: bad departure points
forrtl: error (78): process killed (SIGTERM)

bad departure points error is triggered in ice_transport_remap.F90. Seems to occur when advective velocity is excessive, carrying ice more than one grid cell (line 1563, subroutine departure_points: https://github.com/OceansAus/cice5/blob/49b36d4bfb97328e818d428d0b8438144dbd69a1/source/ice_transport_remap.F90#L1563). vvel = 3.71828483289453 (m/s presumably) - seems pretty fast yields dpy(i,j) = -dtvvel(i,j) = -1115.48544986836 < -HTE = -1030. Location (i,j)=(1726,2671) is in the Canadian Archipelago, apparently southeast of Victoria Island. HTE = 1030m is a very small grid spacing for the ocean - I guess the grid is extra fine there near the tripole. Year 4 crashed on the same day (12 Aug) at nearly the same point, (i,j)=(1720, 2676). `/g/data1/ua8/JRA55-do/RYF/v1-3/.1984_1985.ncseems to show the passage of a low pressure system (~990hPa) on 11-12 Aug in that location, with strongish (~10m/s) winds swinging from southeasterly to southwesterly. Crashed runs are stored in/short/v45/aek156/access-om2/control/01deg_jra55_ryf/work-3208848and/short/v45/aek156/access-om2/control/01deg_jra55_ryf/work-3322545`

Reducing timestep to dt=240s allowed the simulation to progress through Aug in both year 3 and 4. It is happy with dt=300s in every other month and did not crash in Aug of year 1 or 2.

It would be good to have dt=300s (or more!) for all months. Should we consider Rayleigh drag in this region, or changing the topography?

This same error seems to have also occurred in May-June of year 5 with dt=450s, although at a different location (i,j)=(3515, 2297), at the head of the Gulf of Ob - see https://github.com/marshallward/payu/issues/96

nichannah commented 6 years ago

@aekiss when you get time could you take a look at the section "Choosing an appropriate time step" in the CICE manual, http://oceans11.lanl.gov/trac/CICE/attachment/wiki/WikiStart/cicedoc.pdf?format=raw ?

Our current configuration looks OK to me however it does indicate that we could try modifying ndtd and/or ndte in the namelist. This is something that I did with the 0.25 ACCESS-CM, it can slow things down.

nichannah commented 6 years ago

Siobhan said:

"Hi Andrew we had this on different grid (1degree ) it may be the channel issue as you say, it may be the ocean velocities have gone unstable as well, it was often near the start of the run when the concentrations were very low so they were accelerating very fast we did have upper limits set I though would work for all points but obviously not these. "

aekiss commented 6 years ago

Thanks @nicjhan

I've tested ndtd = 2 (instead of 1). It runs ~1.7x slower so I've had to kill that test run and split it into individual months. These are queued, so I haven't yet confirmed that this fixes the crash (though I'm hopeful it will).

This seems to indicate that CICE runtime is roughly proportional to ndtd, so we'd need to double its ncpus to 2400 to balance the load with MOM (optimistically assuming it scales linearly with ncpus), ie a 22% increase in total cpus. However if this also allows a longer dt (say 450s instead of the current 300s) it would make it cheaper to run overall (but harder to get through queue).

aekiss commented 6 years ago

Good news - running with ndtd = 2 (instead of 1) ran through Aug of year 5 with dt=300s, which wasn't possible with dt=240s in year 3 or 4. Output from this run (3898145) is in /short/v45/aek156/access-om2/control/01deg_jra55_ryf/archive/output031

It is slower though: 1mo with ndtd=2: 01deg_jra55_ryf.o3898145: Walltime Used: 03:28:24 cf 1mo with ndtd=1: 01deg_jra55_ryf.o3874070: Walltime Used: 02:32:07

aidanheerdegen commented 6 years ago

In this documentation:

https://ncar.github.io/CICE/doc/build/html/users_guide/ice_nml_var.html#model-physics It says:

The calculation of the ice velocities is subcycled ndte times per timestep so that the elastic waves are damped before the next timestep. The subcycling timestep is calculated as

dte = dt / ndte

and must be sufficiently smaller than the damping timescale T, which needs to be sufficiently shorter than dt.

dte<T<dt

This relationship is discussed in [6]. The best ratio for [dte : T : dt] is [1 : 40 : 120]. Typical combinations of dt and ndte are (3600., 120), (7200., 240) (10800., 120). The default ndte is 120 as set in ice_init.F90.

The pdf doc quoted above says:

In practice, the ratio ∆te : T : ∆t = 1 : 40 : 120 provides both stability and acceptable efficiency for time steps (∆t) on the order of 1 hour.

Now according to @aekiss namelist comparison tool, the values in the models are:

Variable	1 deg	0.25 deg	0.1deg
dt	3600	1200	300
ndte	120	120	120

There is also this:

Q: EVP time steps Although the CICE documentation gives good hints how to choose the main time step dt, it is not clear to me where the ratio 1:40:120 is derived. I would be really grateful if you could give some advice how to make "the best" choice for the EVP sub-time stepping and value of E0.

A: The subcycling step in EVP is really an iteration to damp out the elastic waves, which are generated every time the mass/strength changes. The ratio, 1:40:120, is essentially the subcycling time step to the damping timescale to the full time step, the full time step being defined by when the mass, strength and surface forcing changes. We need the damping timescale to be sufficiently shorter than the full time step so that the elastic waves CAN damp out within it, and we need the subcycling timestep sufficiently shorter than the damping timescale to actually accomplish the damping. A choice of 1:80:240 is better in the sense that you'll damp the elastic waves more completely, but on the other hand it will be more expensive to run---the parameter choices are really a balancing act between how small the residual error is and how many computational cycles you're willing to spend to get there, as is often the case in numerical modeling. When the ice strength P is very large, the elastic waves are larger and more difficult to damp out. The elastic wave damping vs ice strength issues are more fully explained in this paper:

Hunke, E. C. Viscous-Plastic Sea Ice Dynamics with the EVP Model: Linearization Issues. Journal of Computational Physics, 170, 18–38, 2001.

E0 (eyc in the code) is the ratio (damping timescale)/(full time step). The other namelist parameter used for EVP, ndte (the number of subcycles per timestep) is an integer to ensure that the subcycling steps dte evenly divide the full timestep.

The "revised EVP" method, developed by Sylvain Bouillon et al., provides an alternative subcycling approach. See the CICE documentation cicedoc.pdf for details, and

Bouillon, S., T. Fichefet, V. Legat, and G. Madec. The revised elastic-viscous-plastic method. Ocean Modelling, submitted, 2013.

I wonder if ndte is too high? If that could be reduced then ndtd could be increased without penalty

ofa001 commented 6 years ago

I think it would be just swings and roundabouts altering ndtd is the obvious change here, 120 iterations is the normal amount the real issue is that by having the tripole so close to this point you have very small grid resolution O (1km) so it wants to move more one grid box when there is a strong low pressure system there, you could limit the max velocity that it can calculate, we had already done that for the 1 degree and I think Nic put it in the 0.25 system 3m/s is too fast. I will look up the code. its probably in already if you based it on the 0.25 version, you may want to set different limits to get round this point. I don't think we want to shift the tripole a bit further into the land

ofa001 commented 6 years ago

Was there anything in ice_dyn_shared in subroutine stepu to limit the ice velocity

ifdef AusCOM

!20160624 -- Siobhan and Dave's idea to set ice velocity limit to avoid !transport remap "departure point error": !if (abs(uvel(i,j) >= vel_max) uvel(i,j) = sign(c1, uvel(i,j)) ! vel_max !if (abs(vvel(i,j) >= vel_max) vvel(i,j) = sign(c1, vvel(i,j)) ! vel_max uvel(i,j) = sign(min(abs(uvel(i,j)),vel_max),uvel(i,j)) vvel(i,j) = sign(min(abs(vvel(i,j)),vel_max),vvel(i,j))

endif

vel_max was set to 5.0m/s but it could be set lower in your case as it looks like that was still to high in the Canadian Archipelago with such small grid boxes. Our case was the open ocean at the start of spin up.

aidanheerdegen commented 6 years ago

My point, though badly made, was the documentation says

"In practice, the ratio ∆te : T : ∆t = 1 : 40 : 120 provides both stability and acceptable efficiency for time steps (∆t) on the order of 1 hour."

We are not in a regime with time steps on the order of 1 hour. So I wondered if there had been recommendations for other regimes.

What is the damping timescale for this regime? If we don't need to iterate for as long to damp, then we can boost ndtd. Or are these essentially interchangeable?

ofa001 commented 6 years ago

I think from my reading and looking at all the other places in the code when they are used they are interchangeable, Aidan, Its just been traditional to have 120 iterations on the evp loop in most circumstance I have seen. You could give it a try don't think it will alter things and it might be a while before you spot instabilities showing up. Not as obvious as this. Limiting the velocities for this small grid spacing makes more sense we did it in the coupled model at the initialisation when the atmosphere and ocean where slightly mismatched and strong winds were over newly freezing ice with very low concentration so the ice shot off at speed.

The point about shifting the tripole, when we put it in the Canadian Archipelago we were at 1 degree other groups have it slightly shifted more into Canada. So would have been less islands to worry about and less passages, but we were always worried about the effect of those smallest grid boxes.

aidanheerdegen commented 6 years ago

Thanks for the feedback Siobhan. Reducing the max ice velocity as you suggest sounds like the best way forward.

aekiss commented 6 years ago

Thanks Siobhan. I was thinking of using Rayleigh damping in this region. Would that be preferable to a velocity cap?

aekiss commented 6 years ago

EDIT: the following contains mistakes, as I missed the fact that dt has multiple definitions in the cice code (it can be either the thermodynamic or dynamic timestep). See https://github.com/CICE-Consortium/CICE/pull/83

========================

Thanks Aidan, good question, I've had a close look at this now.

Short answer - no, we shouldn't decrease ndte when we decrease dt (but we may need to increase ndte in proportion to ndtd).

The details:

Firstly, there are mistakes in sec 4 of the CICE manual, which uses dt_dyn in several places where it should be dt for consistency with the code (and sec 3.5 in the manual).

We are using "classic EVP" (kdyn = 1, revised_evp = .false.). Not sure if that's an optimal choice, but that's a separate question. dt_dyn = dt/ndtd T = eyc.dt is elastic damping timescale (NB: sec 4.4 incorrectly states T = eyc.dt_dyn; see ice_dyn_shared.F90) dte = dt/ndte (NB: sec 4.4 incorrectly states dte = dt_dyn/ndte; see ice_dyn_shared.F90)

so dte : T : dt = dt/ndte : eyc.dt: dt = 1 : eyc.ndte : ndte

We don't explicitly set eyc, but the default is 0.36. With ndte=120 we get dte : T : dt = 1: 43.2 : 120 which is pretty close to 1 : 40 : 120 The values are not critical but I suppose we could set eyc=1/3 to get exactly 40 if we wanted.

Note that dte : T : dt = 1 : eyc.ndte : ndte is independent of the thermo timestep dt so we shouldn't reduce ndte for the high-resolution runs which have smaller dt. It makes sense that the optimal dte : T : dt is independent of dt, as the elasticity does not model any real effect - it is just a numerical trick to efficiently get to the visco-plastic solution (I presume it's done to convert the problem from elliptic to hyperbolic), so T must always be signficantly less than dt so that the thermo doesn't see the fake elastic waves. dte must also be much less than T so those fictitious elastic waves are well simulated.

I think the same argument should also apply to the dynamics timestep, ie we should have T : dt_dyn much less than 1 (say, 1:3 to match T : dt) so that the dynamics also doesn't see the elastic waves. If this is the case we should increase ndte in proportion to ndtd - which would be unfortunate, as it would slow things down even more.

ofa001 commented 6 years ago

In Answer to aekiss Raleigh friction will slow down the ocean velocity under the ice but is sounded like this problem was set off by an atmospheric low pressure system so you may still need a velocity cap in the ice as the ocean velocity may be acting to slow the ice down not speeding it up, depending on the balance of terms, locally.

aekiss commented 6 years ago

Hi @ofa001 our present code does not include a vel_max velocity limiter in subroutine stepu but I could add one and a namelist variable vel_max. However I wonder whether this limiter implementation might cause problems, as it limits the velocity components individually and so can change the direction of the velocity. I would be more comfortable with limiting the velocity magnitude by scaling both components so the direction is preserved. Even so, this would alter the ice divergence, potentially causing problems. Did you find problems of this sort?

aekiss commented 5 years ago

Closing this old issue.

We have found ndtd = 2 or 3 allows dt=450s or more (often up to 600s). We have not changed ndte.

COSIMA / access-om2

"bad departure points" error in CICE #78

ifdef AusCOM

endif