ESCOMP / CAM

Community Atmosphere Model
72 stars 136 forks source link

CAM Crashes with 58 levels and higher horiz resolution #442

Closed andrewgettelman closed 2 years ago

andrewgettelman commented 2 years ago

Opening an issue to describe crashes with high vertical resolution.

So far this has only been seen with higher resolution simulations, and with CAM-MPAS.

The basic test case is 58L CAM-MPAS aquaplanet crashes almost immediately with an error from CLUBB:

The errors are coming out of CLUBB ( we are not necessarily convinced it's CLUBB's fault yet) in advance_windm_edsclrm_module.F90.

The error is:

405: Fatal error solving for eddsclrm 405: Error in advance_windm_edsclrm

The error has been seen by @skamaroc and Xingying Huang (not sure their github names yet).

Vince Larson notes that:

edsclrm is CLUBB's array of scalars diffused by CLUBB's eddy diffusivity. windm is CLUBB's representation of the horizontal wind components. ("m" in CLUBB-speak denotes "grid-mean." So "windm" refers to the grid-mean values of u and v.)

I am guessing that the wind is the problem, not the eddy scalars, which are chemical species, etc.

The cause could be initial conditions (not initializing CLUBB variables). Or it could be upstream of CLUBB (and be the input winds).

Still trying to debug....

andrewgettelman commented 2 years ago

Thanks @adamrher, super helpful for reminding us. I think we can check the dry dep code (which does look problematic, and probably needs to be rewritten to get heights if appropriate, or make sure they do not go negative...).

I also agree that Bill is probably on to something as well with the CLUBB solver, and that could be adjusted.

andrewgettelman commented 2 years ago

From @nusbaume 👍

I am pretty early in my effort (I think you are all ahead of me), but I just wanted to say that in my run the NaN is coming into CLUBB via the state%q variable with constituent index 12 (i.e. DMS). So I can at least say that CLUBB isn't just randomly producing a NaN internally, but is instead being fed one (which would possibly point to dry deposition being the culprit).

I'll hold off for now until I hear how the dry deposition check goes. If turning off dry deposition doesn't work, or if you run into a new error, then just let me know and I will keep going with the debugging.

xhuang-ncar commented 2 years ago

@andrewgettelman my earlier comment seems to have gotten lost in the thread:

I thought we has some action items from the mtg last Tuesday? We discussed trying to comment out the dry deposition code on account of this DEBUG=TRUE FPE error (see below). If you look at the actual line that's triggering the FPE (/glade/work/xyhuang/CAM-1/src/chemistry/mozart/mo_drydep.F90:1110), lots could be going wrong.

And here is the line

   cvarb  = vonkar/log( z(i)/z0b(i) )

where z is an elevation derived from hydrostatic balance, and so uses the surface pressure.

Commenting out this line, the model did not get to run. Is there other way to turn this off?

In parallel, I think we should also consider @skamaroc point, that the time integration in CLUBB eddy diffusivity may be producing oscillations that does not play well with the MPAS solver. On our end, I suggest we continue with the other action item from last weeks mtg:

That's insightful. I am also checking out the state variables at each time step before the crash, I will update later.

And then we determined that the RHS variable is all NaNs in clubb's scalar mixing subroutine, and so some further debugging there would be helpful too (perhaps as Vince suggests, pinpointing where in the column a bad value is triggering a whole column of NaNs). @xhuang-ncar any updates?

Jesse is working on this and has figured out that the NaN is coming into CLUBB.

adamrher commented 2 years ago

Commenting out this line, the model did not get to run. Is there other way to turn this off?

The z/z0 is the height (above the surface?) of the lowest model level, divided by a roughness length z0. Reasonable values would be ln(50 m / 0.1 m) ~ 6. So why not just set this line to:

   cvarb  = vonkar/6._r8
dan800 commented 2 years ago

@xhuang-ncar Last week we had discussed simply commenting out the call to the dry deposition routine. Was that tried?

vlarson commented 2 years ago

Question for @vlarson: I noticed that the vertical dissipation in CLUBB is implemented using semi-implicit Crank-Nicolson time integration. This can produce oscillatory behavior for large eddy viscosities (i.e. large K dt/dz^2). Is there a way we can change the weights on the time levels in the integration? It looks like the weights (1/2, 1/2) are hardwired in the code. It may be that the vertical discretization in the nonhydrostatic solver in MPAS, that uses a Lorenz vertical staggering, is not happy with oscillations that may be produced by the mixing in these thin layers. I'm also wondering about potential feedback from the mixing and the nonlinear computation of the eddy viscosities.

Brian Griffin (@bmg929) and I looked at the advance_windm_edsclrm code. We'll think about how best to relax the hard-wiring of the weights. There are a number of places where that would need to happen.

For my own reference, I'll take this opportunity to reference possibly related tickets (#342 and E3SM-Project/E3SM#4073).

Do we know whether the bad values originate near the lower surface or aloft?

xhuang-ncar commented 2 years ago

@adamrher @dan800 @vlarson & everyone. Sorry for a late catch up on this. I was taking some time off for personal business.

Here is what I got: from this problematic line: cvarb = vonkar/log( z(i)/z0b(i) ) I set the z(i) to 5.0, similar to what Andrew and Adam suggested. And that works. At least, the run (I am testing the 60km aqua one) is going on without crashing again. I will get a longer simulation and also track back to the bad z values (as in z(i) = - r/grav air_temp(i) (1._r8 + .61_r8spec_hum(i)) log(p/pg)).

Thanks all! Will update further.

xhuang-ncar commented 2 years ago

@adamrher and all. Good news. I think we found the issue (at least for now). I have tested both 60km and 60-3km runs for the five-day simulation. Both went well. So far, I am tracing back why the z values become invalid, and that relates to the p/pg (i.e. pressure at midpoint first layer/surface pressure). Still figuring out what's going on with the pressures though. Many thanks to Adam and everyone! I will update further as more tests are conducting.

andrewgettelman commented 2 years ago

Further updates:

  1. The two pressures put into the offending line in the wet deposition (p/pg) are pressure_10m and psurf. After a traceback through the dry deposition code interfaces it pressure_10m = pmid(:,k) i.e. the midpoint pressure of the bottom level. Sometimes this gets to be less than surface pressure. Why?
  2. Reporting from @skamaroc :

Here's what's going in the dp_coupling for the pressure. The interface pressures are calculated using the hydrostatic equation integrating down from the top of the model. The mid-level pressures are calculated using the state equation and the MPAS state , i.e. the mid-level pressures computed in dp_coupling are what the full nonhydrostatic pressures in the MPAS model. This difference in how the interface pressures and mid-level pressures are what can lead to the problem we're seeing here. Of course for the hydrostatic models this is not a problem because the hydrostatic relation is used to compute all the pressures. I'm not sure what the best way to deal with this is at this point.

So for the moment we can fix this lowest layer height in dry deposition at 5 or 10m and it runs. The medium term solution is to make sure the interface and mid-point pressure calculation in dp_coupling is more consistent for a non-hydrostatic model. This might take help from @PeterHjortLauritzen

THanks!

andrewgettelman commented 2 years ago

Reporting on results from @skamaroc :

Here's what's going in the dp_coupling for the pressure. The interface pressures are calculated using the hydrostatic equation integrating down from the top of the model.

The mid-level pressures are calculated using the state equation and the MPAS state , i.e. the mid-level pressures computed in dp_coupling are what the full nonhydrostatic pressures in the MPAS model.

This difference in how the interface pressures and mid-level pressures are what can lead to the problem we're seeing here. Of course for the hydrostatic models this is not a problem because the hydrostatic relation is used to compute all the pressures. I'm not sure what the best way to deal with this is at this point.

I'm assuming this is in: src/dynamics/mpas/dp_coupling.F90

I cannot quite follow the logic in the code (since I think the call is to hydrostatic_pressure for both pint and pmid).

But it seems like we need to make the pint and pmid consistent for MPAS. Should we use the full non-hydrostatic calculation and adjust DP coupling? Or is there a reason not to? Will need guidance from @skamaroc and @PeterHjortLauritzen on the path forward: But it seems pretty simple to adjust and then test.

Thanks!

PeterHjortLauritzen commented 2 years ago

looking into it

PeterHjortLauritzen commented 2 years ago

The current physics-dynamics coupling is discussed (in note form) in:

https://www.overleaf.com/read/jzjvmnqdxcfc

(equation numbers I refer to may not continue to match as the Overleaf document evolves - apologies!)

As described by @skamroc the issue is that the interface pressure is computed using hydrostatic balance (equation 8 in Overleaf notes) and the mid-level pressure is computed using the equation of state (equation 19 in Overleaf notes). The mid-level pressure is therefore the full non-hydrostatic pressure which is not necessarily bounded by the hydrostatic interface pressures if there is non-hydrostatic motion. If that is the case the state passed to physics is inconsistent (unphysical).

Why this choice and possible solution:

Why? (this is going to be technical) Choose your poison! Since CAM physics is pressure-based and hydrostatic, and MPAS is constant volume and non-hydrostatic, it is hard to make CAM-MPAS physics-dynamics coupling 100% consistent (unless CAM physics is reformulated as constant volume physics and relaxes the hydrostatic assumption). But some level of consistency can be enforced; in particular, one does not want the energy fixer to fix the wrong energy!

In cam6_3_032 I chose the following constraints and assumptions:

  1. When the MPAS state is passed to CAM physics the z-levels diagnosed from pressure, temperature and moisture in physics is consistent with MPAS z levels (that are constant in MPAS) - when this is done using the hydrostatic interface pressure and non-hydrostatic mid-level pressure as described above (more details in Overleaf doc) then this is fulfilled.

  2. The total energy increment from heating yields the same column energy change in pressure and z-coordinates (this assumes that there is no flux of heat in-out the model top which is different in CAM physics and MPAS; constant p versus constant z) since column heating should be independent of the vertical coordinate system.

  3. The energy change associated with water change in the column (dme_adjust) is computed consistently with hydrostatic MPAS (water removed at a constant MPAS z-level is not the same as removing water from a constant pressure level in CAM physics where the heights evolve throughout physics!). Note that we have chosen that physics tendencies in physics and dynamics are associated according to index (i.e. we are not mapping tendencies from the new z-levels in physics back to MPAS z-levels). Hence there is a splitting error here ...

  4. The total energy fixer needs to use an energy formula consistent with hydrostatic MPAS (note that the total energy fixer fixes energy errors in the dynamical core, 3., and physics-dynamics coupling).

  5. Neglect non-hydrostatic effects (note that the only difference between hydrostatic and non-hydrostatic total energy formulas is the kinetic energy of vertical velocity; likely a small term in a global energy budget).

  6. As with SE and FV3, we neglect the mass-effect of condensates in the total energy formula (this is a very small term globally; but could also quite easily be fixed!).

Solution to pressure problem: Height is a diagnostic field in CAM physics. The energy formula in CAM physics does not use height. So energetically speaking the heights don’t matter in terms of the energy budget. That said, the height is used by certain parameterizations (and diagnostics) as an input variable to do their thing (e.g. compute thresholds for some process). Hence we could choose to violate 1, i.e. the mid-level height will not be entirely consistent with z in MPAS computed from equation (1) in Overleaf document. Perhaps that is OK since we have a splitting error anyway (with physics evolving on constant pressure and applied at constant volume index-wise).

I propose we use a hydrostatic mean pressure computed using a finite-volume mean approach (A7 in Overleaf document). In this case the mid z-level will not exactly match MPAS but since it is a diagnostic variable in CAM physics maybe that is “good enough”. This will make sure that pmid is bounded by the interface pressures sacrificing the “z consistency.

Draft code with fix 2 fixes here (search for #define pmid_fix) - @skamroc please advice which you prefer:

https://github.com/PeterHjortLauritzen/CAM/blob/mpas_pressure_fix/src/dynamics/mpas/dp_coupling.F90

(also discovered indexing bug in energy diagnostics made during PR changes - this is fixed in this code to)

It has been verified that the energy consistency persists with the pressure fix.

PeterHjortLauritzen commented 2 years ago

Update on testing of proposed solution using full level pressure half way between hydrostatic interface pressures:

Ran 13 months (used last 12 months for analysis, L32, F2000climo compset, 1 degree).

With the fix the z-levels diagnosed in physics do not match MPAS z levels. The Figure below shows difference (in meters) between CAM physics z levels (based on hydrostatic integration) and MPAS z levels for "random" locations as a function of level:

cam phys z versus mpas z

Same Figure but normalized (not exactly same locations):

cam phys z versus mpas z - normalized

The differences are on the order of 0.1-0.2%. For comparison here are similar Figures but showing how much z-levels move during physics (z end of physics minus z beginning of physics normalized by z):

how much do levels move in physics on average - normalized

=> the levels move less due to physics processes compared to the discrepancy between MPAS z levels and CAM physics z levels computed as described above.


The z level discrepancy impacts the energy fixer in the geopotential term (rhogz*dz):

In terms of the dry mass adjustment (dme_adjust) the change is: -0.410553 W/M^2 (dme_adjust using physics z levels) versus -0.41074 W/M^2 (dme_adjust using dycore z levels). The difference is in the 4th digit so small!

In all, the modification is acceptable in terms of energy budget closure, however, it would be more consistent to change the thermodynamic potential in physics so that z stays fixed (that is, however, a research project).

adamrher commented 2 years ago

In terms of the dry mass adjustment (dme_adjust) the change is: -0.410553 W/M^2 (dme_adjust using physics z levels) versus -0.41074 W/M^2 (dme_adjust using dycore z levels). The difference is in the 4th digit so small!

These are small, but isn't this only measuring the change in levels due to the physics? Whether you start with MPAS z or CAM z, it is only measuring the change due to physics, right?

Might these 100m differences in the top ten or so levels have a larger impact on the energy fixer (not dme adjust)?

andrewgettelman commented 2 years ago

Ran this last night and did some analysis this morning. (6 years, 120km, 32L) Looks good to me:

Atmospheric Energy balance:

MPAS RESTOM 1.905 RESSURF 1.836 RESID 0.068
MPAS-dplinV2 RESTOM 2.091 RESSURF 2.064 RESID 0.028

Differences from previous MPAS are minor for (A) cloud variables (single level), (B) Temp and (C) Zonal wind. Only differences in B & C are at higher latitudes and upper levels, which could be just some noise. Or it might be real, but the differences are small.

If this works with the 60km APE 58L and the 60-3km 58L, I am fine moving forward with this change. zmean_dplinV2 zmean_T_diff_dplinV2

zmean_U_diff_dplinV2

PeterHjortLauritzen commented 2 years ago

In terms of the dry mass adjustment (dme_adjust) the change is: -0.410553 W/M^2 (dme_adjust using physics z levels) versus -0.41074 W/M^2 (dme_adjust using dycore z levels). The difference is in the 4th digit so small!

These are small, but isn't this only measuring the change in levels due to the physics? Whether you start with MPAS z or CAM z, it is only measuring the change due to physics, right?

Correct (z using CAM physics z versus MPAS z)!

Might these 100m differences in the top ten or so levels have a larger impact on the energy fixer (not dme adjust)?

In terms of energy the terms are weighted with rho*dz. Since there is very little mass up there these differences do not really show (but it is indeed a bias)

PeterHjortLauritzen commented 2 years ago

zmean_U_diff_dplinV2

Above level 15ish CAM physics z is biased low (on average) so physics "thinks" that the heights are lower than they really are. I wonder if that systematically changes the gravity wave scheme (or some other parameterization) using z as input variable which leads to jet changes?

andrewgettelman commented 2 years ago

One more figure. I used the same code to plot the zonal and annual averaged 'PMID' on each pressure level. The difference with the new dp_coupling formulation is the lower right panel. In millibars, so generally < 1mb and small, up to 2mb near the surface. Might be related to the wind/temperature differences....

zmean_PMID_diff_dplinV2

andrewgettelman commented 2 years ago

From @xhuang-ncar

Glad to see the fix and know that the pmid difference is small. As a follow up to Andrew's comparison to the control run, I can also confirm that the pmid fix works with the 60km APE 58L and the 60-3km 58L. I am going to apply this update for future CAM-MPAS simulations.

Regarding @PeterHjortLauritzen 's comment: do we need to investigate further? I guess the issue is what parameterizations use z as an input, and can we diagnose potential differences?