
Extreme orographic cold pools forming in 4km E3SM runs #5089

Open zarzycki opened 2 years ago

zarzycki commented 2 years ago

Requested by @paullric, who asked me to also cc @ambrad.

As part of HyperFACETS, I have been running some 4km refined simulations (ne30 refined x32) over the WUS. These use essentially the same horizontal grids as in Rhoades et al., 2018.

I am using Betacast to initialize E3SM in December 1996 from ERA5 data. We have used this technique quite a bit over the past few years for storylines and attribution work.

When running the 4km simulations, I repeatedly ran into model crashes. Interestingly, these crashes did not appear to be directly dycore related, as the model ran stably for many timesteps and would happily complete 7-day simulations with various initializations.

A few things I have tried.

This happens with both theta-l (we are currently running) and preqx (previous runs). We are running in hydrostatic mode, although we have run CAM-SE with these grids without issue, and in the cool season motions should be relatively hydrostatic over this area. We are also running with ZM on, although I think Alan Rhoades found that convective tendencies are super small over this area during winter.

Essentially, what appears to happen is that cold air is pooling in various valleys in the high resolution portion of the nest during the overnight hours. This can be most obviously seen in the TREFHT field, but also shows up in other thermodynamic fields + PSL. To verify this, I printed output every timestep. Here is a brief animation (apologies for the super awesome ncview + imagemagick job) showing the cold contours appearing and evolving in the surface temp over a day or so (output every few minutes). You can see a very large one (TS ~ 205K!) and some smaller ones "pop" up in the valley to the south, although they only get down to ~250K or so.

[animation: coldblobs]

Note that these are wildly transient features. While these "blobs" do occur in most runs, their strength, location, duration, etc. seems to vary run-to-run and with subtle initialization changes, so it seems to be somewhat random from a numerics perspective.

I have currently interpreted this as an instability caused by some combination of runaway katabatic winds and surface decoupling associated with stable PBLs. They tend to occur (anecdotally) with high westerly winds at the ridgetop. My hypothesis is that as air rapidly flows down the orography, it pools in valleys and generates a feedback loop where the atmosphere stabilizes and TKE/fluxes are shut off. There is then no energy mechanism to balance LW cooling (it's night so no SW, and no surface fluxes or other mixing from the atmosphere).
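
To get a feel for the magnitude, here is a crude back-of-envelope (all numbers are assumed round values for illustration, not taken from the runs):

    sigma * Ts^4 at 250 K               ~ 5.67e-8 * 250^4 ~ 220 W/m2
    clear-sky downwelling LW (assumed)  ~ 150 W/m2
    net surface LW loss                 ~  70 W/m2

    heat capacity of a ~5 cm snow/soil skin layer (assumed):
    C ~ 300 kg/m3 * 2100 J/kg/K * 0.05 m ~ 3e4 J/m2/K

    cooling rate ~ 70 / 3e4 ~ 2e-3 K/s ~ 8 K/hour

With the turbulent fluxes shut off, that is tens of K over a long winter night, which is at least the right order of magnitude for the runaway behavior we see.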

Note that sometimes this results in uber cold surface temperatures (like 180K!) -- when we get columns with TS < 200K, I would say the model crashes about 30% of the time, but about 70% of the time it actually recovers, either by eventually mixing out the cool air or having it heated through some mechanism (SW rad after sunrise?).

The only solution to this for our runs operationally has been to set:

vtheta_thresh=220.0    ! default 100.0

which triggers the potential temperature limiter in the dycore at 220K. The model will hit this a few times (as evidenced by some CAAR errors in the log files), but this has essentially stabilized our runs 100%. Looking at the timestep output, these cold pools still form but can't "run away" and are eventually warmed/mixed out.
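
Conceptually (this is just my sketch of what the floor does, not the actual theta-l/CAAR source), the limiter behaves like:

    ! illustrative sketch only -- not the actual E3SM dycore code
    subroutine apply_vtheta_floor(vtheta, vtheta_thresh, hit_floor)
      implicit none
      real(8), intent(inout) :: vtheta(:)       ! virtual potential temperature [K]
      real(8), intent(in)    :: vtheta_thresh   ! namelist floor, e.g. 220.0 here
      logical, intent(out)   :: hit_floor       ! true if the floor was triggered
      hit_floor = any(vtheta < vtheta_thresh)
      ! clamp from below; the real code also reports the CAAR warnings we see in the logs
      if (hit_floor) vtheta = max(vtheta, vtheta_thresh)
    end subroutine apply_vtheta_floor

i.e., it's a hard clamp rather than extra mixing, which is consistent with the cold pools still forming but not being able to run away.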

Another fly in the ointment: I'm not sure how unique this is to these particular model conditions. I have not had the cycles to run a seasonal simulation to see if this issue keeps appearing all winter in a 4km run.

Anyways, this is rapidly pushing the edge of my land/atm coupling knowledge -- not sure if addressing surface drag (GW scheme?) would be of any benefit, but figured it's worth putting here.

ambrad commented 2 years ago

Also cc @mt5555 and @oksanaguba.

whannah1 commented 2 years ago

@zarzycki Can you elaborate on your efforts to verify whether ZM could be involved? You say that "convective tendencies" are small, but I'm not sure that would be a good indicator of ZM's ability to generate an evaporatively driven downdraft.

Rather than trying to add special output to check for a ZM influence directly, I think it would be easier to just disable ZM altogether and see how the surface temperature extremes change. Alternatively... we could somewhat easily do an experiment with ad-hoc mods to disable ZM based on the parent grid cell area... I'd be really interested in that experiment either way.

zarzycki commented 2 years ago

That's a good point. I did look at some of the tendencies coming from ZM at some point and didn't see anything outrageous, but I suppose it would only take some moderate downward acceleration to exacerbate the problem? Although I am running with an extremely small dtime without tweaking zm_tau, which also tends to tilt activity away from ZM.

Seems easy enough... when Cori comes back up I can toss in a run with deep_scheme off and see if anything changes.

mt5555 commented 2 years ago

This seems very similar to the "cold T" problem we see at higher resolutions. That was also very insensitive to most dycore parameters, but was the motivation for adding the vtheta_thresh code - which did control it. But in our previous simulations it was always in very dry conditions at coastlines.

To fix it (so that the vtheta_thresh code is rarely triggered), we ended up adjusting a mixing parameter in SHOC (a similar parameter exists in CLUBB). In CLUBB, it's the lmin_coeff. IIRC, it reduces the amount of vertical mixing in dry conditions, and lowering the parameter allows for more mixing. Not sure if this is the same issue, but you could try lowering lmin_coeff from 0.4 to 0.025. (If this is using SHOC instead of CLUBB, the parameter might have a different value.)

(edit: 0.025, not 0.25)
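
For a sense of scale (assuming lmin is this coefficient times a fixed 40 m increment, which I haven't verified in the code myself):

    lmin = lmin_coeff * 40 m
    0.4   * 40 m = 16 m
    0.025 * 40 m =  1 m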

swrneale commented 2 years ago

@zarzycki what vertical resolution are you using? Existing large cooling tendencies at the surface from ZM can be exacerbated in L58, for example. It seems unlikely to be causing such strong cooling as you're seeing, I would have thought.

zarzycki commented 2 years ago

@mt5555 Will look into it and report back!

@swrneale L72.

tangq commented 2 years ago

We have seen these "cold T" CAAR errors in v2 RRM runs refining to 25 km, which led to increasing the default vtheta_thresh to 100 K (from 10 K?). It seems that at higher resolution (e.g., 4 km) the model needs a higher threshold to recover from these cold cases.

indahector commented 1 year ago

I have been running some 14km refined simulations (ne30 refined x8) over Mexico and the Southwestern USA. When running these, I repeatedly ran into model crashes. Interestingly, I could work around a crash by increasing se_nsplit by 1, but after a couple of simulated months the model would crash again. I did this repeatedly and managed to run for almost 7 years, but it got to the point where se_nsplit was around 40 and the model was very slow to run.

Apparently, this could also be related to cold air pooling in the high-resolution portion of the nest during the overnight hours. I followed the same methodology as @zarzycki and printed output every timestep before the model crashed. Here is a brief animation of TS. You can see a cold pool forming near the Rockies with TS ~ 200K.

Right now, I am running the model using the limiter proposed by @zarzycki:

vtheta_thresh=220.0 ! default 100.0

with no model crashes after 3 simulated years.

https://user-images.githubusercontent.com/68174511/215621507-0baf55e2-0a23-4fb9-80c1-7f65ecb2429b.mp4

mt5555 commented 1 year ago

Have you tried changing the CLUBB parameter lmin_coeff from .4 (default) to 0.025?

zarzycki commented 1 year ago

I do think I tried that previously and it was not a silver bullet (that is, there may have been occasions it delayed the model crash, but the limiter was the only thing that really “solved” things at the time). Hector, not sure if that’s worth trying with your “bad” configuration (you’ll want to roll back se_nsplit to the smaller value).

I haven’t looked at the ELM tiles — I’ve been assuming this is an EAM issue, but it could be some sort of land surface coupling with snow/ice or other interesting surface type.


indahector commented 1 year ago

Thank you, @zarzycki and @mt5555. @mt5555 I have not tried modifying CLUBB parameters. @zarzycki I did roll back to se_nsplit=5 and the initial simulation date; thanks for clarifying that point.

By the way, I have been running with the vtheta limiter for 8 simulated years now without any apparent trouble. I'll update you if something changes.

mt5555 commented 1 year ago

This issue is similar to what we see in several other model configurations at high resolution, and I think it is a lack of vertical diffusion in dry, stable boundary-layer conditions. That's what the CLUBB parameter is trying to address. The limiter is a straight-up truncation. Long term it would be nice to tie that to some vertical dissipation.

zarzycki commented 1 year ago

Yes, it seems likely it's an overly stable atmosphere. If there is little vertical mixing, the surface rapidly cools via outgoing LW at night. If the atmosphere is dry, no clouds form to arrest the cooling by absorbing/back-radiating energy. The only savior in that case would be SW radiation when the sun comes up, or horizontal advection.

I was looking at the code and I saw the below in parameters_tunable.F90.

  real( kind = core_rknd ), private :: &
    lmin_coef = 0.100000_core_rknd    ! Coefficient of lmin    [-]

  ! Constant Parameters
  real( kind = core_rknd ), parameter :: &
    lmin_deltaz = 40.0_core_rknd ! Fixed value for minimum value for the length scale.

  lmin = lmin_coef * lmin_deltaz ! New fixed value

So currently the default would be ~4m. That is, Lscale is approximately truncated at 4m. Larger Lscale means less damping on turbulence, so I think we'd want a higher lmin_coef?

CLUBB does have host_dx and host_dy which would in theory allow the minimum Lscale to be scaled as a function of horizontal grid spacing if desirable.
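
Purely as a sketch of that idea (hypothetical, not existing CLUBB code; the reference spacing and the form of the scaling are made up, and whether the floor should go up or down at high resolution is exactly the open question here):

    ! hypothetical sketch: tie the Lscale floor to the host grid spacing
    program lmin_scaling_sketch
      implicit none
      integer, parameter :: core_rknd = selected_real_kind(12)
      real(core_rknd), parameter :: lmin_coef   = 0.1_core_rknd     ! default from the snippet above
      real(core_rknd), parameter :: lmin_deltaz = 40.0_core_rknd    ! [m]
      real(core_rknd), parameter :: dx_ref      = 100.0e3_core_rknd ! made-up reference spacing [m]
      real(core_rknd) :: host_dx, host_dy, lmin

      host_dx = 4.0e3_core_rknd   ! e.g. the 4 km refined region
      host_dy = 4.0e3_core_rknd

      ! one possible form: scale the length-scale floor by sqrt(dx*dy)/dx_ref
      lmin = lmin_coef * lmin_deltaz * sqrt(host_dx*host_dy) / dx_ref

      print *, 'lmin at this grid spacing [m]:', lmin
    end program lmin_scaling_sketch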

mt5555 commented 1 year ago

My understanding (from what I've been told, not from looking at the code) is that "lmin_coeff" is a threshold - the minimum at which some process can be turned on. So lowering lmin_coeff allows a vertical mixing process to be turned on more often, resulting in more mixing. Pinging @vlarson to make sure.

In my experiments from 2020, at global NE256, we had 1-2 events in a 3 month period, and lowering lmin_coeff eliminated them. But it was only a 3 month run, so could have just been a different trajectory.

whannah1 commented 1 year ago

Do we have any sense of what the smallest reproducer of this problem might be?

mt5555 commented 1 year ago

In my notes from different SCREAM and E3SM simulations, I've been collecting issues that I think are all related to this. But not 100% sure so take this with a lot of caveats:

NE120: rare (a few times per century). @tangq saw this once in one of his long RRM simulations with ne120 over CONUS.

NE256 (E3SM): ~1 per month

NE1024 (SCREAM): ~1 per day

mt5555 commented 1 year ago

@zarzycki wrote "The only savior in that case would be SW radiation when the sun comes up or horizontal advection." -> this observation might help explain our dycore experiments, where the only thing that helped was horizontal diffusion. But as these are rare, isolated and short lived, it seems a mistake to crank up the global horizontal diffusion.

whannah1 commented 1 year ago

I'm very curious to see if the actual radiative cooling rates are consistent with this view that under-mixing and over-stabilization are the root cause, as I'm still very skeptical about that.

zarzycki commented 1 year ago

I have a 4km run I think I can repro this issue on during Day 1-2 (well, I was able to 6 months ago, fingers crossed it still works). @whannah1 what output fields would help diagnose?

whannah1 commented 1 year ago

@zarzycki I'm thinking we need a brute-force approach here where we get a restart within a couple hours of the event and output as many vertically resolved variables as we can think of at each time step, restricted to the region of interest.

How large is this grid, in terms of number of physics columns? And is it np4 or pg2?
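
To make that concrete, something along these lines in user_nl_eam is what I'm picturing (the field list is just a starting point, the frequency/file sizes would need tuning, and if EAM still supports CAM-style regional subsetting of history tapes that would keep the data volume manageable):

    ! sketch for user_nl_eam: every-time-step output on a second history tape
    nhtfrq = 0, 1          ! tape 1 monthly, tape 2 every physics time step
    mfilt  = 1, 48         ! samples per file (arbitrary)
    fincl2 = 'TS', 'TREFHT', 'PSL', 'T', 'Q', 'U', 'V', 'OMEGA', 'QRL', 'QRS'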

indahector commented 1 year ago

@whannah1 I also have my grid running right now, in case it could also help. As @zarzycki suggested to me, I have a branch case starting right before the cold pool crash and outputting every time step. I could re-run that and provide the output. It is an ne30x8 grid (~14 km in its finest region). It has 21309 se nodes, and I believe it is an np4 grid.

vlarson commented 1 year ago


lmin_coef is a coefficient in front of CLUBB's lmin parameter, which is a threshold that limits CLUBB's turbulent mixing length near the lower surface. A smaller length scale means stronger damping by CLUBB's turbulent dissipation terms. I have to confess that it is counter-intuitive to me why that helps in stably stratified layers.

At UWM, we've also contemplated the idea of boosting nu1 or nu9 (background diffusivities for w'2 and u'2) in order to boost turbulence near the surface in stable conditions. But we've never actually tried it to see if it helps.