E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/
Other
79 stars 56 forks source link

Can we omit levels above some threshold when writing 3d output? #2652

Open PeterCaldwell opened 10 months ago

PeterCaldwell commented 10 months ago

The top bunch of layers in EAMxx is occupied by the sponge layer and most/all of our analysis is in the troposphere, so maybe we could save substantial disk space by just not writing out levels above some threshold. I suspect that threshold would have to be model level because we'd still want a constant number of levels for all columns and we can already do this for pressure coordinate output by just not asking for any very small pressure levels.

Not a priority, just a thought.

AaronDonahue commented 10 months ago

Yes, I think this should be pretty straightforward. We could add a parameter to the output stream control that specifies a level range. Then we would need to think about an efficient way to copy a view such that it is just those levels in the range. Maybe a diagnostic similar to what we currently do for Field at Level. @bartgol could probably comment on best approach.

bartgol commented 10 months ago

I see a few options. Probably the most intriguing is to power up our implementation of subfields. Currently, we can only create a subfield when fixing a single index of one of its dimensions. I've been thinking about allowing to subviewing an interval of indices (which kokkos already supports). We would have to modify some internal stuff in the field metadata, but nothing too complicated, I think.

Once that is done, we could do something like this, during the output run:

for (auto const& name : m_fields_names) {
    // This is the current impl
   //  auto  field = get_field(name,"io");

   // This is the proposed new impl
   Field field;
   if (m_has_level_bounds) {
     auto whole_f = get_field(name,"io");
     int lev_dim = f.rank()-1; // levels are always at the end
     field = whole_f.subfield(lev_dim,m_lev_bounds); // lev bounds can be a std::pair<int,int>
  } else {
    field = get_field(name,"io");
  }

Later on, we'd have to always use field.get_strided_view instead of field.get_view, or provide 2 impl of the switch right below (since subviewing the last dim will create strided views). But that's not a big deal.

One catch: what do we do regarding midpoints vs interface quantities? Do we use the same bounds? Or should interface quantities always have 1 more entry than midpoint ones?

PeterCaldwell commented 10 months ago

should interface quantities always have 1 more entry than midpoint ones?

I don't see how interfaces wouldn't always have 1 more entry than midpoints. This seems like a fundamental property of geometry. For example |x|x|x| where x is midpoint and | is interface...

PeterCaldwell commented 10 months ago

The subfield approach makes sense to me. I'm still not clear what this would mean for yaml files or for output files. Would we insist that only the first 80 entries of one variable were written out, then the whole output file would contain variables with just 80 vertical levels?

PeterCaldwell commented 10 months ago

Update: max tropopause height is typically ~70 mb. Chris T reports that we have ~100 layers below 70mb using the "lev" output dimension (which is like a global ave of the pressure at a given model level). Thus implementing this change would save us ~20% of our storage cost for 3d fields. That's substantial but not overwhelming.

@brhillman - do you think it would be worth waiting our simulations for a week in order to have this reduction? Also, I'm totally guessing how long it would take to get this done b/c everyone is already working on important stuff and I don't want to presume that this would be that easy.

There's also a question of whether it would make more sense to just write output on a set of tropospheric pressure levels if one wanted to omit the stratosphere.

bartgol commented 10 months ago

The subfield approach makes sense to me. I'm still not clear what this would mean for yaml files or for output files. Would we insist that only the first 80 entries of one variable were written out, then the whole output file would contain variables with just 80 vertical levels?

Yes, what I was envisioning is adding a new (optional) parameter to the yaml file, something like vertical_level_lower_bound: 20 or something, which would cause levels 0-19 to be ignored _for all 3D quantities`.

bartgol commented 10 months ago

There's also a question of whether it would make more sense to just write output on a set of tropospheric pressure levels if one wanted to omit the stratosphere.

Perhaps we could simply take advantage of the vertical remap capabilities? I don't know if we have an actual number for the cost overhead, but I'm sure we can find out manually by

  1. run a case with a single output yaml file, outputing N fields
  2. run the same case with the same output yaml file, but adding a vert remap file, with the vert remap being "an identity" (that is, use the same hybrid coords as the model).

This should give us a feeling of how much it costs to run vert remap.

Edit: thinking a bit more about this, I'm not sure vert remap can "focus on a piece of the atm". I wonder if it simply increases/decreases the number of levs, but doesn't change bot/top location. @AaronDonahue may know better.

AaronDonahue commented 10 months ago

I think vertical remapping would get us what we want w.r.t. grabbing all levels in say the tropopause. The difference is we would get remapped (interpolated) values onto the set of pressure levels we care about, rather than the raw data in say the lowest 100 levels. Not sure how important it would be to have raw data for our 3D analysis.