Met and input ensembles: adjustments to do.conversions, met.process, write.configs, and sda.enkf

mdietze commented 7 years ago

PEcAn needs the ability to handle ensembles of inputs as a way of capturing uncertainties in drivers and initial conditions.

The following comes from a discussion with @araiho and @tonygardella at the PalEON SDA Hackathon

Meteorology

We focus first on meteorology, since it is the first case we want to deal with.

Key parts of the proposed design:

- Ensemble members will be named prefix.ens[ID] (e.g., a CF met ensemble would be prefix.ens[ID].[year].nc). Most of the existing functions that have to deal with ensembles will therefore simply need to be called in a loop; internally they will have no idea that they're dealing with an ensemble member and will just treat prefix.ens[ID] as the prefix.
- Each ensemble member will be inserted into the database individually, not as one entry for the whole ensemble. This seems to make implementation easier and improves provenance (we'll know exactly which ensemble member was used for each run).
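A minimal sketch of the naming convention above, written in Python purely for illustration (PEcAn itself is R). The helper names and the "mymet" prefix are hypothetical; the point is only that the ensemble ID is folded into the prefix, so the function being looped over never sees it separately.

```python
def ensemble_prefix(prefix, ens_id):
    """Fold the ensemble ID into the prefix: prefix.ens[ID]."""
    return f"{prefix}.ens{ens_id}"


def cf_met_files(prefix, years):
    """Hypothetical stand-in for an existing function; it only ever sees a prefix."""
    return [f"{prefix}.{year}.nc" for year in years]


# The caller owns the loop over ensemble members.
files_by_member = [
    cf_met_files(ensemble_prefix("mymet", ens_id), [2006, 2007])
    for ens_id in range(1, 4)
]
print(files_by_member[0])
```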

The above proposal would NOT be compatible with a previous discussion about not storing ensemble members and instead regenerating them on the fly from a saved seed. Storing every member comes at the cost of increased disk storage, but makes provenance and repeatability much simpler and less error-prone.

Within met.process we envision two major use cases:

1) the input meteorology is itself an ensemble

within met.process we're basically just looping over each ensemble member

2) met.process itself generates the ensemble

For example, we would download one met product, convert that one product to CF, and generate N ensemble members within gapfill/downscale (so we'd end up with a list of results and a vector of input IDs). We'd then loop over every ensemble member when calling met2model, so met2model would have no idea it's processing an ensemble member.
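The second use case can be sketched as follows, again in Python for illustration only. Everything here is a stand-in: `insert_input`, `generate_ensemble`, and `met2model_stub` are hypothetical names, and the dictionary replaces the real database. The sketch only shows where the single input ID fans out into a vector and that the per-member step stays ensemble-unaware.

```python
from itertools import count

_next_id = count(1000)  # stand-in for database-assigned input IDs
registry = {}           # stand-in for the inputs table: id -> file prefix


def insert_input(prefix):
    """Register one input and return its ID (one entry per ensemble member)."""
    input_id = next(_next_id)
    registry[input_id] = prefix
    return input_id


def generate_ensemble(prefix, n_members):
    """Hypothetical gapfill/downscale step that fans one input out into N members."""
    return [f"{prefix}.ens{i}" for i in range(1, n_members + 1)]


def met2model_stub(input_id, model="MODEL"):
    """Processes exactly one input; it has no idea the input is an ensemble member."""
    return insert_input(f"{registry[input_id]}.{model}")


raw_id = insert_input("met")    # download one met product
cf_id = insert_input("met.CF")  # convert that one product to CF
member_ids = [insert_input(p) for p in generate_ensemble(registry[cf_id], 3)]
model_ids = [met2model_stub(i) for i in member_ids]  # loop lives in the caller
```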

write.configs

We'll need to update the ensemble option (code and settings) to let you choose WHAT you want an ensemble of (just parameters, just selected inputs, or both parameters and inputs). This would pass a specific list of inputs to each write.config.[model], so the model code doesn't need to know anything about ensembles.
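One way to picture this choice, as a hedged Python sketch rather than the actual PEcAn settings: the `vary` set and the function name are assumptions made up for this example. Whatever is not being varied stays fixed at its first entry, and each run receives one concrete set of inputs.

```python
def runs_for_ensemble(n_runs, param_draws, met_members, vary):
    """Build one run spec per ensemble member.

    `vary` is a hypothetical option naming what the ensemble is over,
    e.g. {"params"}, {"met"}, or {"params", "met"}.
    """
    runs = []
    for i in range(n_runs):
        params = param_draws[i % len(param_draws)] if "params" in vary else param_draws[0]
        met = met_members[i % len(met_members)] if "met" in vary else met_members[0]
        # The model-specific config writer would receive only this one spec.
        runs.append({"params": params, "inputs": {"met": met}})
    return runs


met_only = runs_for_ensemble(3, ["p1", "p2", "p3"], ["m1", "m2", "m3"], vary={"met"})
```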

SDA

Prior to the first call of split.inputs, we need to detect the maximum ensemble size (i.e., which input type has the largest ensemble) and create a vector of samples from that. To not waste ensemble members, we'll take the first n members in order, then sample with replacement if we need a larger ensemble.
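A sketch of one reading of this rule, in Python for illustration. The function name and the `members_per_input` mapping are hypothetical, and the interpretation (use members 1..n in order, then pad with draws with replacement only when more runs are requested than members exist) is an assumption about the intent.

```python
import random


def ensemble_sample_vector(members_per_input, n_runs, seed=None):
    """Return n_runs ensemble numbers drawn from the largest input ensemble."""
    n_max = max(members_per_input.values())  # which input type has the largest ensemble
    rng = random.Random(seed)
    # Use each available member once, in order, before any repeats.
    in_order = list(range(1, min(n_max, n_runs) + 1))
    # Only if more runs are requested than members exist: sample with replacement.
    extra = [rng.randint(1, n_max) for _ in range(n_runs - len(in_order))]
    return in_order + extra


print(ensemble_sample_vector({"met": 50, "soil": 5}, 10))
```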

split.inputs.[MODEL]

split.inputs will take a new argument, ensemble number, which will default to 1

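For each input, choose the ensemble member by taking the ensemble number modulo the number of members available for that input. For example, if there are 50 met drivers and 5 soil drivers and ens = 48, then we use met ensemble member 48 and soil ensemble member 48%%5 = 3. (With members numbered from 1, a remainder of 0 needs to map to the last member.) The function then proceeds with any split as before and returns a list of inputs where each input has only that ensemble member's drivers.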

The code calling split.inputs will loop over the ensemble sample vector and save a whole list of input lists
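The member selection and the calling loop can be sketched like this (Python for illustration; `member_for` and `split_inputs_stub` are hypothetical names). One assumption goes beyond the text: because members are numbered from 1, a plain modulus would return 0 whenever the ensemble number is an exact multiple of the member count (50 %% 5 is 0), so the sketch shifts by one before and after wrapping. For the example in the text it gives the same answer, member 3.

```python
def member_for(ens, n_members):
    """Wrap a 1-based ensemble number onto 1..n_members."""
    return (ens - 1) % n_members + 1


def split_inputs_stub(inputs, ens=1):
    """Pick one member per input type; the actual splitting is omitted here."""
    return {
        name: members[member_for(ens, len(members)) - 1]  # list index is 0-based
        for name, members in inputs.items()
    }


inputs = {
    "met": [f"met.ens{i}" for i in range(1, 51)],   # 50 met drivers
    "soil": [f"soil.ens{i}" for i in range(1, 6)],  # 5 soil drivers
}

# The caller loops over the ensemble sample vector and keeps a list of input lists.
ensemble_sample = [1, 48, 50]
input_lists = [split_inputs_stub(inputs, ens) for ens in ensemble_sample]
print(input_lists[1])
```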

write.configs

The loop over ensemble members should only need inputs to be changed to inputs[[i]] to make the inputs specific to each ensemble member.

jsimkins2 commented 7 years ago

@mdietze I'm adding this functionality to the downscale script that's nearly ready for a pull request. Here is the name generated for each ensemble member: "MACA.dwnsc.ens1.2006.nc". How does that look? The file that was downscaled is titled "MACA.IPSL-CM5A-LR.rcp85.r1i1p1.2006.nc". Should I add the MACA-specific information to the downscaled ensemble member's name as well?

mdietze commented 7 years ago

A bit long, but yes, you should probably add the specification of what GCM, RCP, etc. was run to the prefix.

istfer commented 6 years ago

After today's meetings, I just wanted to tag people who are thinking about met ensembles and met uncertainty here again, so we can join efforts and further the discussion: @mdietze @araiho @ankurdesai @mollyaufforth

mdietze commented 6 years ago

Yesterday @bcow, @Luke-Dramko, and I had a chat where we revisited this design and reconsidered whether there was a better option that would 'conserve' input IDs. The numerical weather forecasting example would mint 21 new input IDs per site per 6 hr (~30k/yr) for every site that we set up real-time forecasting for (though right now we're only planning on daily, with NEON + Willow Creek as the sites). For now we didn't find one. We also fleshed out more of the details of what needs to be implemented for the numerical weather forecast met example and who that intersects with:

We'd like to add three additional pieces of information to the met registry files.
- <ensemble> - defaults to FALSE for the case where a met product provides a single time series; if a met product provides an ensemble, set to the ensemble size
- <forecast> - defaults to FALSE; setting it to TRUE changes how date conflicts are handled
- information about what is being written out in the filename, in particular the POSIX format for the start and end times of a met file. Right now the default is that the start time is %Y (4-digit year) and there's no end time, but for numerical weather forecasts we also need to write out the month, day, and time the forecast was made (and let's stick to ISO standard formats please!)
met.process functions that write out ensembles should put each ensemble member in its own appropriately named folder (i.e. include ensemble name or number) and the ensemble info should be in the file names too. So, for example, the NOAA_GEFS download should return files that are something like NOAA_GEFS.[site].[ens].[start_datetime].[end_datetime].nc
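A sketch of how such a path might be assembled, in Python for illustration. The format string, folder layout, site label, and function name are all assumptions; the text only asks for ISO-style date-times and a per-member folder. Note that the extended ISO 8601 form contains colons, which are not legal in file names on every file system, so a colon-free variant such as %Y%m%dT%H%M may be preferable in practice.

```python
from datetime import datetime
from pathlib import PurePosixPath

ISO_MINUTES = "%Y-%m-%dT%H:%M"  # assumed format; the thread only says "ISO standard"


def forecast_path(outdir, product, site, ens, start, end):
    """Build [outdir]/[member folder]/[product].[site].[ens].[start].[end].nc"""
    member = f"{product}.{site}.{ens}"  # one folder per ensemble member
    name = f"{member}.{start.strftime(ISO_MINUTES)}.{end.strftime(ISO_MINUTES)}.nc"
    return PurePosixPath(outdir) / member / name


path = forecast_path(
    "met", "NOAA_GEFS", "site1", 3,
    datetime(2018, 6, 1, 6, 0), datetime(2018, 6, 16, 6, 0),
)
print(path)
```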
Right now met.process functions return a dataframe. Functions that produce ensemble met (either by download or by downscaling) should return a list of dataframes. When doing the database insert, met.process should loop over this list and insert every ensemble member as its own input. This would result in that stage of met.process producing a vector of input IDs rather than a single input ID.
Downstream, any module in met.process that receives a vector of input IDs should loop over calls to that module. The module would handle each ensemble member individually and would have no idea that this is an ensemble, thus requiring no internal changes. Because each call to the module produces a single input ID, the result of the loop would be a vector of input IDs.
Within convert.inputs we previously spent a good chunk of time creating a system for appending new years to existing met time series, and this code is unfortunately still a bit fragile (it works, but no one wants to touch it as it's easy to break). For iterative forecasts we'll need to create a new, high-level case that skips over all of this complexity and performs a different set of date checks. Specifically, if it's given a new start time, it should call the met function and create a new input. If it's given an existing start time, it should detect that we've already processed that weather forecast and return the input IDs (which will most likely be a vector of input IDs of length <ensemble>).
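The forecast-specific date check described above might look roughly like this (Python for illustration; all names are hypothetical). The dictionary stands in for what would really be a database lookup keyed on the forecast start time, and the fake met function just fabricates IDs.

```python
def forecast_inputs(start_time, processed, make_inputs):
    """Return input IDs for a forecast, creating them only for a new start time."""
    if start_time in processed:
        # Already processed this forecast: return the stored vector of IDs.
        return processed[start_time]
    input_ids = make_inputs(start_time)  # new start time: call the met function
    processed[start_time] = input_ids
    return input_ids


calls = []


def fake_met_function(start_time, ensemble_size=3):
    """Stand-in for a download/downscale step that yields one ID per member."""
    calls.append(start_time)
    base = 100 * len(calls)
    return [base + i for i in range(ensemble_size)]


processed = {}
first = forecast_inputs("2018-06-01T00:00", processed, fake_met_function)
again = forecast_inputs("2018-06-01T00:00", processed, fake_met_function)
later = forecast_inputs("2018-06-01T06:00", processed, fake_met_function)
```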
At the end of met process, the <settings> object should have a list of <met> entries within <run><inputs>. If written out to xml it would look something like:
```
<run>
  <inputs>
    <met>
      <id> 1234 </id>
    </met>
    <met>
      <id> 1235 </id>
    </met>
    <met>
      <id> 1236 </id>
    </met>
  </inputs>
</run>
```
This should also be a valid way to specify ensemble inputs in the pecan.xml at the start, or for any other inputs that we might ensemblize (IC/veg, soils, etc)
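To make the structure concrete, here is a small Python sketch (illustration only, not PEcAn's settings reader) that pulls the vector of met input IDs out of the fragment above using the standard library's XML parser.

```python
import xml.etree.ElementTree as ET

xml_text = """
<run>
  <inputs>
    <met><id> 1234 </id></met>
    <met><id> 1235 </id></met>
    <met><id> 1236 </id></met>
  </inputs>
</run>
"""

run = ET.fromstring(xml_text)
# One <met> entry per ensemble member; strip the padding around each ID.
met_ids = [int(met.findtext("id").strip()) for met in run.findall("./inputs/met")]
print(met_ids)
```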
Downstream from met.process, I've already spoken to @istfer and @para2x about the fact that we need to refactor the SDA code and to take a close look at run.write.configs to make sure that the code for generating model ensembles is modularized/shared between these two -- conceptually there should be no difference between ensembles started by the main workflow and those started by SDA, and indeed SDA should be able to pick up a general ensemble run and perform an Analysis and reforecast on it.
Within the web interface's met pulldown menu, ensemble weather forecasts need to be FILTERED OUT so that only the name tag is there -- we don't want to see thousands of met files for a site that just differ by start date and ensemble member. Also, when doing a run based on numerical weather forecasts, the start date is what's followed and the end date should be readjusted accordingly (e.g. if you want to do a NOAA_GEFS run that starts on 2018-06-01 and ends 2018-07-01, then the end date would become 2018-06-16 because the forecast is only 16 days).
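The end-date adjustment can be sketched as below (Python for illustration; the function name is hypothetical). To reproduce the example in the text, where a 16-day forecast starting 2018-06-01 ends on 2018-06-16, the sketch assumes the start day counts as day one; if the horizon is instead measured as elapsed time from the start, the offset would be the full 16 days.

```python
from datetime import date, timedelta


def adjust_end_date(start, requested_end, horizon_days=16):
    """Clamp the requested end date to the last day covered by the forecast."""
    last_forecast_day = start + timedelta(days=horizon_days - 1)  # start day is day 1
    return min(requested_end, last_forecast_day)


print(adjust_end_date(date(2018, 6, 1), date(2018, 7, 1)))
```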
In numerical weather forecasts generated by others there is no 'best' ensemble member. By contrast, if we generate our own ensembles for other inputs (e.g. soils, initial conditions) I think we should retain the parameter ensemble assumption of putting the 'best' estimate in the first slot so that becomes the one that's used if the user asks for a single run (ensemble size = 1).

github-actions[bot] commented 4 years ago

This issue is stale because it has been open 365 days with no activity.
