Data system strategy meetings 10 and 31 January 2019

bpbond commented 5 years ago

Umbrella issue summarizing discussion between @bpbond, @pralitp, @rplzzz, and @kvcalvin.

Configurations. Related: #696 #797. Each XML output should belong to one or more "families", and users can build just one set (default GCAM, no non-CO2s, GCAM-USA, etc.). This is urgent because of the upcoming GCAM-USA PR.
Shim into the driver #12. A mechanism to modify data (either a single number or an entire object) in a transparent and reproducible way.
Generating differential XMLs for reproducibility and transparency #1061 #756. In fact, generating GCAM's configuration.xml.
Consistent, integrated time settings and series. Related: #313 #1047 #872 #862 #863 #787 #455 . Should be able to change a (few) settings and generate a hindcast, or extend calibration period. Level 1 chunks should generate smooth time series and be ignorant of model periods.
Driver rework/parallelization/workflow manager. Consider something like drake #976 . Do simple parallelization test?
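To make the "shim" item above more concrete, here is a minimal sketch in base R of what a transparent override mechanism could look like. Everything in it is an assumption for illustration: `apply_shims`, the shape of a shim (a plain list), and the toy table `toy_eff` are hypothetical and are not part of gcamdata.

```r
# Hypothetical shim mechanism: apply user overrides to a named list of data
# objects and keep a log of what changed. None of these names exist in gcamdata.
apply_shims <- function(all_data, shims) {
  log <- character(0)
  for (s in shims) {
    if (is.null(s$column)) {
      # No column given: replace the entire object
      all_data[[s$object]] <- s$value
      log <- c(log, sprintf("%s: replaced entire object", s$object))
    } else {
      # Column given: replace single values in the rows matching the filter
      obj <- all_data[[s$object]]
      rows <- which(obj[[s$where_col]] == s$where_val)
      old <- obj[[s$column]][rows]
      obj[[s$column]][rows] <- s$value
      all_data[[s$object]] <- obj
      log <- c(log, sprintf("%s: %s where %s == %s changed from %s to %s",
                            s$object, s$column, s$where_col, s$where_val,
                            paste(old, collapse = ","), s$value))
    }
  }
  list(data = all_data, log = log)
}

# Toy data; the values are made up
all_data <- list(
  toy_eff = data.frame(technology = c("gas", "coal"),
                       value = c(0.5, 0.4),
                       stringsAsFactors = FALSE)
)
shims <- list(
  list(object = "toy_eff", where_col = "technology", where_val = "coal",
       column = "value", value = 0.45)
)
result <- apply_shims(all_data, shims)
print(result$log)
```

Returning the log alongside the modified data is one way to keep the change visible and repeatable; a real implementation would also need to decide where in the driver the shims are applied.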

pralitp commented 5 years ago

A little bit of thought has been given to what the "families" might be. I've copied the list we came up with here as a starting point (we may well be missing some). Note that some of these are not currently in the data system at all, while others we have already decided to drop (Energy technology, Aggregate transportation):

SSPs
NDCs
Land policy (UCT, FFICT, protected land)
Regional refinements (GCAM-USA, GCAM-China)
Climate policy (carbon prices, radiative forcing targets, emissions constraints)
Agriculture impacts (AgMIP crop yields - 5 GCMs x 7 crop models)
Building impacts (multiple GCMs possible)
Energy technology (low, med, adv, no ccs)
Water (unconstrained, constrained w/ multiple surface and ground water options)
GCAM-USA variants
Simple climate models (MAGICC, Hector, None)
Transportation (aggregate and detailed)
Bio-trade
Negative emissions budget constraint
RPS
Offsets
Blend wall
Electricity structure (cooling techs)

bpbond commented 5 years ago

So the idea is that somewhere we'll maintain a matrix of outputs x families:

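| Output | Description | SSPs | NDCs | Land policy | GCAM-USA | etc. |
| --- | --- | --- | --- | --- | --- | --- |
| `a.xml` | Snoodle market | X | X | X | | X |
| `b.xml` | State gadgets | X | X | | X | X |
| `c.xml` | Q parameters | X | | X | | X |
| etc. | | X | X | X | | X |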

...that is used by the driver to determine what chunks to run?
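As a rough illustration of the question above, the matrix could be held as a logical matrix and queried for the outputs that belong to the requested families. This is only a sketch: `membership` and `outputs_for` are hypothetical names, and the rows reuse the placeholder files from the table, not real data system outputs.

```r
# Placeholder outputs and families, taken from the example table in this thread
families <- c("SSPs", "NDCs", "Land policy", "GCAM-USA")
outputs  <- c("a.xml", "b.xml", "c.xml")

# TRUE means "this output belongs to this family"
membership <- matrix(
  c(TRUE, TRUE,  TRUE,  FALSE,
    TRUE, TRUE,  FALSE, TRUE,
    TRUE, FALSE, TRUE,  FALSE),
  nrow = length(outputs), byrow = TRUE,
  dimnames = list(outputs, families)
)

# Hypothetical helper: which outputs belong to at least one wanted family?
outputs_for <- function(membership, wanted) {
  stopifnot(all(wanted %in% colnames(membership)))
  keep <- rowSums(membership[, wanted, drop = FALSE]) > 0
  rownames(membership)[keep]
}

usa_outputs  <- outputs_for(membership, "GCAM-USA")
land_outputs <- outputs_for(membership, "Land policy")
print(usa_outputs)
print(land_outputs)
```

The driver would still have to work backwards from the selected outputs to the chunks that produce them, which this sketch does not attempt.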

pralitp commented 5 years ago

@bpbond sorry, I totally meant to respond to this and forgot.

Well, honestly I'm not sure exactly what the idea is. Maybe something like that could work, but:

> ...that is used by the driver to determine what chunks to run?

We would probably want something a little better than that. Mostly I'm thinking of something like socioeconomics_SSPN.xml: we would not really want a separate (set of) chunks to generate each of those if we could just swap in a couple of CSV files for the appropriate SSP up front and then have one generic (set of) chunks to process socioeconomics.xml. On the other hand, I am sure there are times when the logic for one SSP is completely different from the logic for another... Hmm.

What about adding a "configuration" object and a "preprocessor" phase? During the preprocessor phase we can do things like swap input files or enable / disable chunks.

Maybe we leave the configuration object around for the MAKE phase, but I could see that getting ugly fast if we end up with a lot of logic like "if we are building this and that then do this, otherwise do that, and if this other thing then something else again". But we can try to keep that to a minimum.
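A minimal sketch of the configuration object and preprocessor phase described above, assuming a simple registry that lists each chunk's input files. The field names, chunk names, and file paths are all hypothetical and chosen only for illustration.

```r
# Hypothetical configuration object; field names are illustrative only
config <- list(
  ssp = "SSP2",
  disabled_chunks = c("module_gcamusa_example")
)

# Hypothetical registry of chunks and the input files each one reads
chunks <- list(
  module_socio_example   = list(inputs = c("socioeconomics/pop_SSPN.csv")),
  module_gcamusa_example = list(inputs = c("gcam-usa/states.csv"))
)

# Preprocessor phase: runs before any chunk does
preprocess <- function(chunks, config) {
  # Disable chunks the configuration switches off
  chunks <- chunks[setdiff(names(chunks), config$disabled_chunks)]
  # Swap the generic "SSPN" placeholder for the configured SSP's files
  lapply(chunks, function(ch) {
    ch$inputs <- gsub("SSPN", config$ssp, ch$inputs, fixed = TRUE)
    ch
  })
}

plan <- preprocess(chunks, config)
print(names(plan))
print(plan$module_socio_example$inputs)
```

Because all the family-specific decisions happen in `preprocess`, the chunks themselves do not need to inspect the configuration, which is one way to limit the branching logic mentioned above.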

Anyways, just some thoughts.

pralitp commented 5 years ago

Also FYI: I pushed a quick test of drake to a branch called drake-test.

bpbond commented 5 years ago

Meeting 2019-01-31. There are a number of different issues with the "Configurations" bullet above, involving running different chunks (e.g. GCAM-USA), changing datasets (SSPs), and general complexity (electricity structure). Pralit did a test of drake (above). Kate notes we need to tackle things incrementally–yes!

To do:

[ ] @bpbond: look at Pralit's branch
[x] @kvcalvin: put specific example of SSP issue/complexity here
[x] @pralitp: specific details of electricity structure problem
[ ] The GCAM-USA problem is the easiest.

kvcalvin commented 5 years ago

Re SSPs: The challenge with the SSPs is that (1) they touch on most aspects of the model, (2) the processing differs depending on the input, and (3) there are five of them.

With respect to (1), there are 62 chunks (~20%) in the data system that have the string "ssp" in them.

With respect to (2), for some of those chunks (e.g., population), we process the data the same way, just using a different set of inputs. For others (e.g., food demand, non-CO2 emissions), the processing code is different. And in some cases (e.g., electricity tech assumptions), we take variants that might be useful in other contexts (e.g., adv tech assumptions) and split the file into different parts (e.g., use adv renewable assumptions only in SSP1, discard adv assumptions for other technologies).

With respect to (3), I would expect users to either want all five or to want none.

pralitp commented 5 years ago

Re electricity structure challenge: Right now we have two variations of the electricity sector, zchunk_L223.electricity.R and zchunk_L2233.electricity_water.R, which ultimately produce electricity.xml and electricity_water.xml (the latter being a reorganization of the former that splits each technology out by its associated cooling system).

The electricity family, however, intersects with other families such as (among others) GCAM-USA and liquids limits (aka blend wall). The zchunk_L223.electricity_USA.R and zchunk_L270.limits.R chunks each need outputs from the electricity sector to generate outputs of their own, ultimately producing electricity_USA.xml and liquids_limits.xml. For example:

    # L270.CreditInput_elec: minicam-energy-input of oil credits for electricity techs
    A23.globaltech_eff %>%
      fill_exp_decay_extrapolate(MODEL_YEARS) %>%
      mutate(value = round(value, energy.DIGITS_EFFICIENCY)) %>%
      filter(subsector == "refined liquids") %>%
      mutate(minicam.energy.input = "oil-credits",
             # note we are converting the efficiency to a coefficient here
             coefficient = energy.OILFRACT_ELEC / value) %>%
      select(-value) %>%
      rename(sector.name = supplysector,
             subsector.name = subsector) ->
      L270.CreditInput_elec

However, when the electricity water variant is used, we also need to generate water_elec_liquids_limits.xml:

    L270.CreditInput_elec %>%
      left_join(L2233.TechMap, by = c("sector.name" = "from.supplysector",
                                      "subsector.name" = "from.subsector",
                                      "technology" = "from.technology")) %>%
      mutate(sector.name = to.supplysector,
             subsector.name = to.subsector,
             technology = to.technology) ->
      L2233.CreditInput_elec

But of course the liquids limits family and the GCAM-USA family also intersect, so we also have a liquids_limits_USA.xml. And in principle GCAM-USA should also have permutations for the electricity water variant, so we should have an electricity_water_USA.xml and then also a water_elec_liquids_limits_USA.xml.

And of course in this example I ignored other families such as SSPs or NonCO2s, which in principle should also have their own permutations.
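To show how quickly these permutations grow, here is a small sketch that enumerates every on/off combination of the three optional families in this example. The labels are illustrative and do not follow the actual file naming; the point is only the count, which doubles with each additional on/off family.

```r
# Optional families layered on top of the base electricity output
# (labels are illustrative, not real file names)
optional <- c("water", "liquids_limits", "USA")

# Every on/off combination of the optional families
combos <- expand.grid(rep(list(c(FALSE, TRUE)), length(optional)))
names(combos) <- optional

# A readable label for each combination
labels <- apply(combos, 1, function(r) {
  paste(c("electricity", optional[r]), collapse = "+")
})

n_variants <- nrow(combos)
print(labels)
print(n_variants)
```

With three optional families there are already eight combinations, and treating two more families as simple on/off switches would bring that to 32.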

I think in the ideal scenario we would generate just one electricity.xml, and somehow add or swap out the tibbles that feed that XML based on which families are configured.
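One way this could look, as a sketch only: start from a base set of tables and let each configured family apply a step that either swaps a table or adds one, so a single output is built from whatever tables remain at the end. The table contents, the mapping, and the names `family_steps` and `build_tables` are all hypothetical; base R data frames stand in for tibbles here.

```r
# Toy base table; the values are made up for illustration
base_tables <- list(
  elec = data.frame(technology = c("gas", "coal"),
                    coefficient = c(2.0, 2.5),
                    stringsAsFactors = FALSE)
)

# Hypothetical mapping from each technology to its cooling-system variants
tech_map <- data.frame(
  technology    = c("gas", "gas", "coal"),
  to.technology = c("gas (dry)", "gas (recirculating)", "coal (recirculating)"),
  stringsAsFactors = FALSE
)

family_steps <- list(
  # "water" swaps the table for one split out by cooling technology
  water = function(tables) {
    joined <- merge(tables$elec, tech_map, by = "technology")
    tables$elec <- data.frame(technology = joined$to.technology,
                              coefficient = joined$coefficient,
                              stringsAsFactors = FALSE)
    tables
  },
  # "USA" adds an extra table alongside the existing ones
  USA = function(tables) {
    tables$elec_USA <- data.frame(region = c("CA", "TX"),
                                  stringsAsFactors = FALSE)
    tables
  }
)

# Apply the steps for the configured families, in order
build_tables <- function(tables, configured) {
  for (fam in configured) tables <- family_steps[[fam]](tables)
  tables
}

default_build <- build_tables(base_tables, character(0))
water_usa     <- build_tables(base_tables, c("water", "USA"))
print(names(water_usa))
print(water_usa$elec)
```

The order in which family steps run matters as soon as one step depends on another's output, so a real design would need an explicit ordering rule rather than relying on the order of `configured`.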

JGCRI / gcamdata

Data system strategy meetings 10 and 31 January 2019 #1080