catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Convert EIA entity / annual output tables to Dagster assets #2433

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Some of these should be @asset wrappers around our existing code, others should potentially be SQL View versions.

We might end up breaking this into one "asset-wrapper" and one "sql view" issue, but that depends on how much work this turns out to be.

- [x] `pu_eia` (became `denorm_plants_utilites_eia` for uniformity and to match FERC1) 
- [x] `denorm_utilities_eia`
- [x] `denorm_plants_eia` 
- [x] `denorm_generators_eia`
- [x] `denorm_boilers_eia`
- [x] Add filtering between dates into dynamic PudlTable outputs
- [x] Integrate output function parameters into `op_config` for assets.

Dependencies and existing functions to be replaced

e-belfer commented 1 year ago

Some of our output tables have boolean 'options' that we can use to change the exact nature of the output. fill_tech_desc and unit_ids are some of the ones that appear in the EIA entity tables, but I imagine there will also be many others as we work through the other tables (e.g. freq).

  1. Option 1: Set the functions to produce the same output as the default value of these booleans. Track down any places where we use the non-default setting and try to see if the change will break it (e.g. in this instance, the only place where fill_tech_desc is false is in pudl.test.integration.glue_test). May also make trouble for other PUDL users who depend on existing assets with non-standard settings.
  2. Option 2: Make different assets for the True and False versions of key assets (e.g. denorm_generators_eia_filltech). Will preserve the variety of options but proliferate assets.
  3. Option 3: Figure out how to feed in the boolean into the dag process. This will produce the potential for multiple different outputs with the same name, and seems least ideal.

@bendnorman @zaneselvans @cmgosnell Perhaps you have thoughts here? Some of these such as frequency we'll certainly have to preserve. Others, maybe not?

cmgosnell commented 1 year ago

This may be controversial, but I think all of these bools should be set to True as the default and we should go with option 1.

Most of these bools are some version of "do you want to flesh out ___". And I think we want to flesh things out. If there is clear user communication about the difference between "output" or whatever we are calling our tables vs our clean/core current db tables, it should be clear to users that these output tables have imputations/filled in missing values/etc etc.

zaneselvans commented 1 year ago

Many of these options determine what kind of data repair is done. IIRC the last time @cmgosnell and I talked about it, we felt like we wanted the default to be 🔥FIX ALL THE THINGS🔥 aka Option 1. We are necessarily going to be pruning the possible outputs by hard-coding some default values that will be used to populate the database tables, but this will both make the data more accessible, and avoid having a whole lot of different outputs running around in different contexts. The functions themselves will stick around in the source code, so they can still be accessed if need be.

In a few cases, I think we are going to want to output more than 1 of the possible values, and the time based aggregations will be one of them. For the tables that are currently aggregatable, I think we are going to want:

And then there will be downstream assets whose aggregation depends on the frequency, e.g. the mcoe calculation requires either monthly or annual aggregation, so there'll probably be both a monthly MCOE and an annual MCOE table. Though I'm not 100% sure this is the right way to go -- it could be that aggregation is really easy, and so we just produce a raw and a monthly output, and then make an SQL view that aggregates the monthly to annual. But I think that's for a later iteration.

I think there are also some optional flags that are more mature / ready for public consumption than others. Filling in the generator technologies and BA codes are good! But I think the unit_id guessing is still kinda janky (my bad).

e-belfer commented 1 year ago

This makes sense to me. I'm hearing, aim to set booleans to defaults for all existing transforms within output tables, with the general goal of encoding all fleshed out options where they're stable and ready for public consumption (e.g. True if they're true and we generally run them, False if we've deemed them experimental enough to not be ready for general public use).

Re: the time question, absolutely will become an issue to navigate shortly (aka in a few days), and going with some variation on Option 2 makes sense here.

zaneselvans commented 1 year ago

At a later point, I think we'll want to revisit how we manage filling in missing data and may separate it from the individual table generation, but for now we're just trying to get everything we have into the DB so everyone can access it, and we don't have to distribute software.

e-belfer commented 1 year ago

Closed in #2519