Convert EIA entity / annual output tables to Dagster assets

zaneselvans commented 1 year ago

Some of these should be @asset wrappers around our existing code, others should potentially be SQL View versions.

We might end up breaking this into one "asset-wrapper" and one "sql view" issue, but that depends on how much work this turns out to be.

- [x] `pu_eia` (became `denorm_plants_utilites_eia` for uniformity and to match FERC1) 
- [x] `denorm_utilities_eia`
- [x] `denorm_plants_eia` 
- [x] `denorm_generators_eia`
- [x] `denorm_boilers_eia`
- [x] Add filtering between dates into dynamic PudlTable outputs
- [x] Integrate output function parameters into `op_config` for assets.

Dependencies and existing functions to be replaced

denorm_utilities_eia is the dagster asset version of the existing utils_eia860 output table. It is currently generated through pudl.output.eia860.utilities_eia860 and called in pudltabl.utils_eia860. It depends on these normalized tables: utilities_entity_eia, utilities_eia860, utilities_eia
denorm_plants_eia is the dagster asset version of the existing plants_eia860 output table. It is currently generated through pudl.output.eia860.plants_eia860 and called in pudltabl.plants_eia860. It depends on these normalized tables: plants_entity_eia, plants_eia860, plants_eia
denorm_generators_eia is the dagster asset version of the existing gens_eia860 output table. It is currently generated through pudl.output.eia860.generators_eia860 and called in pudltabl.gens_eia860. It depends on these normalized tables: generators_entity_eia, generators_eia860, plants_entity_eia, boiler_generator_assn_eia860, pu_eia
denorm_boilers_eia is the dagster asset version of the existing boil_eia860 output table. It is currently generated through pudl.output.eia860.boilers_eia860 and called in pudltabl.boil_eia860. It depends on these normalized tables: generators_entity_eia, generators_eia860, plants_entity_eia, boiler_generator_assn_eia860, pu_eia

e-belfer commented 1 year ago

Some of our output tables have boolean 'options' that we can use to change the exact nature of the output. fill_tech_desc and unit_ids are some of the ones that appear in the EIA entity tables, but I imagine there will also be many others as we work through the other tables (e.g. freq).

Option 1: Set the functions to produce the same output as the default value of these booleans. Track down any places where we use the non-default setting and try to see if the change will break it (e.g. in this instance, the only place where fill_tech_desc is false is in pudl.test.integration.glue_test). May also make trouble for other PUDL users who depend on existing assets with non-standard settings.
Option 2: Make different assets for the True and False versions of key assets (e.g. denorm_generators_eia_filltech). Will preserve the variety of options but proliferate assets.
Option 3: Figure out how to feed in the boolean into the dag process. This will produce the potential for multiple different outputs with the same name, and seems least ideal.

@bendnorman @zaneselvans @cmgosnell Perhaps you have thoughts here? Some of these such as frequency we'll certainly have to preserve. Others, maybe not?

cmgosnell commented 1 year ago

This may be controversial, but I think all of these bools should be set to True as the default and we should go with option 1.

Most of these bools are some version of "do you want to flesh out ___". And I think we want to flesh things out. If there is clear user communication about the difference between "output" or whatever we are calling our tables vs our clean/core current db tables, it should be clear to users that these output tables have imputations/filled in missing values/etc etc.

zaneselvans commented 1 year ago

Many of these options determine what kind of data repair is done. IIRC the last time @cmgosnell and I talked about it, we felt like we wanted the default to be 🔥FIX ALL THE THINGS🔥 aka Option 1. We are necessarily going to be pruning the possible outputs by hard-coding some default values that will be used to populate the database tables, but this will both make the data more accessible, and avoid having a whole lot of different outputs running around in different contexts. The functions themselves will stick around in the source code, so they can still be accessed if need be.

In a few cases, I think we are going to want to output more than 1 of the possible values, and the time based aggregations will be one of them. For the tables that are currently aggregatable, I think we are going to want:

raw (unaggregated)
monthly
annual

And then there will be downstream assets whose aggregation depends on the frequency, e.g. the mcoe calculation requires either monthly or annual aggregation, so there'll probably be both a monthly MCOE and an annual MCOE table. Though I'm not 100% sure this is the right way to go -- it could be that aggregation is really easy, and so we just produce a raw and a monthly output, and then make an SQL view that aggregates the monthly to annual. But I think that's for a later iteration.

I think there are also some optional flags that are more mature / ready for public consumption than others. Filling in the generator technologies and BA codes are good! But I think the unit_id guessing is still kinda janky (my bad).

e-belfer commented 1 year ago

This makes sense to me. I'm hearing, aim to set booleans to defaults for all existing transforms within output tables, with the general goal of encoding all fleshed out options where they're stable and ready for public consumption (e.g. True if they're true and we generally run them, False if we've deemed them experimental enough to not be ready for general public use).

Re: the time question, absolutely will become an issue to navigate shortly (aka in a few days), and going with some variation on Option 2 makes sense here.

zaneselvans commented 1 year ago

At a later point, I think we'll want to revisit how we manage filling in missing data and may separate it from the individual table generation, but for now we're just trying to get everything we have into the DB so everyone can access it, and we don't have to distribute software.

e-belfer commented 1 year ago

Closed in #2519

catalyst-cooperative / pudl

Convert EIA entity / annual output tables to Dagster assets #2433

Dependencies and existing functions to be replaced