Closed zaneselvans closed 1 year ago
Some of our output tables have boolean 'options' that we can use to change the exact nature of the output. fill_tech_desc
and unit_ids
are some of the ones that appear in the EIA entity tables, but I imagine there will also be many others as we work through the other tables (e.g. freq
).
fill_tech_desc
is false is in pudl.test.integration.glue_test
). May also make trouble for other PUDL users who depend on existing assets with non-standard settings.denorm_generators_eia_filltech
). Will preserve the variety of options but proliferate assets.@bendnorman @zaneselvans @cmgosnell Perhaps you have thoughts here? Some of these such as frequency we'll certainly have to preserve. Others, maybe not?
This may be controversial, but I think all of these bools should be set to True
as the default and we should go with option 1.
Most of these bools are some version of "do you want to flesh out ___". And I think we want to flesh things out. If there is clear user communication about the difference between "output" or whatever we are calling our tables vs our clean/core current db tables, it should be clear to users that these output tables have imputations/filled in missing values/etc etc.
Many of these options determine what kind of data repair is done. IIRC the last time @cmgosnell and I talked about it, we felt like we wanted the default to be 🔥FIX ALL THE THINGS🔥 aka Option 1. We are necessarily going to be pruning the possible outputs by hard-coding some default values that will be used to populate the database tables, but this will both make the data more accessible, and avoid having a whole lot of different outputs running around in different contexts. The functions themselves will stick around in the source code, so they can still be accessed if need be.
In a few cases, I think we are going to want to output more than 1 of the possible values, and the time based aggregations will be one of them. For the tables that are currently aggregatable, I think we are going to want:
And then there will be downstream assets whose aggregation depends on the frequency, e.g. the mcoe
calculation requires either monthly or annual aggregation, so there'll probably be both a monthly MCOE and an annual MCOE table. Though I'm not 100% sure this is the right way to go -- it could be that aggregation is really easy, and so we just produce a raw and a monthly output, and then make an SQL view that aggregates the monthly to annual. But I think that's for a later iteration.
I think there are also some optional flags that are more mature / ready for public consumption than others. Filling in the generator technologies and BA codes are good! But I think the unit_id
guessing is still kinda janky (my bad).
This makes sense to me. I'm hearing, aim to set booleans to defaults for all existing transforms within output tables, with the general goal of encoding all fleshed out options where they're stable and ready for public consumption (e.g. True if they're true and we generally run them, False if we've deemed them experimental enough to not be ready for general public use).
Re: the time question, absolutely will become an issue to navigate shortly (aka in a few days), and going with some variation on Option 2 makes sense here.
At a later point, I think we'll want to revisit how we manage filling in missing data and may separate it from the individual table generation, but for now we're just trying to get everything we have into the DB so everyone can access it, and we don't have to distribute software.
Closed in #2519
Some of these should be
@asset
wrappers around our existing code, others should potentially be SQL View versions.We might end up breaking this into one "asset-wrapper" and one "sql view" issue, but that depends on how much work this turns out to be.
Dependencies and existing functions to be replaced
denorm_utilities_eia
is the dagster asset version of the existingutils_eia860
output table. It is currently generated throughpudl.output.eia860.utilities_eia860
and called inpudltabl.utils_eia860
. It depends on these normalized tables:utilities_entity_eia
,utilities_eia860
,utilities_eia
denorm_plants_eia
is the dagster asset version of the existingplants_eia860
output table. It is currently generated throughpudl.output.eia860.plants_eia860
and called inpudltabl.plants_eia860
. It depends on these normalized tables:plants_entity_eia
,plants_eia860
,plants_eia
denorm_generators_eia
is the dagster asset version of the existinggens_eia860
output table. It is currently generated throughpudl.output.eia860.generators_eia860
and called inpudltabl.gens_eia860
. It depends on these normalized tables:generators_entity_eia
,generators_eia860
,plants_entity_eia
,boiler_generator_assn_eia860
,pu_eia
denorm_boilers_eia
is the dagster asset version of the existingboil_eia860
output table. It is currently generated throughpudl.output.eia860.boilers_eia860
and called inpudltabl.boil_eia860
. It depends on these normalized tables:generators_entity_eia
,generators_eia860
,plants_entity_eia
,boiler_generator_assn_eia860
,pu_eia