The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
Currently all clean EIA-860 and EIA-923 tables are inputs into our entity resolution (harvesting) process as part of the eia_transform multi-asset, and all of the final EIA-860 and EIA-923 tables are also outputs generated by the eia_transform asset. I think this creates the illusion dependency bottleneck which doesn't need to be there.
Only the "harvested" entity and association tables need to be outputs from the complex harvesting process. The fact that all of the tables pass through this process is partly an artifact of our passing around dictionaries of all processed dataframes as inputs/outputs (which obscures real dependencies and makes accidental mutation easy).
Instead, we could continue to pass all the inputs in, but do the work of dropping the columns that don't belong in them separate from entity resolution.
I think the tables that actually need to be outputs from the harvesting process are:
utilities_entity_eia & utilities_eia860
plants_entity_eia & plants_eia860
generators_entity_eia & generators_eia860
boilers_entity_eia & boilers_eia860
Tables that depend on harvested values:
The boiler_generator_assn_eia860 asset could be split out since it only depends on a couple of input tables, and produces a single output table.
The cleanup and backfilling of balancing_authority names and IDs could also be split out from eia_transform and run in parallel with the boiler_generator_assn_eia860 processing.
The encoding of the entity tables could also happen in parallel after the harvesting has taken place.
Breaking this long-running job out into smaller pieces will both clarify what actual dependencies exist, and allow more work to be done in parallel, speeding the ETL up.
# In Scope
- [x] Determine real outputs of `eia_transform` function (entity & annual tables only)
- [x] Create harvested asset factory for entities (plant, boiler, generator, utility)
- [x] Create a separate asset to compile `boiler_generator_assn_eia860`
- [x] Create assets that do final operations on non-entity tables (enforce_schema + ???)
- [x] Get all tables loading into the DB again
- [x] Check that all FK constraints are still respected
- [x] Remove `keep_cols` from harvesting process and just... keep all the cols all the time.
- [x] Run `tox -e validate` locally to check whether we've changed the data.
- [x] Run the full data validation tests locally.
Out of Scope
Refactoring the BGA function to be less of a mess! Don't do it!
Currently all clean EIA-860 and EIA-923 tables are inputs into our entity resolution (harvesting) process as part of the
eia_transform
multi-asset, and all of the final EIA-860 and EIA-923 tables are also outputs generated by theeia_transform
asset. I think this creates the illusion dependency bottleneck which doesn't need to be there.Only the "harvested" entity and association tables need to be outputs from the complex harvesting process. The fact that all of the tables pass through this process is partly an artifact of our passing around dictionaries of all processed dataframes as inputs/outputs (which obscures real dependencies and makes accidental mutation easy).
Instead, we could continue to pass all the inputs in, but do the work of dropping the columns that don't belong in them separate from entity resolution.
I think the tables that actually need to be outputs from the harvesting process are:
utilities_entity_eia
&utilities_eia860
plants_entity_eia
&plants_eia860
generators_entity_eia
&generators_eia860
boilers_entity_eia
&boilers_eia860
Tables that depend on harvested values:
boiler_generator_assn_eia860
asset could be split out since it only depends on a couple of input tables, and produces a single output table.balancing_authority
names and IDs could also be split out fromeia_transform
and run in parallel with theboiler_generator_assn_eia860
processing.Breaking this long-running job out into smaller pieces will both clarify what actual dependencies exist, and allow more work to be done in parallel, speeding the ETL up.
Out of Scope