catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars, 106 forks

Simplify dependencies in entity resolution / harvesting #2387

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Currently all clean EIA-860 and EIA-923 tables are inputs into our entity resolution (harvesting) process as part of the eia_transform multi-asset, and all of the final EIA-860 and EIA-923 tables are also outputs generated by the eia_transform asset. I think this creates the illusion of a dependency bottleneck that doesn't need to be there.

Only the "harvested" entity and association tables need to be outputs from the complex harvesting process. The fact that all of the tables pass through this process is partly an artifact of our passing around dictionaries of all processed dataframes as inputs/outputs (which obscures real dependencies and makes accidental mutation easy).

Instead, we could continue to pass all the inputs in, but do the work of dropping the columns that don't belong in each table separately from entity resolution.
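A minimal sketch of that separation, in plain Python rather than pandas or Dagster so it stays self-contained. The function names, `PLANT_ENTITY_COLS`, and the "first non-null value wins" rule are hypothetical placeholders, not the actual PUDL harvesting logic:

```python
# Hypothetical: columns that belong in the harvested plant entity table.
PLANT_ENTITY_COLS = {"plant_name", "state"}
PLANT_ID = "plant_id_eia"


def harvest_plant_entity(clean_tables):
    """Read every clean table, return ONLY the harvested entity table.

    Inputs are read but never mutated. The real harvesting picks the most
    consistent value; here the first non-null value wins (placeholder).
    """
    seen = {}
    for rows in clean_tables.values():
        for row in rows:
            if PLANT_ID not in row:
                continue
            record = seen.setdefault(row[PLANT_ID], {PLANT_ID: row[PLANT_ID]})
            for col in PLANT_ENTITY_COLS:
                if row.get(col) is not None:
                    record.setdefault(col, row[col])
    return [seen[key] for key in sorted(seen)]


def drop_entity_cols(rows):
    """Finalize one non-entity table on its own, independent of harvesting."""
    return [
        {k: v for k, v in row.items() if k not in PLANT_ENTITY_COLS}
        for row in rows
    ]


# Illustrative inputs only; the values are made up.
clean = {
    "generators_eia860": [
        {"plant_id_eia": 1, "plant_name": "Alpha", "state": "CO", "capacity_mw": 10.0},
    ],
    "generation_fuel_eia923": [
        {"plant_id_eia": 1, "plant_name": "Alpha", "state": None, "net_generation_mwh": 5.0},
        {"plant_id_eia": 2, "plant_name": "Beta", "state": "TX", "net_generation_mwh": 7.0},
    ],
}

plants_entity = harvest_plant_entity(clean)
final_generators = drop_entity_cols(clean["generators_eia860"])
```

With this shape, each final table depends only on its own clean table, and only the entity table depends on all of the inputs.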

I think the tables that actually need to be outputs from the harvesting process are:

Tables that depend on harvested values:

Breaking this long-running job out into smaller pieces will both clarify what actual dependencies exist, and allow more work to be done in parallel, speeding the ETL up.

# In Scope
- [x] Determine real outputs of `eia_transform` function (entity & annual tables only)
- [x] Create harvested asset factory for entities (plant, boiler, generator, utility)
- [x] Create a separate asset to compile `boiler_generator_assn_eia860`
- [x] Create assets that do final operations on non-entity tables (enforce_schema + ???)
- [x] Get all tables loading into the DB again
- [x] Check that all FK constraints are still respected
- [x] Remove `keep_cols` from harvesting process and just... keep all the cols all the time.
- [x] Run `tox -e validate` locally to check whether we've changed the data.
- [x] Run the full data validation tests locally.
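As a rough illustration of the "harvested asset factory" item, here is a sketch of a factory that builds one harvesting function per entity. It is plain Python so that it runs without Dagster; in the real ETL each returned function would presumably be registered as its own asset. `ENTITY_ID_COLS`, `harvested_entity_factory`, and the ID column choices are assumptions for illustration, and the body only collects distinct IDs rather than doing real entity resolution:

```python
from typing import Callable

Rows = list[dict]

# Hypothetical mapping from entity name to the columns that identify it.
ENTITY_ID_COLS = {
    "plants": ["plant_id_eia"],
    "utilities": ["utility_id_eia"],
    "generators": ["plant_id_eia", "generator_id"],
    "boilers": ["plant_id_eia", "boiler_id"],
}


def harvested_entity_factory(entity: str) -> Callable[[dict[str, Rows]], Rows]:
    """Build a harvesting function for a single entity."""
    id_cols = ENTITY_ID_COLS[entity]

    def harvest(clean_tables: dict[str, Rows]) -> Rows:
        # Collect each distinct ID tuple seen in any table that has all ID columns.
        ids = set()
        for rows in clean_tables.values():
            for row in rows:
                if all(row.get(col) is not None for col in id_cols):
                    ids.add(tuple(row[col] for col in id_cols))
        return [dict(zip(id_cols, key)) for key in sorted(ids)]

    harvest.__name__ = f"harvested_{entity}"
    return harvest


# One independent function per entity, instead of one function emitting every table.
harvesters = {entity: harvested_entity_factory(entity) for entity in ENTITY_ID_COLS}
```

Because each harvester is a separate function with its own output, a scheduler can see that the four entity tables do not depend on one another.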

# Out of Scope

jdangerx commented 1 year ago

@zaneselvans do you mind throwing a scope checklist in the issue description when you get a chance?