Compare dagster-based ETL output with non-dagster-based ETL output

jdangerx commented 1 year ago

Scope:

[x] merge dev into dagster_asset_etl
[ ] write script (potentially using sqlitediff) to compare legacy/dagster pudl.sqlites.

bendnorman commented 1 year ago

`TABLE_NAME_MAP` change

The TABLE_NAME_MAP data structure has changed so we'll need to update how we use it in pudl.extract.ferc1.create_raw_ferc1_assets() and xbrl_metadata_json. The xbrl and dbf key values can now be a string or a list of strings because some tables have multiple raw table dependencies. The income_statement_ferc1 and electricity_sales_by_rate_schedule_ferc1 tables have multiple raw table dependencies.
create_raw_ferc1_assets iterates through the raw dbf tables in TABLE_NAME_MAP and creates SourceAssets. SourceAssets are also created for each instant and duration xbrl table.
TABLE_NAME_MAP is also used in the ferc1_transform_asset_factory to programmatically generate the dependencies between the raw and clean assets.
I think we have two options here:
- Add some logic in ferc1_transform_asset_factory and create_raw_ferc1_assets functions to handle table names and lists of table names.
- Create a new data structure that just stores all raw dbf and ferc1 table names {"dbf": ["table1",... "tableN"], "xbrl": ["table1",... "tableN"]}. Use this data structure to create the SourceAssets. Instead of using ferc1_transform_asset_factory to programmatically create the transform assets, manually create assets for each clean table. This way we are storing the raw -> clean table dependencies in the asset function args instead of in TABLE_NAME_MAP.

New FERC Tables

There are some new FERC tables from dev. I think we just need to update ferc1_tfr_classes in pudl.transform.ferc1.create_ferc1_transform_assets()

ETL module changes?

It doesn't look like much has changed in the now deprecated src/pudl/etl.py file. We'll have to make sure these recent etl.py changes to the metadata and static tables are retained in this merge.

jdangerx commented 1 year ago

For the TABLE_NAME_MAP stuff - it sounds like 1) create_raw_ferc1_assets would like just a list of all the dbf and xbrl tables that we could possibly want for transform purposes 2) ferc1_transform_asset_factory would like some sort of map from cleaned table to all the raw tables it depends on, or we could abandon it and encode all these dependencies manually

Is that right?

Right now, TABLE_NAME_MAP[_FERC1] provides both a list of all the raw tables as well as the dependencies, which I think is kind of nice - you don't have to worry about adding a dependency for a transformation and then forgetting to add that dependency to the extraction process.

On the other hand, I think dagster sort of wants us to store the dependencies in the asset args. I guess we're doing that anyways when we generate the assets from the map, so maybe this is kind of a moot point.

I think the resulting code structure is roughly "fine" either way, and would push, as always, for "shortest path to getting this merge sorted 😅 "

bendnorman commented 1 year ago

Yeah, I think sticking with the current structure of TABLE_NAME_MAP[_FERC1] makes sense for now. We can change this easily down the line.

bendnorman commented 1 year ago

Once #2318 is merged into #2104 we can compare the database outputs. I think it makes sense to resolve #2296 prior to comparing the outputs so the ETL runs happen in the same environment and the outputs are accessible in cloud buckets. We can use the sqldiff tool to compare the outputs. Binaries for linux and Mac.

zschira commented 1 year ago

I ran sqldiff with the --summary option on the most recent dagster-nightly-builds and dev&prefix=&forceOnObjectsSortingFiltering=false) nightly build outputs with the following results:

balance_sheet_assets_ferc1: 0 changes, 0 inserts, 0 deletes, 238694 unchanged
balance_sheet_liabilities_ferc1: 0 changes, 0 inserts, 0 deletes, 203640 unchanged
balancing_authorities_eia: 0 changes, 0 inserts, 0 deletes, 88 unchanged
boiler_fuel_eia923: 0 changes, 0 inserts, 0 deletes, 1415748 unchanged
boiler_generator_assn_eia860: 0 changes, 0 inserts, 0 deletes, 129869 unchanged
boiler_generator_assn_types_eia: 0 changes, 0 inserts, 0 deletes, 2 unchanged
boilers_entity_eia: 0 changes, 0 inserts, 0 deletes, 6848 unchanged
cash_flow_ferc1: 0 changes, 0 inserts, 0 deletes, 281415 unchanged
coalmine_eia923: 0 changes, 0 inserts, 0 deletes, 4563 unchanged
coalmine_types_eia: 0 changes, 0 inserts, 0 deletes, 5 unchanged
contract_types_eia: 0 changes, 0 inserts, 0 deletes, 4 unchanged
data_maturities: 0 changes, 0 inserts, 0 deletes, 4 unchanged
datasources: 5 changes, 0 inserts, 0 deletes, 0 unchanged
demand_hourly_pa_ferc714: missing from second database
depreciation_amortization_summary_ferc1: 0 changes, 0 inserts, 0 deletes, 203525 unchanged
electric_energy_dispositions_ferc1: 0 changes, 0 inserts, 0 deletes, 24348 unchanged
electric_energy_sources_ferc1: 0 changes, 0 inserts, 0 deletes, 34784 unchanged
electric_operating_revenues_ferc1: 0 changes, 0 inserts, 0 deletes, 70958 unchanged
electric_opex_ferc1: 0 changes, 0 inserts, 0 deletes, 537046 unchanged
electric_plant_depreciation_changes_ferc1: 0 changes, 0 inserts, 0 deletes, 227195 unchanged
electric_plant_depreciation_functional_ferc1: 0 changes, 0 inserts, 0 deletes, 136362 unchanged
electricity_sales_by_rate_schedule_ferc1: 0 changes, 0 inserts, 0 deletes, 279280 unchanged
energy_sources_eia: 0 changes, 0 inserts, 0 deletes, 42 unchanged
entity_types_eia: missing from second database
epacamd_eia: 0 changes, 3900 inserts, 3875 deletes, 2532 unchanged
epacems_output: missing from second database
ferc_accounts: 0 changes, 0 inserts, 0 deletes, 97 unchanged
fuel_ferc1: 0 changes, 0 inserts, 0 deletes, 48841 unchanged
fuel_receipts_costs_aggs_eia: 0 changes, 0 inserts, 0 deletes, 500537 unchanged
fuel_receipts_costs_eia923: 0 changes, 0 inserts, 0 deletes, 608565 unchanged
fuel_transportation_modes_eia: 0 changes, 0 inserts, 0 deletes, 10 unchanged
fuel_types_aer_eia: 0 changes, 0 inserts, 0 deletes, 18 unchanged
generation_eia923: 0 changes, 0 inserts, 0 deletes, 604906 unchanged
generation_fuel_eia923: 0 changes, 0 inserts, 0 deletes, 2683865 unchanged
generation_fuel_nuclear_eia923: 0 changes, 0 inserts, 0 deletes, 24617 unchanged
generators_eia860: 0 changes, 0 inserts, 0 deletes, 523563 unchanged
generators_entity_eia: 0 changes, 0 inserts, 0 deletes, 36652 unchanged
hourly_emissions_epacems: missing from second database
income_statement_ferc1: 0 changes, 0 inserts, 0 deletes, 318644 unchanged
momentary_interruptions_eia: 0 changes, 0 inserts, 0 deletes, 3 unchanged
operational_status_eia: 0 changes, 0 inserts, 0 deletes, 14 unchanged
other_regulatory_liabilities_ferc1: 0 changes, 0 inserts, 0 deletes, 45433 unchanged
ownership_eia860: 0 changes, 0 inserts, 0 deletes, 84440 unchanged
plant_in_service_ferc1: 0 changes, 0 inserts, 0 deletes, 311890 unchanged
plant_parts_eia: missing from second database
plants_eia: 0 changes, 0 inserts, 0 deletes, 15525 unchanged
plants_eia860: 191 changes, 0 inserts, 0 deletes, 185360 unchanged
plants_entity_eia: 50 changes, 0 inserts, 0 deletes, 15464 unchanged
plants_ferc1: 0 changes, 0 inserts, 0 deletes, 7425 unchanged
plants_hydro_ferc1: 0 changes, 0 inserts, 0 deletes, 6796 unchanged
plants_pudl: 0 changes, 0 inserts, 0 deletes, 17227 unchanged
plants_pumped_storage_ferc1: 0 changes, 0 inserts, 0 deletes, 544 unchanged
plants_small_ferc1: 0 changes, 0 inserts, 0 deletes, 16235 unchanged
plants_steam_ferc1: 3396 changes, 10 inserts, 0 deletes, 27303 unchanged
political_subdivisions: 0 changes, 0 inserts, 0 deletes, 69 unchanged
power_purchase_types_ferc1: 0 changes, 0 inserts, 0 deletes, 9 unchanged
prime_movers_eia: 0 changes, 0 inserts, 0 deletes, 24 unchanged
purchased_power_ferc1: 0 changes, 0 inserts, 0 deletes, 197523 unchanged
reporting_frequencies_eia: 0 changes, 0 inserts, 0 deletes, 3 unchanged
respondent_id_ferc714: missing from second database
retained_earnings_appropriations_ferc1: missing from second database
retained_earnings_ferc1: 0 changes, 0 inserts, 0 deletes, 86723 unchanged
sector_consolidated_eia: 0 changes, 0 inserts, 0 deletes, 7 unchanged
steam_plant_types_eia: 0 changes, 0 inserts, 0 deletes, 4 unchanged
transmission_statistics_ferc1: 0 changes, 0 inserts, 0 deletes, 586836 unchanged
utilities_eia: 0 changes, 0 inserts, 0 deletes, 13679 unchanged
utilities_eia860: 0 changes, 0 inserts, 0 deletes, 119365 unchanged
utilities_entity_eia: 34 changes, 0 inserts, 0 deletes, 13637 unchanged
utilities_ferc1: 0 changes, 0 inserts, 0 deletes, 420 unchanged
utilities_ferc1_dbf: 0 changes, 0 inserts, 0 deletes, 411 unchanged
utilities_ferc1_xbrl: 0 changes, 0 inserts, 0 deletes, 269 unchanged
utilities_pudl: 0 changes, 0 inserts, 0 deletes, 14117 unchanged
utility_analysis: missing from second database
utility_plant_assn: 0 changes, 0 inserts, 0 deletes, 16127 unchanged
utility_plant_summary_ferc1: 0 changes, 0 inserts, 0 deletes, 168265 unchanged

Further Work

The following tables contain changes that need to be investigated:

[x] epacamd_eia
- There's a different number of records in each DB. The DB produced by dev contains 6432 and the on from dagster contains 6407. Need more investigation to determine the cause.
[x] plants_eia860
- Looking closer at the changes, there are many records that have different values for balancing_authority_code_eia. Needs further investigation
[x] plants_entity_eia
- There are a number of records with mismatched plant_name_eia values. Needs further investigation.
[ ] plants_steam_ferc1
[ ] utilities_entity_eia

jdangerx commented 1 year ago

To confirm that the changes are due to changes in dev since the last merge of dev into dagster-asset-etl - we can run the sqldiff against an old commit in dev.

jdangerx commented 1 year ago

In dagster-asset-etl, we create empty tables in the sqlite database and then populate - this leads to some tables being empty. We should not generate those tables. There are "*_disabled" etl_groups in dev that we can use to do this filtering, though EPA CEMS might be an outlier here as well.

e-belfer commented 1 year ago

The ten inserts in plant_steam_ferc1 almost certainly come from the spot-fix PR that was merged into dev a few weeks ago, which added ten records to the table.

zschira commented 1 year ago

Just ran the comparison on the dev commit we merged into dagster (31b074e). With this comparison, there are only 2 tables with changes!

plants_entity_eia: 50 changes, 0 inserts, 0 deletes, 15464 unchanged utilities_entity_eia: 34 changes, 0 inserts, 0 deletes, 13637 unchanged

Looking at these changes, they are all contained to plant_name_eia in the plants_entity_eia table, and utility_name_eia in the utilities_entity_eia. Many of these changes seem to be slight spelling changes, like: Novel Holmquist Solar LLC CSG -> Novel Holmquist Solar LLC. I'll keep investigating to see if I can find where these diffs are coming from, and we'll of course need to do this again once the new dev->dagster merge is complete.

zschira commented 1 year ago

Ran the comparisons again with the updated dagster version. I'm still seeing the same sort of changes in the plants_entity_eia and utilities_entity_eia tables. Otherwise everything else is identical, so I'll continue to investigate these two tables.

zaneselvans commented 1 year ago

The same plants & utilities report varying names from year to year, and we have historically treated them as human readable labels and I think right now we may just be grabbing the most recently reported name, based on some sorted dataframe. So maybe the sorting is happening differently or not at all on one of the two branches?

There are also 2 different potential sources for these names:

The standard harvesting / entity resolution process, which tries to pick the most commonly reported value across all the tables where the field shows up across all years (which should be the source of plant_name_eia and utility_name_eia)
The manually assigned PUDL plant & utility IDs, which also include a name. Those would be found in the ID mapping spreadsheets, and can also be looked up in the plants_pudl and utilities_pudl tables in the PUDL DB (which should be the source of plant_name_pudl and utility_name_pudl)

So maybe the plant & utility names are coming from the harvesting process on one branch, and the PUDL plant/util tables in the other somehow?

Also, given how frequently we end up using plant & utility names for record linkage, maybe we should be retaining all permutations that are ever reported so the matches are more robust.

It's a little unsettling that these differences exist, but given these special ambiguities in the source of these names, I don't know that I would consider it a blocking issue.

However, the fact that we rely on these names in multiple record linkage processes means that these differences might have unforeseen consequences, which maybe @cmgosnell and @katie-lamb would be able to speak to more intelligently.

jdangerx commented 1 year ago

Oh. Also the dagster-nightly-builds outputs aren't actually running nightly. So those are actually likely to be out of date. Just kicked off a run so we can take a look in the morning. The branch name should be dagster-asset-etl since dagster-nightly-builds is no more.

zschira commented 1 year ago

@jdangerx thanks! I ran the full ETL locally because I didn't want to wait for nightly builds to finish, but I'll use these outputs now.

@zaneselvans I spot checked a bunch of mismatched values, and neither branch consistently match the manually assigned ID's, so it seems something is slightly different in the harvesting process for some reason.

zaneselvans commented 1 year ago

@zschira That's weird! If anyone understands what might be going on there and how it might not be deterministic, I think it would be @cmgosnell. I don't think this should be a blocker, unless we think it'll have impacts on the record linkage processes which depend on plant & utility names.

bendnorman commented 1 year ago

Unrelated to this harvesting issue but we also need to filter out the table schemas that exist in pudl.metadata but don't actually have any data being loaded into the database for them. We should do this before merging into dev so datasette doesn't have a handful of new empty tables:

demand_hourly_pa_ferc714: etl_group=ferc714
entity_types_eia=static_eia_disabled
hourly_emissions_epacems: ur too big for sqlite. etl_group=epacems
plant_parts_eia: etl_group=outputs
respondent_id_ferc714: etl_group=ferc714
retained_earnings_appropriations_ferc1: etl_group=ferc1_disabled
utility_analysis: example output table that should be removed prior to merging into dev
epacems_output: example output table that should be removed prior to merging into dev

Simple solution would be to just filter out these tables when creating pudl_sqlite_io_manager. I think a better solution would be to create a new classmethod for pudl.metadata.Package that returns a package from a list of etl groups.

cmgosnell commented 1 year ago

the utility names are harvested, not grabbed from the pudl mapping sheet. the names in that sheet are purely for the humans.

Are the new EIA enviro association tables that Ella integrated merged into dagster? If not, then it could definitely just be the fact that those new tables had utilities in them and therefor more fodder for the harvesting process. Or did the work that Kurt was doing to stop dropping columns before the harvesting get merged into dev (and not yet merged back into dagster)?

the utility name in particular is harvested regardless of its consistency, so it just grabs the most consistent name across all of the tables/years even if it only shows up in 20% of the instances. Heck, it could be 5% if 5% was the most consistent value. so a slight influx of possible names could change the output (the default consistency required is 70%). The fact that these rows are changed up but did not increase or decrease makes me feel pretty okay about this small difference.

zaneselvans commented 1 year ago

@zschira the comparison is between the outputs from a particular commit that was on dev and the dagster branch with that commit merged in, right? To avoid potential discrepancies based on the underlying transforms / data.

Given that it's just these name columns that are seeing discrepancies, and the fact the difference only affects about 0.3% of all records in those columns, I think we should declare victory.

zschira commented 1 year ago

@zaneselvans yeah I've been comparing the commit merged into. I still haven't been able to nail down exactly what's going on, so if no one is too worried about it, I agree that we should probably call it good.

I also checked epacems today. I couldn't find a good parquet comparison tool, so I loaded all the data into Dask DataFrames and went year by year loading to pandas and using DataFrame.equals. Using this method I didn't find any discrepancies, so I think we can probably close this issue.

zaneselvans commented 1 year ago

@rousik This is the most recent effort we've made to compare DB outputs from different commits. @zschira would probably be the person to talk to about further automation / generalization for use in CI and/or nightly builds.

catalyst-cooperative / pudl