Closed jdangerx closed 1 year ago
TABLE_NAME_MAP
changeTABLE_NAME_MAP
data structure has changed so we'll need to update how we use it in pudl.extract.ferc1.create_raw_ferc1_assets()
and xbrl_metadata_json
. The xbrl
and dbf
key values can now be a string or a list of strings because some tables have multiple raw table dependencies. The income_statement_ferc1
and electricity_sales_by_rate_schedule_ferc1
tables have multiple raw table dependencies. create_raw_ferc1_assets
iterates through the raw dbf tables in TABLE_NAME_MAP
and creates SourceAssets
. SourceAssets
are also created for each instant and duration xbrl table.TABLE_NAME_MAP
is also used in the ferc1_transform_asset_factory
to programmatically generate the dependencies between the raw and clean assets. ferc1_transform_asset_factory
and create_raw_ferc1_assets
functions to handle table names and lists of table names.{"dbf": ["table1",... "tableN"], "xbrl": ["table1",... "tableN"]}
. Use this data structure to create the SourceAssets
. Instead of using ferc1_transform_asset_factory
to programmatically create the transform assets, manually create assets for each clean table. This way we are storing the raw -> clean table dependencies in the asset function args instead of in TABLE_NAME_MAP
.There are some new FERC tables from dev. I think we just need to update ferc1_tfr_classes
in pudl.transform.ferc1.create_ferc1_transform_assets()
It doesn't look like much has changed in the now deprecated src/pudl/etl.py
file. We'll have to make sure these recent etl.py changes to the metadata and static tables are retained in this merge.
For the TABLE_NAME_MAP
stuff - it sounds like
1) create_raw_ferc1_assets
would like just a list of all the dbf and xbrl tables that we could possibly want for transform purposes
2) ferc1_transform_asset_factory
would like some sort of map from cleaned table to all the raw tables it depends on, or we could abandon it and encode all these dependencies manually
Is that right?
Right now, TABLE_NAME_MAP[_FERC1]
provides both a list of all the raw tables as well as the dependencies, which I think is kind of nice - you don't have to worry about adding a dependency for a transformation and then forgetting to add that dependency to the extraction process.
On the other hand, I think dagster sort of wants us to store the dependencies in the asset args. I guess we're doing that anyways when we generate the assets from the map, so maybe this is kind of a moot point.
I think the resulting code structure is roughly "fine" either way, and would push, as always, for "shortest path to getting this merge sorted 😅 "
Yeah, I think sticking with the current structure of TABLE_NAME_MAP[_FERC1]
makes sense for now. We can change this easily down the line.
Once #2318 is merged into #2104 we can compare the database outputs. I think it makes sense to resolve #2296 prior to comparing the outputs so the ETL runs happen in the same environment and the outputs are accessible in cloud buckets. We can use the sqldiff tool to compare the outputs. Binaries for linux and Mac.
I ran sqldiff
with the --summary
option on the most recent dagster-nightly-builds and dev&prefix=&forceOnObjectsSortingFiltering=false) nightly build outputs with the following results:
balance_sheet_assets_ferc1: 0 changes, 0 inserts, 0 deletes, 238694 unchanged
balance_sheet_liabilities_ferc1: 0 changes, 0 inserts, 0 deletes, 203640 unchanged
balancing_authorities_eia: 0 changes, 0 inserts, 0 deletes, 88 unchanged
boiler_fuel_eia923: 0 changes, 0 inserts, 0 deletes, 1415748 unchanged
boiler_generator_assn_eia860: 0 changes, 0 inserts, 0 deletes, 129869 unchanged
boiler_generator_assn_types_eia: 0 changes, 0 inserts, 0 deletes, 2 unchanged
boilers_entity_eia: 0 changes, 0 inserts, 0 deletes, 6848 unchanged
cash_flow_ferc1: 0 changes, 0 inserts, 0 deletes, 281415 unchanged
coalmine_eia923: 0 changes, 0 inserts, 0 deletes, 4563 unchanged
coalmine_types_eia: 0 changes, 0 inserts, 0 deletes, 5 unchanged
contract_types_eia: 0 changes, 0 inserts, 0 deletes, 4 unchanged
data_maturities: 0 changes, 0 inserts, 0 deletes, 4 unchanged
datasources: 5 changes, 0 inserts, 0 deletes, 0 unchanged
demand_hourly_pa_ferc714: missing from second database
depreciation_amortization_summary_ferc1: 0 changes, 0 inserts, 0 deletes, 203525 unchanged
electric_energy_dispositions_ferc1: 0 changes, 0 inserts, 0 deletes, 24348 unchanged
electric_energy_sources_ferc1: 0 changes, 0 inserts, 0 deletes, 34784 unchanged
electric_operating_revenues_ferc1: 0 changes, 0 inserts, 0 deletes, 70958 unchanged
electric_opex_ferc1: 0 changes, 0 inserts, 0 deletes, 537046 unchanged
electric_plant_depreciation_changes_ferc1: 0 changes, 0 inserts, 0 deletes, 227195 unchanged
electric_plant_depreciation_functional_ferc1: 0 changes, 0 inserts, 0 deletes, 136362 unchanged
electricity_sales_by_rate_schedule_ferc1: 0 changes, 0 inserts, 0 deletes, 279280 unchanged
energy_sources_eia: 0 changes, 0 inserts, 0 deletes, 42 unchanged
entity_types_eia: missing from second database
epacamd_eia: 0 changes, 3900 inserts, 3875 deletes, 2532 unchanged
epacems_output: missing from second database
ferc_accounts: 0 changes, 0 inserts, 0 deletes, 97 unchanged
fuel_ferc1: 0 changes, 0 inserts, 0 deletes, 48841 unchanged
fuel_receipts_costs_aggs_eia: 0 changes, 0 inserts, 0 deletes, 500537 unchanged
fuel_receipts_costs_eia923: 0 changes, 0 inserts, 0 deletes, 608565 unchanged
fuel_transportation_modes_eia: 0 changes, 0 inserts, 0 deletes, 10 unchanged
fuel_types_aer_eia: 0 changes, 0 inserts, 0 deletes, 18 unchanged
generation_eia923: 0 changes, 0 inserts, 0 deletes, 604906 unchanged
generation_fuel_eia923: 0 changes, 0 inserts, 0 deletes, 2683865 unchanged
generation_fuel_nuclear_eia923: 0 changes, 0 inserts, 0 deletes, 24617 unchanged
generators_eia860: 0 changes, 0 inserts, 0 deletes, 523563 unchanged
generators_entity_eia: 0 changes, 0 inserts, 0 deletes, 36652 unchanged
hourly_emissions_epacems: missing from second database
income_statement_ferc1: 0 changes, 0 inserts, 0 deletes, 318644 unchanged
momentary_interruptions_eia: 0 changes, 0 inserts, 0 deletes, 3 unchanged
operational_status_eia: 0 changes, 0 inserts, 0 deletes, 14 unchanged
other_regulatory_liabilities_ferc1: 0 changes, 0 inserts, 0 deletes, 45433 unchanged
ownership_eia860: 0 changes, 0 inserts, 0 deletes, 84440 unchanged
plant_in_service_ferc1: 0 changes, 0 inserts, 0 deletes, 311890 unchanged
plant_parts_eia: missing from second database
plants_eia: 0 changes, 0 inserts, 0 deletes, 15525 unchanged
plants_eia860: 191 changes, 0 inserts, 0 deletes, 185360 unchanged
plants_entity_eia: 50 changes, 0 inserts, 0 deletes, 15464 unchanged
plants_ferc1: 0 changes, 0 inserts, 0 deletes, 7425 unchanged
plants_hydro_ferc1: 0 changes, 0 inserts, 0 deletes, 6796 unchanged
plants_pudl: 0 changes, 0 inserts, 0 deletes, 17227 unchanged
plants_pumped_storage_ferc1: 0 changes, 0 inserts, 0 deletes, 544 unchanged
plants_small_ferc1: 0 changes, 0 inserts, 0 deletes, 16235 unchanged
plants_steam_ferc1: 3396 changes, 10 inserts, 0 deletes, 27303 unchanged
political_subdivisions: 0 changes, 0 inserts, 0 deletes, 69 unchanged
power_purchase_types_ferc1: 0 changes, 0 inserts, 0 deletes, 9 unchanged
prime_movers_eia: 0 changes, 0 inserts, 0 deletes, 24 unchanged
purchased_power_ferc1: 0 changes, 0 inserts, 0 deletes, 197523 unchanged
reporting_frequencies_eia: 0 changes, 0 inserts, 0 deletes, 3 unchanged
respondent_id_ferc714: missing from second database
retained_earnings_appropriations_ferc1: missing from second database
retained_earnings_ferc1: 0 changes, 0 inserts, 0 deletes, 86723 unchanged
sector_consolidated_eia: 0 changes, 0 inserts, 0 deletes, 7 unchanged
steam_plant_types_eia: 0 changes, 0 inserts, 0 deletes, 4 unchanged
transmission_statistics_ferc1: 0 changes, 0 inserts, 0 deletes, 586836 unchanged
utilities_eia: 0 changes, 0 inserts, 0 deletes, 13679 unchanged
utilities_eia860: 0 changes, 0 inserts, 0 deletes, 119365 unchanged
utilities_entity_eia: 34 changes, 0 inserts, 0 deletes, 13637 unchanged
utilities_ferc1: 0 changes, 0 inserts, 0 deletes, 420 unchanged
utilities_ferc1_dbf: 0 changes, 0 inserts, 0 deletes, 411 unchanged
utilities_ferc1_xbrl: 0 changes, 0 inserts, 0 deletes, 269 unchanged
utilities_pudl: 0 changes, 0 inserts, 0 deletes, 14117 unchanged
utility_analysis: missing from second database
utility_plant_assn: 0 changes, 0 inserts, 0 deletes, 16127 unchanged
utility_plant_summary_ferc1: 0 changes, 0 inserts, 0 deletes, 168265 unchanged
The following tables contain changes that need to be investigated:
dev
contains 6432 and the on from dagster contains 6407. Need more investigation to determine the cause.balancing_authority_code_eia
. Needs further investigationplant_name_eia
values. Needs further investigation. To confirm that the changes are due to changes in dev
since the last merge of dev
into dagster-asset-etl
- we can run the sqldiff against an old commit in dev
.
In dagster-asset-etl
, we create empty tables in the sqlite database and then populate - this leads to some tables being empty. We should not generate those tables. There are "*_disabled" etl_group
s in dev
that we can use to do this filtering, though EPA CEMS might be an outlier here as well.
The ten inserts in plant_steam_ferc1
almost certainly come from the spot-fix PR that was merged into dev a few weeks ago, which added ten records to the table.
Just ran the comparison on the dev
commit we merged into dagster (31b074e). With this comparison, there are only 2 tables with changes!
plants_entity_eia: 50 changes, 0 inserts, 0 deletes, 15464 unchanged utilities_entity_eia: 34 changes, 0 inserts, 0 deletes, 13637 unchanged
Looking at these changes, they are all contained to plant_name_eia
in the plants_entity_eia
table, and utility_name_eia
in the utilities_entity_eia
. Many of these changes seem to be slight spelling changes, like: Novel Holmquist Solar LLC CSG
-> Novel Holmquist Solar LLC
. I'll keep investigating to see if I can find where these diffs are coming from, and we'll of course need to do this again once the new dev
->dagster
merge is complete.
Ran the comparisons again with the updated dagster version. I'm still seeing the same sort of changes in the plants_entity_eia
and utilities_entity_eia
tables. Otherwise everything else is identical, so I'll continue to investigate these two tables.
The same plants & utilities report varying names from year to year, and we have historically treated them as human readable labels and I think right now we may just be grabbing the most recently reported name, based on some sorted dataframe. So maybe the sorting is happening differently or not at all on one of the two branches?
There are also 2 different potential sources for these names:
plant_name_eia
and utility_name_eia
)plants_pudl
and utilities_pudl
tables in the PUDL DB (which should be the source of plant_name_pudl
and utility_name_pudl
)So maybe the plant & utility names are coming from the harvesting process on one branch, and the PUDL plant/util tables in the other somehow?
Also, given how frequently we end up using plant & utility names for record linkage, maybe we should be retaining all permutations that are ever reported so the matches are more robust.
It's a little unsettling that these differences exist, but given these special ambiguities in the source of these names, I don't know that I would consider it a blocking issue.
However, the fact that we rely on these names in multiple record linkage processes means that these differences might have unforeseen consequences, which maybe @cmgosnell and @katie-lamb would be able to speak to more intelligently.
Oh. Also the dagster-nightly-builds outputs aren't actually running nightly. So those are actually likely to be out of date. Just kicked off a run so we can take a look in the morning. The branch name should be dagster-asset-etl
since dagster-nightly-builds
is no more.
@jdangerx thanks! I ran the full ETL locally because I didn't want to wait for nightly builds to finish, but I'll use these outputs now.
@zaneselvans I spot checked a bunch of mismatched values, and neither branch consistently match the manually assigned ID's, so it seems something is slightly different in the harvesting process for some reason.
@zschira That's weird! If anyone understands what might be going on there and how it might not be deterministic, I think it would be @cmgosnell. I don't think this should be a blocker, unless we think it'll have impacts on the record linkage processes which depend on plant & utility names.
Unrelated to this harvesting issue but we also need to filter out the table schemas that exist in pudl.metadata
but don't actually have any data being loaded into the database for them. We should do this before merging into dev so datasette doesn't have a handful of new empty tables:
Simple solution would be to just filter out these tables when creating pudl_sqlite_io_manager
. I think a better solution would be to create a new classmethod for pudl.metadata.Package
that returns a package from a list of etl groups.
the utility names are harvested, not grabbed from the pudl mapping sheet. the names in that sheet are purely for the humans.
Are the new EIA enviro association tables that Ella integrated merged into dagster
? If not, then it could definitely just be the fact that those new tables had utilities in them and therefor more fodder for the harvesting process. Or did the work that Kurt was doing to stop dropping columns before the harvesting get merged into dev
(and not yet merged back into dagster
)?
the utility name in particular is harvested regardless of its consistency, so it just grabs the most consistent name across all of the tables/years even if it only shows up in 20% of the instances. Heck, it could be 5% if 5% was the most consistent value. so a slight influx of possible names could change the output (the default consistency required is 70%). The fact that these rows are changed up but did not increase or decrease makes me feel pretty okay about this small difference.
@zschira the comparison is between the outputs from a particular commit that was on dev
and the dagster branch with that commit merged in, right? To avoid potential discrepancies based on the underlying transforms / data.
Given that it's just these name columns that are seeing discrepancies, and the fact the difference only affects about 0.3% of all records in those columns, I think we should declare victory.
@zaneselvans yeah I've been comparing the commit merged into. I still haven't been able to nail down exactly what's going on, so if no one is too worried about it, I agree that we should probably call it good.
I also checked epacems
today. I couldn't find a good parquet comparison tool, so I loaded all the data into Dask DataFrames and went year by year loading to pandas and using DataFrame.equals
. Using this method I didn't find any discrepancies, so I think we can probably close this issue.
@rousik This is the most recent effort we've made to compare DB outputs from different commits. @zschira would probably be the person to talk to about further automation / generalization for use in CI and/or nightly builds.
Scope:
dev
intodagster_asset_etl