The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
2024-06-25 15:58:00 [ INFO] catalystcoop.pudl.transform.classes:1361 core_ferc1__yearly_hydroelectric_plants_sched406: Enforcing database schema on dataframe.
2024-06-25 15:58:00 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46621 - core_ferc1__yearly_hydroelectric_plants_sched406 - STEP_FAILURE - Execution of step "core_ferc1__yearly_hydroelectric_plants_sched406" failed.
dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_hydroelectric_plants_sched406"::
ValueError: invalid literal for int() with base 10: '.2175': Error while type casting for column 'project_num'
Why: There is a decimal in the project_num column and we have this column dtyped as an int
Solution: Easy
If we think decimal is fine, change dtype
If we think decimal is type, bespoke replace with non-decimal
CG tried option 2 locally and... there are more than one of these.. I used removeprefix and it worked locally:
def transform_main(self, df):
"""Add bespoke removal of duplicate record after standard transform_main."""
df = super().transform_main(df).pipe(self.targeted_drop_duplicates)
df.project_num = df.project_num.str.removeprefix(".")
return df
2024-06-25 15:58:37 [ INFO] catalystcoop.pudl.transform.ferc1:2027 core_ferc1__yearly_operating_revenues_sched300: Processing XBRL metadata.
2024-06-25 15:58:37 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46684 - core_ferc1__yearly_operating_revenues_sched300 - STEP_FAILURE - Execution of step "core_ferc1__yearly_operating_revenues_sched300" failed.
dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_operating_revenues_sched300"::
AssertionError: We've applied 14 calculation fixes while we started with 13. Length of applied and original fixes should be the same.
replace_me= weight utility_type plant_function plant_status
table_name_parent xbrl_factoid_parent table_name xbrl_factoid
core_ferc1__yearly_operating_revenues_sched300 sales_to_ultimate_consumers core_ferc1__yearly_operating_revenues_sched300 large_or_industrial 1.0 NaN NaN NaN
small_or_commercial 1.0 NaN NaN NaN
add_me= weight utility_type plant_function plant_status
table_name_parent xbrl_factoid_parent table_name xbrl_factoid
core_ferc1__yearly_operating_revenues_sched300 other_operating_revenues core_ferc1__yearly_operating_revenues_sched300 forfeited_discounts 1.0 NaN NaN NaN
interdepartmental_rents 1.0 NaN NaN NaN
miscellaneous_revenue 1.0 NaN NaN NaN
miscellaneous_service_revenues 1.0 NaN NaN NaN
other_electric_revenue 1.0 NaN NaN NaN
other_miscellaneous_operating_revenues 1.0 NaN NaN NaN
regional_transmission_service_revenues 1.0 NaN NaN NaN
rent_from_electric_property 1.0 NaN NaN NaN
revenues_from_transmission_of_electricity_of_ot... 1.0 NaN NaN NaN
sales_of_water_and_water_power 1.0 NaN NaN NaN
sales_to_ultimate_consumers core_ferc1__yearly_sales_by_rate_schedules_sche... commercial_and_industrial NaN NaN NaN NaN
delete_me= weight utility_type plant_function plant_status
table_name_parent xbrl_factoid_parent table_name xbrl_factoid
core_ferc1__yearly_operating_revenues_sched300 sales_to_ultimate_consumers core_ferc1__yearly_sales_by_rate_schedules_sche... commercial_and_industrial NaN NaN NaN NaN
Why: It looks like commercial_and_industrial is trying to be added and deleted. Which means many the calculation components for this table changed with the new metadata.
Solution: Easy/Mild
Go track down what changed in the metadata and update xbrl_calculation_component_fixes.csv accordingly. So i double checked and it does look like this commercial_and_industrial calculation component that used to be in the metadata's calculation which we had been removing is now just removed! which is great. it means they fixed something 😎 . So the solution here is just to delete the line in the xbrl_calculation_component_fixes. CG did this locally and the rest of the transform transformed 🟢
2024-06-25 15:58:37 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46714 - core_ferc1__yearly_plant_in_service_sched204 - STEP_FAILURE - Execution of step "core_ferc1__yearly_plant_in_service_sched204" failed.
dagster._core.errors.DagsterExecutionLoadInputError: Error occurred while loading input "raw_xbrl_duration__electric_plant_in_service_204" of step "core_ferc1__yearly_plant_in_service_sched204"::
ValueError: Found 10505 non-null values with apply-diffsmethodology, and 10472 with best-snapshot. apply-diffs shouldn't be more than 0.3% greater than best-snapshot.
Why: Our apply-diffs methodology takes the last non-null value for a fact. Which is probably mostly fine, except when a respondent comes back in to null a previously reported value. This check is to ensure that we don't get too many more non-null records than is suggested with our best-snapshot methodology. In this run we got 0.31% greater non-null records using the apply-diffs methodology as compared to the best-snapshot and we expect that threshold to be only 0.3% or less.
Solutions: Medium
(temp) Up the threshold w/o looking just to see if the rest transform passes -> 🟢 it does! wahoo
Go check in on these extra non-null records to actually make a recommendation. Probably one of: up the threshold, change the filter_for_freshest_data method?
2024-06-25 15:58:52 [ INFO] catalystcoop.pudl.transform.classes:1276 core_ferc1__yearly_steam_plants_fuel_sched402: Categorizing string columns using a controlled vocabulary.
2024-06-25 15:58:52 [ WARNING] catalystcoop.pudl.transform.classes:474 fuel_type_code_pudl: Found 1 uncategorized values: {'natural gas/fuel oil'}
2024-06-25 15:58:52 [ WARNING] catalystcoop.pudl.transform.classes:474 fuel_units: Found 3 uncategorized values: {'coal - ton', 'oil- bbl', 'gas - mcf'}
2024-06-25 15:58:52 [ INFO] catalystcoop.pudl.transform.ferc1:3304 core_ferc1__yearly_steam_plants_fuel_sched402: Aggregating 88 rows with duplicate primary keys out of 4136 total rows.
2024-06-25 15:58:52 [ INFO] catalystcoop.pudl.transform.ferc1:3308 core_ferc1__yearly_steam_plants_fuel_sched402: Dropping 352 records with inconsistent fuel units preventing aggregation out of 4136 total rows.
2024-06-25 15:58:52 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46957 - core_ferc1__yearly_steam_plants_fuel_sched402 - STEP_FAILURE - Execution of step "core_ferc1__yearly_steam_plants_fuel_sched402" failed.
dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_steam_plants_fuel_sched402"::
ValueError: core_ferc1__yearly_steam_plants_fuel_sched402 (xbrl): Found null primary key values.
report_year False
utility_id_ferc1_xbrl False
fuel_type_code_pudl True
plant_name_ferc1 False
dtype: bool
Why: The categorization of the fuel_type_code_pudl during categorize_strings nulled one string which caused this error in defining the record_id bc the fuel type is a part of the pk of the original table/
Solution: Easy
Add the string into the FUEL_CATEGORIES
(CG tried this locally and it fixed the problem)
what hasn't been tested bc of these failures?
core_ferc1__yearly_plant_in_service_sched204 is one of the inputs to the exploded/detail table. So all of that stuff feeding into the rate base table isn't yet being tested. All of the intra-table calculations are being tested! Just none of the post-transform metadata/calculation tables or the detialed -> rate base assets
okay now that we have at least a local solution to all of the failures above running everything is now possible. Ahhh they are all green circles 🟢 🎉
Okay the tl;dr is that there are four failures. three are vv simple and already fixed locally. the fourth failure needs a little investigation but doesn't scream red flags.
Note: I used the new metadata only. presumably we will still want to do this #3650
Failures
🔴
core_ferc1__yearly_hydroelectric_plants_sched406
-> Easy 🟢Stage: enforce_schema (last step!)
Error:
Why: There is a decimal in the
project_num
column and we have this column dtyped as anint
Solution: Easy
🔴
core_ferc1__yearly_operating_revenues_sched300
-> Easy 🟢Stage:
apply_xbrl_calculation_fixes
(early)Why: It looks like
commercial_and_industrial
is trying to be added and deleted. Which means many the calculation components for this table changed with the new metadata.Solution: Easy/Mild Go track down what changed in the metadata and update
xbrl_calculation_component_fixes.csv
accordingly. So i double checked and it does look like thiscommercial_and_industrial
calculation component that used to be in the metadata's calculation which we had been removing is now just removed! which is great. it means they fixed something 😎 . So the solution here is just to delete the line in thexbrl_calculation_component_fixes
. CG did this locally and the rest of the transform transformed 🟢🔴
core_ferc1__yearly_plant_in_service_sched204
Stage:
pudl.io_managers.FercXBRLSQLiteIOManager.filter_for_freshest_data
(vv early)Error:
Why: Our apply-diffs methodology takes the last non-null value for a fact. Which is probably mostly fine, except when a respondent comes back in to null a previously reported value. This check is to ensure that we don't get too many more non-null records than is suggested with our best-snapshot methodology. In this run we got 0.31% greater non-null records using the apply-diffs methodology as compared to the best-snapshot and we expect that threshold to be only 0.3% or less.
Solutions: Medium
filter_for_freshest_data
method?🔴
core_ferc1__yearly_steam_plants_fuel_sched402
-> Easy 🟢Stage:
assign_record_id
(mid)Error:
Why: The categorization of the fuel_type_code_pudl during
categorize_strings
nulled one string which caused this error in defining the record_id bc the fuel type is a part of the pk of the original table/Solution: Easy Add the string into the
FUEL_CATEGORIES
(CG tried this locally and it fixed the problem)what hasn't been tested bc of these failures?
core_ferc1__yearly_plant_in_service_sched204
is one of the inputs to the exploded/detail table. So all of that stuff feeding into the rate base table isn't yet being tested. All of the intra-table calculations are being tested! Just none of the post-transform metadata/calculation tables or the detialed -> rate base assets