catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Try to run transformation of new year of FERC 1 data and document results #3698

Closed cmgosnell closed 5 days ago

cmgosnell commented 5 days ago
### Tasks
- [x] Run new FERC 2023 sqlite db & new metadata through transform
- [x] Document any failures
- [x] Document any good things as well!

Note: I used the new metadata only. presumably we will still want to do this #3650

Failures

🔴 core_ferc1__yearly_hydroelectric_plants_sched406 -> Easy 🟢

Stage: enforce_schema (last step!)

Error:

2024-06-25 15:58:00 [    INFO] catalystcoop.pudl.transform.classes:1361 core_ferc1__yearly_hydroelectric_plants_sched406: Enforcing database schema on dataframe.
2024-06-25 15:58:00 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46621 - core_ferc1__yearly_hydroelectric_plants_sched406 - STEP_FAILURE - Execution of step "core_ferc1__yearly_hydroelectric_plants_sched406" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_hydroelectric_plants_sched406"::

ValueError: invalid literal for int() with base 10: '.2175': Error while type casting for column 'project_num'

Why: There is a decimal in the project_num column and we have this column dtyped as an int

Solution: Easy

  1. If we think decimal is fine, change dtype
  2. If we think decimal is type, bespoke replace with non-decimal CG tried option 2 locally and... there are more than one of these.. I used removeprefix and it worked locally:
    def transform_main(self, df):
        """Add bespoke removal of duplicate record after standard transform_main."""
        df = super().transform_main(df).pipe(self.targeted_drop_duplicates)
        df.project_num = df.project_num.str.removeprefix(".")
        return df

🔴 core_ferc1__yearly_operating_revenues_sched300 -> Easy 🟢

Stage: apply_xbrl_calculation_fixes (early)

2024-06-25 15:58:37 [    INFO] catalystcoop.pudl.transform.ferc1:2027 core_ferc1__yearly_operating_revenues_sched300: Processing XBRL metadata.
2024-06-25 15:58:37 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46684 - core_ferc1__yearly_operating_revenues_sched300 - STEP_FAILURE - Execution of step "core_ferc1__yearly_operating_revenues_sched300" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_operating_revenues_sched300"::

AssertionError: We've applied 14 calculation fixes while we started with 13. Length of applied and original fixes should be the same.
replace_me=                                                                                                                                               weight utility_type plant_function plant_status
table_name_parent                              xbrl_factoid_parent         table_name                                     xbrl_factoid                                                        
core_ferc1__yearly_operating_revenues_sched300 sales_to_ultimate_consumers core_ferc1__yearly_operating_revenues_sched300 large_or_industrial     1.0          NaN            NaN          NaN
                                                                                                                          small_or_commercial     1.0          NaN            NaN          NaN
add_me=                                                                                                                                                                                  weight utility_type plant_function plant_status
table_name_parent                              xbrl_factoid_parent         table_name                                         xbrl_factoid                                                                                       
core_ferc1__yearly_operating_revenues_sched300 other_operating_revenues    core_ferc1__yearly_operating_revenues_sched300     forfeited_discounts                                    1.0          NaN            NaN          NaN
                                                                                                                              interdepartmental_rents                                1.0          NaN            NaN          NaN
                                                                                                                              miscellaneous_revenue                                  1.0          NaN            NaN          NaN
                                                                                                                              miscellaneous_service_revenues                         1.0          NaN            NaN          NaN
                                                                                                                              other_electric_revenue                                 1.0          NaN            NaN          NaN
                                                                                                                              other_miscellaneous_operating_revenues                 1.0          NaN            NaN          NaN
                                                                                                                              regional_transmission_service_revenues                 1.0          NaN            NaN          NaN
                                                                                                                              rent_from_electric_property                            1.0          NaN            NaN          NaN
                                                                                                                              revenues_from_transmission_of_electricity_of_ot...     1.0          NaN            NaN          NaN
                                                                                                                              sales_of_water_and_water_power                         1.0          NaN            NaN          NaN
                                               sales_to_ultimate_consumers core_ferc1__yearly_sales_by_rate_schedules_sche... commercial_and_industrial                              NaN          NaN            NaN          NaN
delete_me=                                                                                                                                                         weight utility_type plant_function plant_status
table_name_parent                              xbrl_factoid_parent         table_name                                         xbrl_factoid                                                              
core_ferc1__yearly_operating_revenues_sched300 sales_to_ultimate_consumers core_ferc1__yearly_sales_by_rate_schedules_sche... commercial_and_industrial     NaN          NaN            NaN          NaN

Why: It looks like commercial_and_industrial is trying to be added and deleted. Which means many the calculation components for this table changed with the new metadata.

Solution: Easy/Mild Go track down what changed in the metadata and update xbrl_calculation_component_fixes.csv accordingly. So i double checked and it does look like this commercial_and_industrial calculation component that used to be in the metadata's calculation which we had been removing is now just removed! which is great. it means they fixed something 😎 . So the solution here is just to delete the line in the xbrl_calculation_component_fixes. CG did this locally and the rest of the transform transformed 🟢

🔴 core_ferc1__yearly_plant_in_service_sched204

Stage: pudl.io_managers.FercXBRLSQLiteIOManager.filter_for_freshest_data (vv early)

Error:

2024-06-25 15:58:37 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46714 - core_ferc1__yearly_plant_in_service_sched204 - STEP_FAILURE - Execution of step "core_ferc1__yearly_plant_in_service_sched204" failed.

dagster._core.errors.DagsterExecutionLoadInputError: Error occurred while loading input "raw_xbrl_duration__electric_plant_in_service_204" of step "core_ferc1__yearly_plant_in_service_sched204"::

ValueError: Found 10505 non-null values with apply-diffsmethodology, and 10472 with best-snapshot. apply-diffs shouldn't be more than 0.3% greater than best-snapshot.

Why: Our apply-diffs methodology takes the last non-null value for a fact. Which is probably mostly fine, except when a respondent comes back in to null a previously reported value. This check is to ensure that we don't get too many more non-null records than is suggested with our best-snapshot methodology. In this run we got 0.31% greater non-null records using the apply-diffs methodology as compared to the best-snapshot and we expect that threshold to be only 0.3% or less.

Solutions: Medium

  1. (temp) Up the threshold w/o looking just to see if the rest transform passes -> 🟢 it does! wahoo
  2. Go check in on these extra non-null records to actually make a recommendation. Probably one of: up the threshold, change the filter_for_freshest_data method?
  3. Also, update the error message to be more clear.

🔴 core_ferc1__yearly_steam_plants_fuel_sched402 -> Easy 🟢

Stage: assign_record_id (mid)

Error:

2024-06-25 15:58:52 [    INFO] catalystcoop.pudl.transform.classes:1276 core_ferc1__yearly_steam_plants_fuel_sched402: Categorizing string columns using a controlled vocabulary.
2024-06-25 15:58:52 [ WARNING] catalystcoop.pudl.transform.classes:474 fuel_type_code_pudl: Found 1 uncategorized values: {'natural gas/fuel oil'}
2024-06-25 15:58:52 [ WARNING] catalystcoop.pudl.transform.classes:474 fuel_units: Found 3 uncategorized values: {'coal - ton', 'oil- bbl', 'gas - mcf'}
2024-06-25 15:58:52 [    INFO] catalystcoop.pudl.transform.ferc1:3304 core_ferc1__yearly_steam_plants_fuel_sched402: Aggregating 88 rows with duplicate primary keys out of 4136 total rows.
2024-06-25 15:58:52 [    INFO] catalystcoop.pudl.transform.ferc1:3308 core_ferc1__yearly_steam_plants_fuel_sched402: Dropping 352 records with inconsistent fuel units preventing aggregation out of 4136 total rows.
2024-06-25 15:58:52 -0400 - dagster - ERROR - etl_full - 17627cb0-37f2-41fa-99f8-980cec7dac31 - 46957 - core_ferc1__yearly_steam_plants_fuel_sched402 - STEP_FAILURE - Execution of step "core_ferc1__yearly_steam_plants_fuel_sched402" failed.

dagster._core.errors.DagsterExecutionStepExecutionError: Error occurred while executing op "core_ferc1__yearly_steam_plants_fuel_sched402"::

ValueError: core_ferc1__yearly_steam_plants_fuel_sched402 (xbrl): Found null primary key values.
report_year              False
utility_id_ferc1_xbrl    False
fuel_type_code_pudl       True
plant_name_ferc1         False
dtype: bool

Why: The categorization of the fuel_type_code_pudl during categorize_strings nulled one string which caused this error in defining the record_id bc the fuel type is a part of the pk of the original table/

Solution: Easy Add the string into the FUEL_CATEGORIES (CG tried this locally and it fixed the problem)

what hasn't been tested bc of these failures?

cmgosnell commented 5 days ago

Okay the tl;dr is that there are four failures. three are vv simple and already fixed locally. the fourth failure needs a little investigation but doesn't scream red flags.