catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Add data validations for finished EIA-861 and FERC-714 tables #2427

Open zaneselvans opened 1 year ago

zaneselvans commented 1 year ago

We've got a bunch of new tables in the DB! We should figure out some criteria for validating them. We should restrict this to only the close-to-finished tables that we've actually done a lot of work on. The other raw_ and clean_ tables aren't expected to be valid yet.

Validations to include:

- [ ] `balancing_authority_assn_eia861`
- [ ] `balancing_authority_eia861`
- [ ] `demand_hourly_pa_ferc714`
- [ ] `respondent_id_ferc714`
- [ ] `sales_eia861`
- [ ] `service_territory_eia861`
- [ ] `utility_assn_eia861`

Wilson-Energy commented 12 months ago

Not sure if this is the right place for this, but here are some errors in the FERC 714 data, in the table `planning_area_hourly_demand_and_forecast_summer_and_winter_peak_demand_and_annual_net_energy_for_load_table_03_2_duration`:

  1. Oglethorpe (C003561) entered "23" rather than "2023" (etc.) for planning_area_hourly_demand_and_forecast_year
  2. Gulf (C001554) has a stray entry in row 288
  3. PJM (C000030) and NorthWestern (C001789) each have one extra 2033 entry for the 2022 report - this may be ok to retain but it is nonstandard
  4. Avista (C000379) has extra 2033-2045 entries for the 2022 report - overachiever!
  5. Southern (C001556) files "0" entries on this form - its data are filed by subsidiaries GPC, APC and MPC
  6. Minnetoka (C004098), Salt River Project (C004245), WAPA (C011370), and Sq Butte (C011562) have 2021 entries for the 2021 report
  7. Eight entities (C003677, C003749, C000822, C004245, C011367, C011370, C011544, C011508) have 2022 entries for the 2022 report
  8. Widespread errors in the annual energy forecast. In most cases, utilities report in the wrong units for certain years; in at least one case, maybe more, the errors are not 1000x or x/1000 but some other misplacement of the decimal. (This doesn't seem to be a problem with the seasonal peak forecast data.)
  9. The SQLite DB didn't have 2022 FERC 714 filings for Bonneville and Dominion SC (C002357, C000241) or 2021/2022 filings for Burbank (C011431). I obtained them from FERC's eCollection site.
  10. FERC's database has 2019 and 2020 data for Square Butte (R714260) but the energy data were wrong (and they are very lazy forecasters ...). I looked on FERC's website, and the energy data for 2019 are wrong and the 2020 filing is just missing.
  11. Some errors in FERC's db with planning years, such as Missouri River Energy Services in 2018.
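
Several of the items above (two-digit years, forecast years at or before the report year, entries past the usual horizon, and unit errors of roughly 1000x) lend themselves to simple automated checks. Below is a minimal sketch in plain Python, not PUDL's actual validation code. The row keys (`respondent_id`, `report_year`, `forecast_year`), the function names, the 10-year horizon (inferred from the 2033 entries in 2022 reports being called nonstandard), and the 100x threshold are all assumptions made for illustration.

```python
from statistics import median


def normalize_year(year: int, century: int = 2000) -> int:
    """Expand a two-digit year such as 23 to 2023; leave other years alone."""
    return year + century if 0 <= year < 100 else year


def check_forecast_years(rows, max_horizon=10):
    """Return (row, problem) pairs for suspicious forecast years.

    Each row is a dict with the hypothetical keys 'respondent_id',
    'report_year' and 'forecast_year'. max_horizon is an assumed limit
    on how far past the report year a forecast should extend.
    """
    problems = []
    for row in rows:
        forecast_year = normalize_year(row["forecast_year"])
        if forecast_year != row["forecast_year"]:
            problems.append((row, "two-digit forecast year"))
        offset = forecast_year - row["report_year"]
        if offset <= 0:
            problems.append((row, "forecast year not after report year"))
        elif offset > max_horizon:
            problems.append((row, "forecast year beyond expected horizon"))
    return problems


def flag_unit_outliers(values, factor=100.0):
    """Return positive values that differ from the median by at least `factor`x.

    Zeros are skipped, since some respondents file "0" entries.
    """
    positive = [v for v in values if v > 0]
    if not positive:
        return []
    mid = median(positive)
    return [v for v in positive if v / mid >= factor or mid / v >= factor]
```

A median-ratio check like this would only catch the large unit slips; decimal misplacements smaller than the chosen factor would still need manual review or a tighter, per-respondent threshold.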

Also, the filing names do not match the respondent names in the historical FERC 714 database (through 2020). There is also no number to cross-track between the two: the `entity_id` seems to be different from the two IDs referenced in the archived FERC database.

zaneselvans commented 11 months ago

Thank you for compiling the list of weirdness @Wilson-Energy!

Unfortunately there's no ID that directly links the old and new data, and we've had to manually map those relationships in the FERC Form 1, so I imagine we'll need to do something similar for the FERC-714 to get continuous time series that cover the whole span of years.
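
As a rough illustration of what a manual mapping could look like, here is a minimal sketch of a hand-maintained crosswalk used to stitch legacy and new records into one time series per respondent. This is not the existing FERC Form 1 mapping code; the IDs are placeholders, and the `(id, year, value)` row shape and the function name are assumptions.

```python
# Hypothetical, hand-maintained crosswalk: legacy respondent ID -> new entity ID.
# The IDs below are placeholders, not real FERC identifiers.
LEGACY_TO_NEW_ID = {
    "R_EXAMPLE_1": "C_EXAMPLE_1",
}


def stitch_series(legacy_rows, new_rows, crosswalk):
    """Combine legacy and new rows into one {year: value} series per new ID.

    Rows are (id, year, value) tuples. Legacy rows whose ID is missing from
    the crosswalk are left out and reported, so they can be mapped by hand.
    Where both sources cover the same year, the new filing wins.
    """
    series = {}
    unmapped = set()
    for legacy_id, year, value in legacy_rows:
        new_id = crosswalk.get(legacy_id)
        if new_id is None:
            unmapped.add(legacy_id)
            continue
        series.setdefault(new_id, {})[year] = value
    for new_id, year, value in new_rows:
        series.setdefault(new_id, {})[year] = value
    return series, sorted(unmapped)
```

Returning the unmapped legacy IDs alongside the stitched series gives a running to-do list for whoever maintains the crosswalk.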