catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Run 2021 FERC 1 data through new, more complete extractor #2810

Closed: jdangerx closed this 3 months ago

jdangerx commented 10 months ago

We got the XBRL extractor to drop a lot less data! Let's make sure it works with just one year of data.

https://github.com/catalyst-cooperative/ferc-xbrl-extractor/issues/105

We should point PUDL at the `api_compat` branch of ferc-xbrl-extractor, which has both the data completeness fixes and some PUDL API compatibility fixes.

We should also use a branch based off of `explode_ferc1` for this testing, because `explode_ferc1` has a lot of changes to how we handle FERC 1 transformations.

## Scope
- [x] `ferc_to_sqlite` tasks complete successfully, targeting 2021 data
- [x] `norm_ferc1` asset group completes successfully with 2021 data in 10.5072/zenodo.1234455, while running on the feature branch associated with this issue
- [ ] data validation tests either pass or are updated
- [x] ferc-xbrl-extractor tests are updated to catch future regressions
- [x] FERC 714 assets continue to complete successfully

jdangerx commented 9 months ago

I'm getting these integration test failures locally:

```
=================================== short test summary info ===================================
FAILED test/integration/etl_test.py::test_pudl_engine - pudl.io_managers.ForeignKeyErrors: Foreign key error for table: denorm_plants_all_ferc1 --...
FAILED test/integration/ferc1_eia_train_test.py::test_generate_all_override_spreadsheets - TypeError: boolean value of NA is ambiguous
FAILED test/integration/glue_test.py::test_for_fk_validation_and_unmapped_ids[missing_plants_in_plants_ferc1] - AssertionError: Found 87 ['utility_id_ferc1', 'plant_name_ferc1']: MultiIndex([(156, 'tota...
FAILED test/integration/glue_test.py::test_for_unmapped_ids_minus_one[check_for_unmmapped_plants_in_plants_ferc1] - AssertionError: Found 88 ['utility_id_ferc1', 'plant_name_ferc1'] but expected 1.
== 4 failed, 79 passed, 4 skipped, 3 xfailed, 5 xpassed, 2895 warnings in 3988.70s (1:06:28) ==
```
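The glue failures come down to (`utility_id_ferc1`, `plant_name_ferc1`) pairs that show up in the FERC 1 plants data but not in our plant mapping. As a rough illustration of that kind of check (the frame names `plants` and `glue` are hypothetical stand-ins, not PUDL's actual test code):

```python
import pandas as pd

KEYS = ["utility_id_ferc1", "plant_name_ferc1"]

def find_unmapped_plants(plants: pd.DataFrame, glue: pd.DataFrame) -> pd.DataFrame:
    """Return key pairs present in the plants data but absent from the mapping."""
    merged = plants[KEYS].drop_duplicates().merge(
        glue[KEYS].drop_duplicates(), on=KEYS, how="left", indicator=True
    )
    return merged.loc[merged["_merge"] == "left_only", KEYS]
```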

And these validation test failures:

```
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fbp_ferc1-25421] - ValueError: fbp_ferc1: found 25423 rows, expected 25421. Off by 0.008%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-fuel_ferc1-48841] - ValueError: fuel_ferc1: found 48843 rows, expected 48841. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plant_in_service_ferc1-311986] - ValueError: plant_in_service_ferc1: found 311988 rows, expected 311986. Off by 0.001%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_all_ferc1-54284] - ValueError: plants_all_ferc1: found 54384 rows, expected 54284. Off by 0.184%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-plants_steam_ferc1-30709] - ValueError: plants_steam_ferc1: found 30809 rows, expected 30709. Off by 0.326%, allowed margin of 0.000%
FAILED test/validate/ferc1_test.py::test_minmax_rows[ferc1_annual-purchased_power_ferc1-197523] - ValueError: purchased_power_ferc1: found 197665 rows, expected 197523. Off by 0.072%, allowed margin of 0.000%
```
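For reference, the check behind `test_minmax_rows` amounts to a row count with a percentage tolerance. A minimal sketch that reproduces the error format above (a hypothetical re-implementation, not PUDL's actual validation code):

```python
import pandas as pd
import sqlalchemy as sa

def check_row_count(
    engine: sa.engine.Engine, table: str, expected: int, margin: float = 0.0
) -> None:
    """Raise if the table's row count deviates from expected by more than margin."""
    found = pd.read_sql(f"SELECT COUNT(*) AS n FROM {table}", engine)["n"].iloc[0]
    off = abs(found - expected) / expected
    if off > margin:
        raise ValueError(
            f"{table}: found {found} rows, expected {expected}. "
            f"Off by {off:.3%}, allowed margin of {margin:.3%}"
        )
```
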
jdangerx commented 9 months ago

The CI seems to be having some trouble - getting cancellation events out of the blue, usually a few minutes into taxonomy parsing.

@zschira and I noticed that the taxonomy parsing is somehow taking much longer than expected, so we might be running into some resource limits.

In other contexts, the `Parsing taxonomy...` step takes on the order of 10 seconds, but in the `xbrl2sqlite` context we're seeing much longer runtimes. For example, locally I have:

```
2023-09-18 15:24:45 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:257 Parsing taxonomy from <_io.BytesIO object at 0x297292bb0>
2023-09-18 15:26:59 [    INFO] catalystcoop.ferc_xbrl_extractor.xbrl:150 Finished batch 1/6
```
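That's about 2 minutes 14 seconds just to get from the start of taxonomy parsing to the end of the first batch, versus the ~10 seconds we see elsewhere:

```python
from datetime import datetime

start = datetime.fromisoformat("2023-09-18 15:24:45")   # "Parsing taxonomy from ..."
batch1 = datetime.fromisoformat("2023-09-18 15:26:59")  # "Finished batch 1/6"
print(batch1 - start)  # 0:02:14
```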
aesharpe commented 9 months ago

Using branch `2810-run-2021-ferc-1-data-through-new-more-complete-extractor` to run tests.

aesharpe commented 9 months ago

@zaneselvans @zschira

The integration errors mostly pertain to unmapped FERC 1 plant IDs from "total" records. There are already 227 total records in the `denorm_plants_all_ferc1` table. Do we want to:

A) remove all the totals (227 + 87);

B) map all 87 new totals so we have them all; or

C) remove just these 87 totals, because there is something ~ different ~ about them, seeing as they were recovered from this new extractor?
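(For concreteness, option A is roughly the filter below, applied to the upstream tables rather than to `denorm_plants_all_ferc1`. This is a hypothetical pandas sketch, not actual PUDL transform code.)

```python
import pandas as pd

def drop_total_records(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose plant_name_ferc1 is just "total" (case-insensitive)."""
    is_total = df["plant_name_ferc1"].str.strip().str.lower().eq("total")
    return df.loc[~is_total]
```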

zschira commented 9 months ago

@aesharpe tbh I don't know that I have enough context to give an informed opinion on the existing total records, but my inclination would be to just drop all of them. They're a very small fraction of the records in the table, and I don't personally see any compelling reason why the new ones should be treated differently from the existing ones.

zaneselvans commented 9 months ago

Can you explain what an "unmapped ferc1 plant ID from total record" is? When you say "plant ID" do you mean the plant name, which is "total"? Or do you mean the integer plant ID that's algorithmically assigned based on record similarity across years? Are the "total" records getting fed into the plant ID assignment process?

Where are the pre-existing "total" plants coming from? Are those junk records from the small plants table? Or were there also some total records from the large steam plants? Have we historically mapped "total" plant IDs? Or are they associated with unmapped plants?

I'm a little wary of dropping the old totals since the way plants are reported in the DBF data is so messy -- maybe in some cases the total is the only data that's available? But at the same time it's probably very low value data, and it may also result in double counting of things like capacity and generation when both the total and the non-total rows are present.

If we're going to drop them, shouldn't these totals be dropped not from `denorm_plants_all_ferc1` but from the upstream tables where they originate?

aesharpe commented 9 months ago

> Can you explain what an "unmapped ferc1 plant ID from total record" is?

When you run `tox -e get_unmapped_ids` on the `2810-run-2021-ferc-1-data-through-new-more-complete-extractor` branch, there are unmapped plants. These plants all say total in the `plant_name_ferc1` field.

> Are the "total" records getting fed into the plant ID assignment process?

I think that's what I'm talking about here: these total records are appearing in the plant ID assignment process.

> Where are the pre-existing "total" plants coming from?

Many come from the small plants table, but some come from the steam table. All the new total values come from the steam table.

> I'm a little wary of dropping the old totals since the way plants are reported in the DBF data is so messy

Generally I agree, but it feels weird to drop some but not all unless we have a very specific reason. Maybe that reason is just the messiness of the DBF files?

> If we're going to drop them, shouldn't these totals be dropped not from `denorm_plants_all_ferc1` but from the upstream tables where they originate?

Yeah, I was just using this table to look at all the potential "total" records in one place.

zaneselvans commented 9 months ago

I agree with @zschira's comment on Slack. There are totals in the DBF data, and there are existing XBRL totals that are already getting mapped (because they were reported in a way that made them show up with the old extractor). The new totals are only showing up because we've fixed the extractor to catch the ones that weren't coming through before, so they should be retained and mapped, for uniformity across these three categories.