catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Convert EIA generation and fuel allocations to Dagster #2435

Closed zaneselvans closed 11 months ago

zaneselvans commented 1 year ago

Most of the code being migrated will be in pudl.analysis.allocate_net_gen

There are 3 versions of the net gen allocated table:

The main functions that organize the creation of these tables are:

The input data is generated / handled by:

It looks like those 4 functions are the only ones that need to be converted to use Dagster. However they need to happen at both yearly and monthly frequency, so we probably want a factory of interconnected assets.

- [x] Move `manually_fix_energy_source_codes` spot fix from `allocate_net_gen.py` into main ETL.
- [x] Create `gen_fuel_by_gen_esc` asset (agg only)
- [x] Compare new & old `gen_fuel_by_gen_esc` tables
- [x] Define `gen_fuel_by_gen_esc` table schema
- [x] Switch `gen_fuel_by_gen_esc` to SQLite IO Manager
- [x] Create `gen_fuel_by_gen` asset (agg only)
- [x] Compare new & old `gen_fuel_by_gen` tables
- [x] Define `gen_fuel_by_gen` table schema
- [x] Switch `gen_fuel_by_gen` to SQLite IO Manager
- [x] Create `gen_fuel_by_gen_owner` asset (agg only)
- [x] Compare new & old `gen_fuel_by_gen_esc` tables
- [x] Define `gen_fuel_by_gen_esc_owner` table schema
- [x] Switch `gen_fuel_by_gen_esc_owner` to SQLite IO Manager
- [x] Check for and fix any schema or foreign key violations
- [x] Compare DB versions of all new tables to the legacy `PudlTabl` versions.
- [x] Rewire `PudlTabl` to read these tables from the DB directly
- [x] Rewire `PudlTabl.gen_eia923()` to return generation from `gen_fuel_by_gen` when `fill_net_gen is True`
- [x] Add logic to `PudlTabl` to handle tables that require `self.freq is not None`
- [x] Remove the old `allocate_net_gen` functions.
- [x] Move stuff out of factory function, restore module structure as much as possible.
- [x] Add a `date_merge()` and create a valid monthly allocation by owner.
- [x] Only use overlapping years in the Net Gen Allocation process (no null EIA-860M data)
- [x] Run the full integration and data validation tests.
- [x] Update `allocate_net_gen` docstrings and unit tests.
- [x] Catalog existing net-gen function/table/method names and propose standardized names.

Data Problems (halp! @cmgosnell 🧠)

Design Questions

zaneselvans commented 1 year ago

@cmgosnell Do all of the generation_fuel_by_*_eia923 outputs only make sense when aggregated yearly or monthly? Like there's no "original" versions of them with all the records, correct?

zaneselvans commented 1 year ago

Data Validation Test Failures

I think these are all expected, with the 1.446% difference due to the start_date / end_date issue mentioned in the issue body above. Just need to update the numbers and decide what to do with the date selection.

FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 432567 rows, expected 432570. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 30339 rows, expected 30340. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 46407 rows, expected 46408. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 46407 rows, expected 46408. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 432567 rows, expected 432570. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 432599 rows, expected 432602. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5096719 rows, expected 5171497. Off by -1.446%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 362369 rows, expected 362381. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 555107 rows, expected 555119. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 555107 rows, expected 555119. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5096719 rows, expected 5171497. Off by -1.446%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5097103 rows, expected 5171881. Off by -1.446%, allowed margin of 0.000%

Full Integration Test Failures

These all look expected, except why is it complaining about PudlTabl.gen_fuel_by_generator_energy_source_eia923?

FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-gen_eia923-12.0-kwargs9] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-gen_fuel_by_generator_eia923-12.0-kwargs10] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-capacity_factor-12.0-kwargs15] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-mcoe-12.0-kwargs16] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_annual_eia_outputs[gen_fuel_by_generator_energy_source_eia923] - AttributeError: 'PudlTabl' object has no attribute 'gen_fuel_by_generator_energy_source_eia923'
FAILED test/integration/output_test.py::test_annual_eia_outputs[gen_fuel_by_generator_energy_source_owner_eia923] - AssertionError: Allocated net generation by owner can only be calculated annually. Got a frequency of: MS
cmgosnell commented 1 year ago

There's very low coverage of the capacity_mw and utility_id_eia column in the gen_fuel_by_gen_esc_own table, but only when it's aggregated to monthly frequency. It turns out this is because the ownership information is being merged on only annually rather than broadcast / date-merged, so in the monthly case only the January records end up with utility_id_eia and capacity_mw values. Since utility_id_eia is part of the primary key, this is invalid. So we need to either only create this particular table with annual frequency (is it only used for the FERC 1 to EIA entity matching?) or need to do a date_merge to bring in the utility ID and capacity information. For the moment I'm skipping generating this asset at monthly resolution but that feels a little weird.

That sounds like a thing that needs a date_merge for sure!

About 0.6% of all the records are coming out with different values for energy_source_code_num between the existing PudlTabl and the new Dagster calculation. However, this only happens sometimes. The rest of the time they match up 100%. My hunch is that this is happening when there's a tie between multiple energy source codes that are being filled in, and the ordering of something determines which value ends up in slot 1 vs. 2. If that's the case then this doesn't matter (and it was probably happening before but we never noticed...). If it's something else... maybe it still doesn't matter, since it's only 0.6% of all records.

I'd need you to tell me more about what actually changed in regards to the inputs in the dagster version to give any suggestion here.

When testing that the net gen allocation outputs from PudlTabl were consistent with those from reading data out of the DB directly or asking Dagster to give me the table, I discovered that for the monthly allocated gen_fuel tables they were not! If not start_date or end_date are given when creating a PudlTabl object, it looks up the earliest and latest possible dates using pudl.helpers.get_working_dates_by_datasource(). The start_date and end_date are used to restrict the records that are read out of the DB. However, in the case of the monthly allocated gen_fuel tables there are some dates which are after the latest possible date (e.g. right now they run through 2022-12-01 even though the latest date we have data for is the EIA-860M which goes through 2022-09-01). I think this is because there's some monthly time series filling going on in the generation allocation process. Is this really what we want to do? It looks like all of the data for 2022 is NA because the EIA-860M doesn't contain any generation or fuel. It seems like ideally we wouldn't be writing that whole year of NA data into the DB.

I... I don't know. I think it is this way rn because it is not super complicated or finicky to always keep these eia860m dates. If we wanted to do something different, the end_date would be different for different tables which sounds more finicky than its worth imho.

@cmgosnell Do all of the generation_fuelby*_eia923 outputs only make sense when aggregated yearly or monthly? Like there's no "original" versions of them with all the records, correct?

only annual or monthly makes sense. I think I added a freq requirement before.

zaneselvans commented 1 year ago

From discussion with @cmgosnell:

zaneselvans commented 1 year ago

The Naming of Things

The different aggregations / allocations of fuel and generation include:

Other names:

@cmgosnell are there other instances of naming in the generation fuel universe that we should be thinking about? Does anything here look crazy or wrong?

Possible names

e-belfer commented 11 months ago

Closed by #2527.