Closed zaneselvans closed 11 months ago
@cmgosnell Do all of the generation_fuel_by_*_eia923
outputs only make sense when aggregated yearly or monthly? Like there's no "original" versions of them with all the records, correct?
I think these are all expected, with the 1.446% difference due to the start_date
/ end_date
issue mentioned in the issue body above. Just need to update the numbers and decide what to do with the date selection.
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 432567 rows, expected 432570. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 30339 rows, expected 30340. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 46407 rows, expected 46408. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 46407 rows, expected 46408. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 432567 rows, expected 432570. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 432599 rows, expected 432602. Off by -0.001%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5096719 rows, expected 5171497. Off by -1.446%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 362369 rows, expected 362381. Off by -0.003%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 555107 rows, expected 555119. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 555107 rows, expected 555119. Off by -0.002%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5096719 rows, expected 5171497. Off by -1.446%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5097103 rows, expected 5171881. Off by -1.446%, allowed margin of 0.000%
These all look expected, except why is it complaining about PudlTabl.gen_fuel_by_generator_energy_source_eia923
?
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-gen_eia923-12.0-kwargs9] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-gen_fuel_by_generator_eia923-12.0-kwargs10] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-capacity_factor-12.0-kwargs15] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_eia_outputs[gens_eia860-mcoe-12.0-kwargs16] - AssertionError: Expected ratio of distinct report_date values to be 12.0, but found 261 / 22 = 11.863636363636363
FAILED test/integration/output_test.py::test_annual_eia_outputs[gen_fuel_by_generator_energy_source_eia923] - AttributeError: 'PudlTabl' object has no attribute 'gen_fuel_by_generator_energy_source_eia923'
FAILED test/integration/output_test.py::test_annual_eia_outputs[gen_fuel_by_generator_energy_source_owner_eia923] - AssertionError: Allocated net generation by owner can only be calculated annually. Got a frequency of: MS
There's very low coverage of the capacity_mw and utility_id_eia column in the gen_fuel_by_gen_esc_own table, but only when it's aggregated to monthly frequency. It turns out this is because the ownership information is being merged on only annually rather than broadcast / date-merged, so in the monthly case only the January records end up with utility_id_eia and capacity_mw values. Since utility_id_eia is part of the primary key, this is invalid. So we need to either only create this particular table with annual frequency (is it only used for the FERC 1 to EIA entity matching?) or need to do a date_merge to bring in the utility ID and capacity information. For the moment I'm skipping generating this asset at monthly resolution but that feels a little weird.
That sounds like a thing that needs a date_merge
for sure!
About 0.6% of all the records are coming out with different values for energy_source_code_num between the existing PudlTabl and the new Dagster calculation. However, this only happens sometimes. The rest of the time they match up 100%. My hunch is that this is happening when there's a tie between multiple energy source codes that are being filled in, and the ordering of something determines which value ends up in slot 1 vs. 2. If that's the case then this doesn't matter (and it was probably happening before but we never noticed...). If it's something else... maybe it still doesn't matter, since it's only 0.6% of all records.
I'd need you to tell me more about what actually changed in regards to the inputs in the dagster version to give any suggestion here.
When testing that the net gen allocation outputs from PudlTabl were consistent with those from reading data out of the DB directly or asking Dagster to give me the table, I discovered that for the monthly allocated gen_fuel tables they were not! If not start_date or end_date are given when creating a PudlTabl object, it looks up the earliest and latest possible dates using pudl.helpers.get_working_dates_by_datasource(). The start_date and end_date are used to restrict the records that are read out of the DB. However, in the case of the monthly allocated gen_fuel tables there are some dates which are after the latest possible date (e.g. right now they run through 2022-12-01 even though the latest date we have data for is the EIA-860M which goes through 2022-09-01). I think this is because there's some monthly time series filling going on in the generation allocation process. Is this really what we want to do? It looks like all of the data for 2022 is NA because the EIA-860M doesn't contain any generation or fuel. It seems like ideally we wouldn't be writing that whole year of NA data into the DB.
I... I don't know. I think it is this way rn because it is not super complicated or finicky to always keep these eia860m dates. If we wanted to do something different, the end_date would be different for different tables which sounds more finicky than its worth imho.
@cmgosnell Do all of the generation_fuelby*_eia923 outputs only make sense when aggregated yearly or monthly? Like there's no "original" versions of them with all the records, correct?
only annual or monthly makes sense. I think I added a freq requirement before.
From discussion with @cmgosnell:
energy_source_code_num
values for now.The different aggregations / allocations of fuel and generation include:
denorm_generation_AGG_eia923
PudlTabl.gen_original_eia923()
denorm_generation_fuel_combined_AGG_eia923
PudlTabl.gf_eia923()
denorm_boiler_fuel_AGG_eia923
PudlTabl.bf_eia923()
generation_fuel_by_generator_energy_source_AGG_eia923
PudlTabl.gen_fuel_by_generator_energy_source_eia923
gen_fuel_by_gen_esc()
generation_fuel_by_generator_energy_source_owner_yearly_eia923
PudlTabl.gen_fuel_by_generator_energy_source_owner_eia923()
gen_fuel_by_gen_esc_owner()
generation_fuel_by_generator_AGG_eia923
PudlTabl.gen_fuel_by_generator_eia923()
gen_fuel_by_gen()
pudl.analysis.allocate_net_gen
the variable gen_pm_fuel
often used for dataframes reported by generator, prime mover, and fuel?pudl.analysis.allocate_net_gen
there's both _fuel
and _esc
(energy source code) to indicate a dataframe that has energy_source_code
as part of its PK.plant_prime_fuel
or plant_prime_mover
. Does plant_
always indicate that there is no generator ID present? Does plant_prime_fuel
mean (plant_id_eia
, prime_mover_code
, energy_source_code
)? How do we or should we differentiate between plant, prime mover, energy source vs. generator, energy source (where a combination of (report_date
, plant_id_eia
, generator_id
) necessarily implies a particular prime_mover_code
)? @cmgosnell are there other instances of naming in the generation fuel universe that we should be thinking about? Does anything here look crazy or wrong?
generation_fuel
(the original, un-allocated data)generation_by_generator
generation_fuel_by_energy_source_code
generation_fuel_by_owner
Closed by #2527.
Most of the code being migrated will be in
pudl.analysis.allocate_net_gen
There are 3 versions of the net gen allocated table:
gen_fuel_by_generator_energy_source
is the primary table from which the others are derived.gen_fuel_by_generator_eia923
is an aggregation of the above table.gen_fuel_by_generator_energy_source_owner_eia923
is an allocation of it.gen_fuel_by_generator_eia923
is also used to provide a more complete filled version ofgeneration_eia923
.The main functions that organize the creation of these tables are:
allocate_gen_fuel_by_generator_energy_source
aggregate_gen_fuel_by_generator
The input data is generated / handled by:
extract_input_tables
standardize_input_frequency
It looks like those 4 functions are the only ones that need to be converted to use Dagster. However they need to happen at both yearly and monthly frequency, so we probably want a factory of interconnected assets.
Data Problems (halp! @cmgosnell 🧠)
capacity_mw
andutility_id_eia
column in thegen_fuel_by_gen_esc_own
table, but only when it's aggregated to monthly frequency. It turns out this is because the ownership information is being merged on only annually rather than broadcast / date-merged, so in the monthly case only the January records end up withutility_id_eia
andcapacity_mw
values. Sinceutility_id_eia
is part of the primary key, this is invalid. So we need to either only create this particular table with annual frequency (is it only used for the FERC 1 to EIA entity matching?) or need to do adate_merge
to bring in the utility ID and capacity information. For the moment I'm skipping generating this asset at monthly resolution but that feels a little weird.energy_source_code_num
between the existing PudlTabl and the new Dagster calculation. However, this only happens sometimes. The rest of the time they match up 100%. My hunch is that this is happening when there's a tie between multiple energy source codes that are being filled in, and the ordering of something determines which value ends up in slot 1 vs. 2. If that's the case then this doesn't matter (and it was probably happening before but we never noticed...). If it's something else... maybe it still doesn't matter, since it's only 0.6% of all records.PudlTabl
were consistent with those from reading data out of the DB directly or asking Dagster to give me the table, I discovered that for the monthly allocatedgen_fuel
tables they were not! If notstart_date
orend_date
are given when creating aPudlTabl
object, it looks up the earliest and latest possible dates usingpudl.helpers.get_working_dates_by_datasource()
. Thestart_date
andend_date
are used to restrict the records that are read out of the DB. However, in the case of the monthly allocated gen_fuel tables there are some dates which are after the latest possible date (e.g. right now they run through2022-12-01
even though the latest date we have data for is the EIA-860M which goes through2022-09-01
). I think this is because there's some monthly time series filling going on in the generation allocation process. Is this really what we want to do? It looks like all of the data for 2022 is NA because the EIA-860M doesn't contain any generation or fuel. It seems like ideally we wouldn't be writing that whole year of NA data into the DB.Design Questions
energy_source_code_num
is actually a string likeenergy_source_code_3
which is confusing. Maybe a name change?gen_fuel_by_gen
table is denormalized. Why is that? Should all of these outputs be denormalized to include plant & utility names,unit_id_pudl
and other relevant IDs?utility_id_eia
in thegen_fuel_by_gen_esc_owner
table is confusing, since in this context (as in theownership_eia860
table) the utility here is the owner, not the operator, and we may want to indicate that. Also, if we denormalize this table and bring in additional plant & utility information it'll have a name collision with the operating utility.gen_fuel_by_gen
since it's just a simple subset of the columns.