Closed zaneselvans closed 2 years ago
Plant 7238 raises an interesting conflict with the natural primary key of generation_eia923 (which, I know, is a different table). Generators 1, 2, 2A, 3, and 3A are all owned 50/50 by the same two entities. Generators 4 and 4A are owned 100% by a single entity. This makes attribution of fuel tricky for exclusion decisions (if all generators were owned in the same amounts by the same parties, then attribution remains consistent regardless of which generator is using what allocation of the fuel).
Is there a plan to attempt to harmonize the primary key for generation_fuel_eia923 with generation_923 by prorating fuel consumption by power generation across appropriate generators for the given fuel type?
I just spent many hours trying to solve this problem. What I came up with is that there is NO WAY to solve the problem as long as you have a generator_id-like thing (the nuclear_unit_id) in the gf_923 table. It MUST be transported to the gen_923 table. Otherwise all your joins get wonky because the two different generator-ish things behave like dueling keys in a join. If the gf_923 table stays fuel-related and the gen_923 table stays generator-related, then all the joins work like they are supposed to. My $0.02.
Yes, we're working on a method for allocating the more complete net generation (and fuel consumption) reported in the generation_fuel_eia923 table across the different generators, while taking into account the more granular per-generator and per-boiler data that is reported in the generation_eia923 and boiler_fuel_eia923 tables. It will unavoidably be an approximation though.
I'm not sure how this plays into the primary key though. IIRC, the natural primary key for the generation_fuel_eia923 table is (report_date, plant_id_eia, prime_mover_code, fuel_type). I think the nuclear_unit_id column actually contains generator IDs for nuclear units. Honestly I'm not sure why it's included.
The draft net generation allocation process we have right now looks at the energy_source_code and prime_mover_code fields in the generators_eia860 table, and uses them to merge the generation_fuel and generators table together on report_date, plant_id_eia, prime_mover_code, and fuel_type/energy_source_code(s) and then allocates the generation reported to a given combination of prime mover and fuel (within a particular plant-year) in proportion to the generator capacity if no per-generator generation is available, and in proportion to the reported per-generator generation if it is.
On Sun, May 16, 2021 at 9:42 PM Michael Tiemann @.***> wrote:
I just spent many hours trying to solve this problem. What I came up with is that there is NO WAY to solve the problem as long as you have a generator_id-like thing (the nuclear_unit_id) in the gf_923 table. It MUST be transported to the gen_923 table. Otherwise all your joins get wonky because the two different generator-ish things behave like dueling keys in a join. If the gf_923 table stays fuel-related and the gen_923 table stays generator-related, then all the joins work like they are supposed to. My $0.02.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/catalyst-cooperative/pudl/issues/851#issuecomment-841941034, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAERSN6H2427SO364QMW4BTTOB7ANANCNFSM4UWZMFNA .
-- Zane A. Selvans, PhD Chief Data Wrangler Catalyst Cooperative https://catalyst.coop @.*** Signal/WhatsApp/SMS: +1 720 443 1363 Twitter: @ZaneSelvans https://twitter.com/ZaneSelvans PGP https://www.gnupg.org/: 0x64F7B56F3A127B04
I've deleted this comment because I think I got more wrong than right. Better to remove the confusion than save it for the record.
I'm running into three main issues with this:
nuclear_unit_id
As stated above, nuclear_id_unit
needs to be in the natural key because the fuels are reported at the unit level for nuclear plants.
There are a couple of solution to this. None are amazing :/
report_date-plant_id_eia-fuel_type-prime_mover_code-fuel_type_code_aer
report_date-plant_id_eia-fuel_type-prime_mover_code-fuel_type_code_aer-nuclear_unit_id
fuel_type
sThere are about 5000 MSW fuels that are split by their fuel_type_code_aer codes into MLG and OTH.
Solution:
fuel_type_code_aer
in primary_key.prime_mover_codes
2001 and 2002 are missing prime_mover_codes
.
Solutions
I agree we should be keeping NA
values out of the primary keys.
We need to have the same primary key for nuclear and non-nuclear units be the same don't we, since they'll be in the same table? Maybe I'm not understanding what you mean by "create the primary key in the eia923 transform function".
If we went with the NOTNUC
option most of the records will have the same NOTNUC
value in the nuclear_unit_id
field right? IIRC all of the nuclear_unit_id
values are integers, but they also match 100% to the generator_id
value for nuclear plants, which is a string (often a string representation of an integer...) so having this be a string should be correct (is it already defined as a string?) Though I guess having NOTNUC
included in the nuclear_unit_id
field will preclude us from actually implementing a FK relationship using the field.
I don't think we want to introduce a dependency within the DB on the fuel_type_aer
field if we can help it. It's not used anywhere else in any of our data processing. It's also possible that this is an issue we have introduced. The MSW
fuel type (municipal solid waste) is defined in the EIA 860/923 instructions, including min/max heat content etc. However in the spreadsheet "file layout" tab there are two different kinds of "municipal solid waste" defined: MSB
(municipal solid waste: biogenic) and MSN
(municipal solid waste: non-biogenic). I lumped them together in the automatic code-fixer in the dedupe-metadata
branch because they seemed only to be semi-defined by EIA (and are a rarely used fuel anyway) but I don't think that code has propagated into the dev
branch. Is there a similar lumping or re-coding happening in the generation_fuel_eia923
transform step?
Are there also MSB
and MSN
records in the table? Do they show up in all years, or only in some of them? I could see these duplicate records also being a consequence of EIA not completely defining these different codes for different types of solid waste.
Doing a little googling around it looks like the AER fuel type code MLG
corresponds to the MSB
biogenic municipal waste, while they assign OTH
to the non-biogenic municipal solid waste, so I think you can re-code the fuel_type
column correctly based on the aer_fuel_type_code
and I will make sure to get rid of the MSB
& MSN
lumping that's happening in the auto-encoder.
Backfilling simple plants with a single PM and filling the others with UNK seems like the best of several not awesome options for now.
We've agreed to create a separate table for nuclear fuels because generation_fuels_eia923
records fuel at the unit level for nuclear plants and at the prime mover level for non-nuclear fuels.
You are absolutely correct about the duplicate fuel type codes. Prior to 2006, MSW contained biogenic and non-biogenic fuel types. Starting in 2006 MSW got split into MSB and MSN. I can use fuel_type_code_aer
to map "OTH" -> "MSB" and "MLG" -> "MSN".
1152 (19 plants over many years) duplicate records remain after remapping the fuel type codes. All of these records are missing all of their variable information (["fuel_consumed_units", "net_generation_mwh", ...]
). Is it ok to drop these records? If so can I do:
variable_fields = [
"fuel_consumed_units",
"fuel_consumed_for_electricity_units",
"fuel_mmbtu_per_unit",
"fuel_consumed_mmbtu",
"fuel_consumed_for_electricity_mmbtu",
"net_generation_mwh"
]
# drop all records missing all variable fields.
gen_fuel = gen_fuel.dropna(subset=variable_fields, how="all")
or should I do a targeted dropna() of the 19 plants where they are missing their variable fields?
I backfilled 2001 and 2002 prime_mover_code
for plants with one prime mover. Of the 5224
plants without a prime mover code, 3803 were backfilled. I filled in the remaining missing prime mover codes with 'UNK'. This resulted in 70 plant duplicates.
We could either:
prime_mover_codes
with 'UNK' after backfilling then remove the duplicate records. 266 records are missing fuel_type
. All of the variable fields contain information. There are a couple of options to resolve this:
Currently, setting fuel_type to "UNK" or fuel_type_code_aer does not produce a uniqueness error but it could in the future. Example (true values in parentheses):
plant_id_eia | report_date | fuel_type | prime_mover_code | fuel_type_code aer |
---|---|---|---|---|
1 | 1-1-2020 | UNK (BIT) | ST | COL |
1 | 1-1-2020 | UNK (LIG) | ST | COL |
It looks like there are about 45,000 rows in the table overall that have entirely null data columns. That's a little under 2% of the records. We'd be losing some information about the associations between plants / fuels / prime movers, but I'm not sure if that's useful if there's no data. The prime mover codes are associated with generator IDs, but there's no generator IDs here to connect them to.
I suspect that what's happened here is that there was some data reported in the year for which those null values are showing up for a given plant/prime/fuel combo, but not every month had data. So when we turn the wide-format data into long-format data, we end up getting whole null rows, even though they started out as just partially null rows. Poking around at it a little bit it seems like mostly these years of data are mixes of 0.0 and NaN
values.
I think I'm include to drop all the rows in which all of the data columns you listed are empty, but @cmgosnell might feel differently. Just dropping the ones where the Nulls coincide with a problematic primary key seems like it would definitely be okay.
Wow that's great! 3803 out of 5224!
After If there are duplicate records after you fill in the missing prime mover codes with UNK
is there any reason we wouldn't want to aggregate those duplicate UNK
records within each (plant, fuel type, date) group? Since we can't differentiate between them on the basis of prime mover. I'm a little unclear what you're saying in your comment since you say you filled them in with UNK
and then ask if you should fill them in with UNK
We can't use the AER fuel type codes in the EIA fuel type column since they come from different fuel taxonomies (and in the new metadata, having an AER fuel type code in the EIA fuel type code column will cause a foreign key constraint violation), but we could fill in with our best guess of EIA fuel type, based on the AER fuel type. There are also ranges of expected values for heat content per unit of fuel, and this table has both heat content (mmbtu) and units of fuel consumed in it, so that in combination with the AER fuel type code seems like it would be enough to make a decent guess.
It's also just 200 records out of 2 million, so you could also just fill in with the most commonly reported EIA fuel type that is seen in association with the available AER fuel type and I'm sure it would be good enough.
The
generation_fuel_eia923
should have a natural primary key of:report_date
plant_id_eia
fuel_type
prime_mover_code
nuclear_unit_id
However, there are a bunch of duplicate rows in which all but one of the duplicates is all NaN values. We should probably use a targeted
dropna()
in the transform step to yield a table with a natural primary key.Complication: the
nuclear_unit_id
field is NA for any non-nuclear plant, but required to differentiate between nuclear units in the table.This came out of issue #846