Modify generation_fuel_eia923 transform to yield natural primary key

catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

https://catalyst.coop/pudl

MIT License

456 stars 106 forks source link

Modify generation_fuel_eia923 transform to yield natural primary key #851

Closed zaneselvans closed 2 years ago

zaneselvans commented 3 years ago

The generation_fuel_eia923 should have a natural primary key of:

report_date
plant_id_eia
fuel_type
prime_mover_code
nuclear_unit_id

However, there are a bunch of duplicate rows in which all but one of the duplicates is all NaN values. We should probably use a targeted dropna() in the transform step to yield a table with a natural primary key.

Complication: the nuclear_unit_id field is NA for any non-nuclear plant, but required to differentiate between nuclear units in the table.

This came out of issue #846

MichaelTiemannOSC commented 3 years ago

Plant 7238 raises an interesting conflict with the natural primary key of generation_eia923 (which, I know, is a different table). Generators 1, 2, 2A, 3, and 3A are all owned 50/50 by the same two entities. Generators 4 and 4A are owned 100% by a single entity. This makes attribution of fuel tricky for exclusion decisions (if all generators were owned in the same amounts by the same parties, then attribution remains consistent regardless of which generator is using what allocation of the fuel).

Is there a plan to attempt to harmonize the primary key for generation_fuel_eia923 with generation_923 by prorating fuel consumption by power generation across appropriate generators for the given fuel type?

MichaelTiemannOSC commented 3 years ago

I just spent many hours trying to solve this problem. What I came up with is that there is NO WAY to solve the problem as long as you have a generator_id-like thing (the nuclear_unit_id) in the gf_923 table. It MUST be transported to the gen_923 table. Otherwise all your joins get wonky because the two different generator-ish things behave like dueling keys in a join. If the gf_923 table stays fuel-related and the gen_923 table stays generator-related, then all the joins work like they are supposed to. My $0.02.

zaneselvans commented 3 years ago

Yes, we're working on a method for allocating the more complete net generation (and fuel consumption) reported in the generation_fuel_eia923 table across the different generators, while taking into account the more granular per-generator and per-boiler data that is reported in the generation_eia923 and boiler_fuel_eia923 tables. It will unavoidably be an approximation though.

I'm not sure how this plays into the primary key though. IIRC, the natural primary key for the generation_fuel_eia923 table is (report_date, plant_id_eia, prime_mover_code, fuel_type). I think the nuclear_unit_id column actually contains generator IDs for nuclear units. Honestly I'm not sure why it's included.

The draft net generation allocation process we have right now looks at the energy_source_code and prime_mover_code fields in the generators_eia860 table, and uses them to merge the generation_fuel and generators table together on report_date, plant_id_eia, prime_mover_code, and fuel_type/energy_source_code(s) and then allocates the generation reported to a given combination of prime mover and fuel (within a particular plant-year) in proportion to the generator capacity if no per-generator generation is available, and in proportion to the reported per-generator generation if it is.

On Sun, May 16, 2021 at 9:42 PM Michael Tiemann @.***> wrote:

I just spent many hours trying to solve this problem. What I came up with is that there is NO WAY to solve the problem as long as you have a generator_id-like thing (the nuclear_unit_id) in the gf_923 table. It MUST be transported to the gen_923 table. Otherwise all your joins get wonky because the two different generator-ish things behave like dueling keys in a join. If the gf_923 table stays fuel-related and the gen_923 table stays generator-related, then all the joins work like they are supposed to. My $0.02.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/catalyst-cooperative/pudl/issues/851#issuecomment-841941034, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAERSN6H2427SO364QMW4BTTOB7ANANCNFSM4UWZMFNA .

-- Zane A. Selvans, PhD Chief Data Wrangler Catalyst Cooperative https://catalyst.coop @.*** Signal/WhatsApp/SMS: +1 720 443 1363 Twitter: @ZaneSelvans https://twitter.com/ZaneSelvans PGP https://www.gnupg.org/: 0x64F7B56F3A127B04

MichaelTiemannOSC commented 3 years ago

I've deleted this comment because I think I got more wrong than right. Better to remove the confusion than save it for the record.

bendnorman commented 2 years ago

I'm running into three main issues with this:

`nuclear_unit_id`

As stated above, nuclear_id_unit needs to be in the natural key because the fuels are reported at the unit level for nuclear plants.

There are a couple of solution to this. None are amazing :/

Apparently sqlite allows nulls in primary keys. I don’t think this is great practice and will cause issues if we ever switch from sqlite to postgres.
Fill in nuclear_ids with “NOTNUC”. This will work but is kind of a hack.
Create the primary key in the eia923 transform function.
- Non nuclear primary keys will be:report_date-plant_id_eia-fuel_type-prime_mover_code-fuel_type_code_aer
- nuclear primary keys will be:report_date-plant_id_eia-fuel_type-prime_mover_code-fuel_type_code_aer-nuclear_unit_id
- This is nice because nuclear_unit_id column can retain NaNs and the primary key will work if we switch too postgres.

Duplicate `fuel_type`s

There are about 5000 MSW fuels that are split by their fuel_type_code_aer codes into MLG and OTH.

Solution:

Include fuel_type_code_aer in primary_key.
Aggregate MSW variable fields in some way.

Missing `prime_mover_codes`

2001 and 2002 are missing prime_mover_codes.

Solutions

Fill with UNK. This allows us to keep these years but might violate a uniqueness constraint.
back fill for plants with one prime mover and fill others with UNK.
Drop 2001 and 2002 :(
Longer term solution: Use entity resolution to backfill from future years.

zaneselvans commented 2 years ago

I agree we should be keeping NA values out of the primary keys.

nuclear unit IDs

We need to have the same primary key for nuclear and non-nuclear units be the same don't we, since they'll be in the same table? Maybe I'm not understanding what you mean by "create the primary key in the eia923 transform function".

If we went with the NOTNUC option most of the records will have the same NOTNUC value in the nuclear_unit_id field right? IIRC all of the nuclear_unit_id values are integers, but they also match 100% to the generator_id value for nuclear plants, which is a string (often a string representation of an integer...) so having this be a string should be correct (is it already defined as a string?) Though I guess having NOTNUC included in the nuclear_unit_id field will preclude us from actually implementing a FK relationship using the field.

duplicate fuel types

I don't think we want to introduce a dependency within the DB on the fuel_type_aer field if we can help it. It's not used anywhere else in any of our data processing. It's also possible that this is an issue we have introduced. The MSW fuel type (municipal solid waste) is defined in the EIA 860/923 instructions, including min/max heat content etc. However in the spreadsheet "file layout" tab there are two different kinds of "municipal solid waste" defined: MSB (municipal solid waste: biogenic) and MSN (municipal solid waste: non-biogenic). I lumped them together in the automatic code-fixer in the dedupe-metadata branch because they seemed only to be semi-defined by EIA (and are a rarely used fuel anyway) but I don't think that code has propagated into the dev branch. Is there a similar lumping or re-coding happening in the generation_fuel_eia923 transform step?

Are there also MSB and MSN records in the table? Do they show up in all years, or only in some of them? I could see these duplicate records also being a consequence of EIA not completely defining these different codes for different types of solid waste.

Doing a little googling around it looks like the AER fuel type code MLG corresponds to the MSB biogenic municipal waste, while they assign OTH to the non-biogenic municipal solid waste, so I think you can re-code the fuel_type column correctly based on the aer_fuel_type_code and I will make sure to get rid of the MSB & MSN lumping that's happening in the auto-encoder.

Missing prime movers

Backfilling simple plants with a single PM and filling the others with UNK seems like the best of several not awesome options for now.

bendnorman commented 2 years ago

nuclear unit IDs

We've agreed to create a separate table for nuclear fuels because generation_fuels_eia923 records fuel at the unit level for nuclear plants and at the prime mover level for non-nuclear fuels.

duplicate fuel_types

You are absolutely correct about the duplicate fuel type codes. Prior to 2006, MSW contained biogenic and non-biogenic fuel types. Starting in 2006 MSW got split into MSB and MSN. I can use fuel_type_code_aer to map "OTH" -> "MSB" and "MLG" -> "MSN".

1152 (19 plants over many years) duplicate records remain after remapping the fuel type codes. All of these records are missing all of their variable information (["fuel_consumed_units", "net_generation_mwh", ...]). Is it ok to drop these records? If so can I do:

variable_fields = [
    "fuel_consumed_units",
    "fuel_consumed_for_electricity_units",
    "fuel_mmbtu_per_unit",
    "fuel_consumed_mmbtu",
    "fuel_consumed_for_electricity_mmbtu",
    "net_generation_mwh"
]

# drop all records missing all variable fields.
gen_fuel = gen_fuel.dropna(subset=variable_fields, how="all")

or should I do a targeted dropna() of the 19 plants where they are missing their variable fields?

missing prime movers

I backfilled 2001 and 2002 prime_mover_code for plants with one prime mover. Of the 5224 plants without a prime mover code, 3803 were backfilled. I filled in the remaining missing prime mover codes with 'UNK'. This resulted in 70 plant duplicates.

We could either:

drop the records we are unable to backfill or
fill in the missing prime_mover_codes with 'UNK' after backfilling then remove the duplicate records.

missing fuel types (new issue!)

266 records are missing fuel_type. All of the variable fields contain information. There are a couple of options to resolve this:

Drop these records. I don't love it because they do contain information in the variable fields
Replace fuel_type with 'UNK'.
Replace fuel_type with fuel_type_code_aer. All of these records have fuel_type_code_aer.

Currently, setting fuel_type to "UNK" or fuel_type_code_aer does not produce a uniqueness error but it could in the future. Example (true values in parentheses):

plant_id_eia	report_date	fuel_type	prime_mover_code	fuel_type_code aer
1	1-1-2020	UNK (BIT)	ST	COL
1	1-1-2020	UNK (LIG)	ST	COL

zaneselvans commented 2 years ago

Null data rows

It looks like there are about 45,000 rows in the table overall that have entirely null data columns. That's a little under 2% of the records. We'd be losing some information about the associations between plants / fuels / prime movers, but I'm not sure if that's useful if there's no data. The prime mover codes are associated with generator IDs, but there's no generator IDs here to connect them to.

I suspect that what's happened here is that there was some data reported in the year for which those null values are showing up for a given plant/prime/fuel combo, but not every month had data. So when we turn the wide-format data into long-format data, we end up getting whole null rows, even though they started out as just partially null rows. Poking around at it a little bit it seems like mostly these years of data are mixes of 0.0 and NaN values.

I think I'm include to drop all the rows in which all of the data columns you listed are empty, but @cmgosnell might feel differently. Just dropping the ones where the Nulls coincide with a problematic primary key seems like it would definitely be okay.

Missing prime movers

Wow that's great! 3803 out of 5224!

After If there are duplicate records after you fill in the missing prime mover codes with UNK is there any reason we wouldn't want to aggregate those duplicate UNK records within each (plant, fuel type, date) group? Since we can't differentiate between them on the basis of prime mover. I'm a little unclear what you're saying in your comment since you say you filled them in with UNK and then ask if you should fill them in with UNK

Missing fuel types

We can't use the AER fuel type codes in the EIA fuel type column since they come from different fuel taxonomies (and in the new metadata, having an AER fuel type code in the EIA fuel type code column will cause a foreign key constraint violation), but we could fill in with our best guess of EIA fuel type, based on the AER fuel type. There are also ranges of expected values for heat content per unit of fuel, and this table has both heat content (mmbtu) and units of fuel consumed in it, so that in combination with the AER fuel type code seems like it would be enough to make a decent guess.

It's also just 200 records out of 2 million, so you could also just fill in with the most commonly reported EIA fuel type that is seen in association with the available AER fuel type and I'm sure it would be good enough.

catalyst-cooperative / pudl

Modify generation_fuel_eia923 transform to yield natural primary key #851

nuclear_unit_id

Duplicate fuel_types

Missing prime_mover_codes

nuclear unit IDs

duplicate fuel types

Missing prime movers

nuclear unit IDs

duplicate fuel_types

missing prime movers

missing fuel types (new issue!)

Null data rows

Missing prime movers

Missing fuel types

`nuclear_unit_id`

Duplicate `fuel_type`s

Missing `prime_mover_codes`