catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Make CEMS aware of timezones #250

Closed karldw closed 5 years ago

karldw commented 5 years ago

Issue

The hourly CEMS times are in local ("naive") timezones, which makes things tricky when you try to line them up with other data sources.

Proposed solution

We talked about timezone cleaning a while ago in https://github.com/catalyst-cooperative/pudl/issues/171#issuecomment-398269132, but it seemed better to talk about the separate details in separate issues.

Alternatives

Additional context

Unmatched plant IDs | plant_id_eia|plant_name |last_date | |------------:|:-------------------------------------|:----------| | 5|Chickasaw |2002-09-30 | | 334|Riverside Canal Power Company |2002-12-31 | | 569|English Station |1999-03-31 | | 596|Madison Street |2005-09-30 | | 1372|Henderson I |2008-09-30 | | 1450|Natchitoches |2010-06-30 | | 1458|Ruston |2007-12-31 | | 1470|Graham Station |1998-09-30 | | 1496|Mason Steam |2004-03-31 | | 1594|Blackstone |2017-12-31 | | 2440|AbiBow US Inc. - Catawba Operations |2017-09-30 | | 2473|North Lovington |2000-03-31 | | 2529|AES Hickling |2007-12-31 | | 2531|AES Jennison |2007-12-31 | | 2858|Gorge |1999-06-30 | | 3451|Dallas |1998-12-31 | | 3461|AES Western Power, LLC |2008-12-31 | | 3480|Denver City |1999-09-30 | | 3523|Oak Creek Power Station |2006-12-31 | | 3524|Paint Creek Power Station |2008-03-31 | | 3526|Rio Pecos Power Station |2007-12-31 | | 3527|San Angelo Power Station |2007-12-31 | | 4036|Commerce |2000-03-31 | | 4233|Buchanan Substation |2001-09-30 | | 4938|Fort Phantom Power Station |2007-12-31 | | 7254|Reedy Creek |2017-12-31 | | 7258|48th Street Peaking Station |2017-12-31 | | 7762|Calpine Hidalgo Energy Center |2017-12-31 | | 7765|Dahlberg (Jackson County) |2017-12-31 | | 7945|Finley Combustion Turbine |2007-03-31 | | 7996|Jasper County Generating Facility |2005-03-31 | | 8058|Donald Von Raesfeld |2005-12-31 | | 10114|Goodyear Tire & Rubber Company |2007-03-31 | | 10321|Liberty Fibers Corporation |2005-09-30 | | 10522|Indeck-Pepperell |2006-12-31 | | 10616|Milford Power (10616) |2004-09-30 | | 10618|South Glens Falls Energy |2006-12-31 | | 10619|Allegany Station No. 133 |2017-12-31 | | 10628|International Paper, Erie Mill |2002-03-31 | | 13213|BTEC New Albany LLC |2008-12-31 | | 14013|The Ohio State University |2005-09-30 | | 50030|Nelson Industrial Steam Company |2017-12-31 | | 50044|The Ohio State University |2017-09-30 | | 50459|Ilion Energy Center |2006-12-31 | | 50607|Trigen Energy - Schuylkill |2017-12-31 | | 50797|Calpine Newark Cogeneration |2008-06-30 | | 50855|Onondaga Cogeneration |2008-06-30 | | 50954|Dan River Inc - Schoolfield Complex |2005-09-30 | | 54088|International Paper-Hudson River |2005-09-30 | | 54089|International Paper-Lock Haven Mill |2002-03-31 | | 54138|Fulton Cogeneration Associates |2008-09-30 | | 54344|ArcelorMittal Weirton,Inc. |2014-09-30 | | 54416|DSM Nutritional Products, Inc |2008-06-30 | | 54425|Project Orange Facility |2010-12-31 | | 54532|Shenango Incorporated |2008-09-30 | | 54571|North East Cogeneration Plant |2017-12-31 | | 54807|Vineland Cogeneration |2004-03-31 | | 55082|Caledonia Power I, LLC |2007-12-31 | | 55209|Brush 4 |2008-03-31 | | 55219|BTEC Southaven, LLC |2008-12-31 | | 55303|Warren Peaking Power Facility |2008-12-31 | | 55683|Boardman Power Holdings LLC |2008-06-30 | | 55703|Memphis Refinery |2017-09-30 | | 55858|Desert Power Plant |2007-09-30 | | 56186|Sheboygan Falls Energy Facility |2007-06-30 | | 70454|Hartwell Energy Facility |2017-12-31 | | 880004|GSA Central Heating |2017-12-31 | | 880006|Trigen Energy Corporation-Edison St |2017-12-31 | | 880007|Honeywell Resins & Chemicals LLC |2017-12-31 | | 880009|Naval Surface Warfare Center |2002-12-31 | | 880013|Solar Turbines |1999-12-31 | | 880016|Bayway Refinery |2008-12-31 | | 880020|Sunoco (R&M) - Marcus Hook |2005-06-30 | | 880021|Pratt & Whitney Willgoos Lab |2003-09-30 | | 880022|Norwich |2006-09-30 | | 880023|Kneeland Station |2017-12-31 | | 880024|Momentive Performance Materials |2017-12-31 | | 880025|ConocoPhillips Co., Trainer Refinery |2017-12-31 | | 880026|UCC South Charleston Plant |2006-06-30 | | 880028|Procter & Gamble Company - Ivorydale |2017-09-30 | | 880029|Marathon Petroleum Company LP |2017-09-30 | | 880030|BP Husky Refining LLC |2017-09-30 | | 880031|The Dow Chemical Company |2017-09-30 | | 880033|Cognis Oleochemicals, LLC |2013-09-30 | | 880038|Catlettsburg Refining, LLC |2009-09-30 | | 880039|Cargill Incorporated |2017-09-30 | | 880041|Ascend (Decatur Plant) |2017-12-31 | | 880042|AK Steel Corporation - Middletown |2017-09-30 | | 880043|Holcim US Inc |2012-12-31 | | 880044|Lafarge Building Materials, Inc. |2016-06-30 | | 880045|University of Michigan |2017-09-30 | | 880049|Bernville Station |2017-09-30 | | 880050|Shermans Dale Station |2017-09-30 | | 880052|Lehigh Northeast Cement Company |2017-09-30 | | 880053|Bayer |2017-09-30 | | 880055|DOE Oak Ridge Y-12 |2009-09-30 | | 880057|INVISTA S.a.r.l. Camden Plant |2017-09-30 | | 880065|Wickliffe Paper Company |2017-09-30 | | 880066|Voridian Columbia Operations |2005-09-30 | | 880067|Wood River Refinery |2017-12-31 | | 880068|Springs Global US - Grace Facility |2008-09-30 | | 880070|Sunoco Inc (R&M) Haverhill Plant |2005-12-31 | | 880071|Armagh Compressor Station |2017-09-30 | | 880072|Entriken Compressor Station |2017-09-30 | | 880074|Marlboro Paper Mill |2017-09-30 | | 880075|BP Amoco Chemical Company |2017-09-30 | | 880076|Lemont Refinery |2017-09-30 | | 880077|Republic Engineered Products-Lorain |2008-09-30 | | 880078|Sonoco Products Company |2013-09-30 | | 880079|Tate & Lyle-Loudon |2017-12-31 | | 880081|DTE Pontiac North LLC |2007-09-30 | | 880083|Lima Refinery |2017-09-30 | | 880086|Aventine Renewable Energy, Inc. |2016-09-30 | | 880087|New Energy Corp |2014-09-30 | | 880088|Marathon Ashland Petroleum LLC |2017-09-30 | | 880089|Flint Hills Resources, - Joliet Plant |2017-09-30 | | 880091|Naval Station Great Lakes |2008-09-30 | | 880092|BP Amoco Chemical Company CR Plant |2017-09-30 | | 880093|Honeywell Intl, Inc Hopewell Plant |2010-09-30 | | 880094|Trigen St. Louis Energy |2008-09-30 | | 880096|Chicago Coke Co., Inc. |2008-12-31 | | 880100|Ravenswood Steam Plant |2017-12-31 | | 880101|RockTenn CP LLC Stevenson Mill |2017-09-30 | | 880107|SPMT Marcus Hook Industrial Complex |2017-12-31 |
zaneselvans commented 5 years ago

Hey @karldw, @cmgosnell is working on creating two EIA 860 tables right now. One is a static entity table, that is the home table for the plant_id_eia values, and stores all of the static plant attributes, including the lat/lon and a derived timezone. The other is an annual plants table that has changeable data associated with the plants. We also had a conversation about how to go about normalizing CEMS -- what columns if any to remove because they're duplicative, like the plant name. Didn't seem like there was a lot that need to be stripped out.

On the missing plant IDs, right now @cmgosnell is harvesting all of the plant IDs mentioned from anywhere in the EIA data, but not the CEMS because it's so big, but with so many more years of data in CEMS it seems like we need to get those unique IDs too. Either by pre-compiling the list from CEMS or maybe by getting the authoritative list of IDs and some basic info about the plants from one of those ID resolver services NEEDS / EPA-EIA etc.

Would you be willing to work with @cmgosnell to figure out the best way to generate/obtain that list of authoritative IDs for the static entity table, and link the CEMS table to it with a foreign key relationship, and use the timezone she's putting in that static table for generating the operating_datetime_utc in the CEMS transform step?

@cmgosnell maybe with these 124 non-860 IDs that are in CEMS, it makes sense to get them into the 860 entity table with at least their state information, since we have that (and maybe the name?) or whatever else is available from NEEDS or the other database, so that a derived timezone can be generated over there using the state as a fallback if the lat/lon aren't available?

cmgosnell commented 5 years ago

On the timezones and latitude and longitude, right now I've made a fair bit of progress with harvesting the fields that should be stable (out of the plants_eia860 table mostly) including latitude and longitude (working on issue #135). The trouble is deciding which year to grab the lat/long from when they are not reported consistently. Right now about 96% of the lat/longs are mostly consistent, but there are still about 350 plants where the lat/long that is reported varies a fair amount.

I plan to make the process all standardized for all of the normalized entity tables but @karldw if it would help you, I can focus on implementing this process for the plant table so you can have the lat/long. It looks like the timezonefinder is simple enough so I could probably just go ahead and add the column to the table. I think I could have it all implemented by the end of the week.

If there are only 124 missing plant id's that don't show up in 860, we could easily add them to the plants entity table, but I expect it would take a little longer than just using the state reported in CEMS to assume the timezone.

karldw commented 5 years ago

I'm happy to help! This isn't urgent, so no need to focus on the plant table if it's easier to work on the rest of the normalization first.

What can I do to help generate a list of authoritative plant IDs?

CCing @lpreonas so he can follow along.

cmgosnell commented 5 years ago

Great, great! I am already using the plants entity table as a template to build from and I'm in the implementation phase right now so I should be able to stay on track to have the lat/long in the plants table soon.

If it isn't urgent, then I think @zaneselvans is on the right track in general in that there seem to be two options: 1) generate a list of missing plant ids w/ any info like name and state from CEMS one time to store somewhere and basically just append this collection to the compiled plant_entity table 2) use the NEEDS ids as the 'complete' list and incorporate all of the ids into the plants_entity_eia table

It seems like the later would be the better option overall because it would be automatically updated. @karldw were you already thinking about using something like NEEDS to get the unit <> smokestack id mapping? I haven't really looked into it at all, but if you think that would be feasible then we could incorporate the ids into the transform step.

karldw commented 5 years ago

That's a good idea. When I was trying to work on that mapping, I was left with a lot of unmatched units, but I'll give it another look.

karldw commented 5 years ago

Another question here: how do you want to handle the case where people load the CEMS data, but not the EIA 860? Some options:

zaneselvans commented 5 years ago

Yeah the 860 stuff is another thing we talked about a little --it's not very big, and it's entangled with just about everything else (except FERC 1, right now), so we decided it would make sense to allow folks to specify which data sources they want to load, but then to override that if a required data source (like 860) hasn't been specified. We can do something similar to what I did with reading the metadata object for creation of database-table-like data packages to identify dependent data sources.

cmgosnell commented 5 years ago

Hey @karldw, I pushed the changes to the plant_entity_eia table so now that is where the lat/long lives for all of the plants (although there were some that we're consistent enough to assign). Let me know if you want to put the timezone in the entity table. I think there is an simple place to slip that into the transform step. If not, I'd be able to do this in a week or so but I need to shift to standardizing the harvest process for all of the entities.

karldw commented 5 years ago

Great! I just pushed (untested) code that starts to implement this: https://github.com/karldw/pudl/commit/91569bb68ca483abeffc16bf574b7553bd46dce8

In that commit, I have a function that pulls out the plant locations and returns a pandas series with timezone names. (Currently the code doesn't work because there are locations that don't parse properly.) But it probably makes more sense to put these timezones in a shared place, rather than just calculate them as part of the CEMS transform step.

If the overall process of finding timezones looks good to you, I can add a timezone column to the entity table.