catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Create exhaustive EIA923 plant info table from spreadsheets #74

Closed by zaneselvans 7 years ago

zaneselvans commented 7 years ago

To populate the plant_info_eia923 table, we need to scrape all of the EIA923 spreadsheets for information about the plants, including their IDs. We need to do this for two reasons: the plant frame tab only goes back as far as 2011 (and there is easy-to-use data for 2009-2010), and not all of the per-plant information is stored in the plant frame tab (e.g. plant regulatory status).

Creating this function will allow us to reconstruct a table like the plant frame tab for all the years, and give us an exhaustive list of plant_ids for the plants_eia923 table. That table is the home table for plant_id, so a plant that is missing from it can't be imported, e.g. plant_id=8809, the "Bent Mountain" plant, which only shows up in the latter part of 2016.
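A rough sketch of what collecting plants across every tab might look like. This is not the actual PUDL code: `collect_plants`, the `pages` structure, and the `plant_name` column are all assumptions for illustration, and the row for plant 3 is made-up sample data. It assumes each tab has already been read into a list of row dicts.

```python
def collect_plants(pages):
    """Union the plants found on every page.

    `pages` maps a page (tab) name to a list of row dicts. The result maps
    each plant_id to the first non-empty name seen and the set of pages
    the plant appeared on.
    """
    plants = {}
    for page_name, rows in pages.items():
        for row in rows:
            plant_id = row.get("plant_id")
            if plant_id is None:
                continue  # skip rows with no plant ID
            entry = plants.setdefault(
                int(plant_id), {"plant_name": None, "pages": set()}
            )
            entry["pages"].add(page_name)
            if entry["plant_name"] is None and row.get("plant_name"):
                entry["plant_name"] = row["plant_name"]
    return plants


# Hypothetical sample rows; only plant 8809 comes from this issue.
pages = {
    "generation_fuel": [{"plant_id": 3, "plant_name": "Example Plant A"}],
    "fuel_receipts_costs": [
        {"plant_id": 3, "plant_name": "Example Plant A"},
        {"plant_id": 8809, "plant_name": "Bent Mountain"},
    ],
}
plants = collect_plants(pages)
```

Keeping track of which pages each plant appeared on makes it easy to spot plants that never show up in generation_fuel.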

swinter2011 commented 7 years ago

No matches. We don't have FERC data yet. There are some plants that could be grouped with other co-located plants in EIA, but that would be a larger project that we'd need to undertake to group all unmatched EIA plants.

zaneselvans commented 7 years ago

Yes, but we still need the EIA plant_id values in the plant_id_eia923 table, and they need to have PUDL ids associated with them. The 2016 EIA plants have been put in the output tab now, but it still looks like we don't have a complete list. E.g. "Bent Mountain" plant_id=8809 appears in the fuel_receipts_costs tab, but didn't make it into the plant output tab, so we're still getting a database integrity error.

It looks like we haven't been pulling in a complete list of the plants. They don't all appear in all of the EIA923 pages, and up until now we've been pulling them from generation_fuel for whatever reason. We probably need two things: an exhaustive list of all the plants in all the tabs (or maybe the plant_frame page is already exhaustive for the years where it exists?), and a way to automatically compare the plants already in the database against the new ones (e.g. whenever a new year's worth of EIA923 data comes out), so that novel plant_ids, names, etc. can be added to the list of canonical EIA plants in the DB.
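The comparison step could be as simple as a set difference. A minimal sketch, assuming the IDs already in the database and the freshly scraped plants are both available in memory; `find_novel_plants` and the sample values (other than plant 8809) are hypothetical, not existing PUDL functions or data.

```python
def find_novel_plants(known_ids, scraped_plants):
    """Return the scraped plants whose IDs are not yet in the database.

    `known_ids` is an iterable of plant_ids already stored in the DB;
    `scraped_plants` maps plant_id to plant name from the spreadsheets.
    """
    novel_ids = set(scraped_plants) - set(known_ids)
    return {pid: scraped_plants[pid] for pid in sorted(novel_ids)}


# Hypothetical state: the DB knows plant 3, the new data adds 8809.
known_ids = {3}
scraped = {3: "Example Plant A", 8809: "Bent Mountain"}
novel = find_novel_plants(known_ids, scraped)
```

Anything returned here would still need a PUDL id assigned before it could go into the canonical plant list.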