catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Request: EIA 861 data 1990-2000 #923

Open karldw opened 3 years ago

karldw commented 3 years ago

Is your feature request related to a problem? Please describe.

I'm using EIA-861 for a research project, and it would be convenient if I could use the PUDL-compiled version. I'd like data that goes as far back as possible. Currently the 861 files from 1990 through 2000 are not included.

I realize this old dataset might be a bit niche, and therefore not a priority. As the docs mention, only operational data and sales are available that far back (tables operational_data_eia861 and sales_eia861).

Describe the solution you'd like

Ideally, I'd like to have the full history of eia861 data included in the database.

Describe alternatives you've considered

I could also work with the data myself, outside of PUDL.

Additional context

EIA added a new, reformatted version of the old EIA 861 files (1990 through 2011; they kept the old format too). The new version of the old files more closely matches the formatting of the files from 2012 and later, so extraction might be easier. The change happened sometime between April 18 and May 9, 2020. As far as I can tell, the change did not affect the pudl-scraper code because it takes the first available ZIP. The reformatted version is listed second.

karldw commented 3 years ago

I think I can write a PR for this. Would that be useful?

Here's my understanding of the things that need to change:

zaneselvans commented 3 years ago

@aesharpe is the one who compiled all the eia861 metadata mapping the spreadsheets to database tables / columns. She might be able to help describe what needs to happen there and if there's any weirdness to watch out for, and have opinions on this idea overall.

In general I think we're always up for including more years of the data we've already integrated, so long as it doesn't create a lot of additional maintenance overhead.

I imagine that the reformatted version of the data would be preferable since it would probably reduce the effort required to map it all into a uniform structure. But this would require some changes to the spider to make sure that it always grabs the reformatted data in the years when it's available. And then of course it would also require rejiggering the metadata that's already been compiled for 2001-2011. They just added the reformatted files after we had already set this process up.
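As a rough illustration of the spider change described above, here is a minimal sketch of the selection rule. It is not the actual pudl-scraper code: `pick_zip_link`, `REFORMATTED_YEARS`, and the file names in the usage lines are hypothetical, and the sketch assumes only what this thread states, namely that the reformatted ZIP is listed second for 1990 through 2011.

```python
from typing import Optional, Sequence

# Years for which both an original and a reformatted ZIP are published,
# per the issue description (1990 through 2011).
REFORMATTED_YEARS = range(1990, 2012)


def pick_zip_link(year: int, zip_links: Sequence[str]) -> Optional[str]:
    """Choose which ZIP link to download for one year (hypothetical helper).

    Assumes zip_links preserves the order of links on the page and that,
    when a year has two links, the reformatted one is listed second.
    """
    if not zip_links:
        return None
    if year in REFORMATTED_YEARS and len(zip_links) > 1:
        # Prefer the reformatted file when it is available.
        return zip_links[1]
    # Otherwise keep the existing behavior: take the first available ZIP.
    return zip_links[0]


# Placeholder file names, for illustration only.
chosen_1998 = pick_zip_link(1998, ["original_1998.zip", "reformatted_1998.zip"])
chosen_2015 = pick_zip_link(2015, ["only_2015.zip"])
```

Relying on link position is fragile. If the reformatted files turn out to have a recognizable name or link text, matching on that would probably be more robust.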

If you're up for getting this into a PR I think that would be wonderful. @cmgosnell @aesharpe do you have thoughts?

karldw commented 3 years ago

Interestingly, the data are not directly equivalent. There are some fields that show up in the reformatted data but not the original data, and vice versa. In general, the reformatted data availability basically matches what's written on the current EIA website, but the original data sometimes doesn't.
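One way to make this kind of comparison systematic is a plain set difference over the column names of the two versions of a table. The sketch below is illustrative only: `compare_columns` is a hypothetical helper, and the column lists are made-up examples rather than the real mappings.

```python
def compare_columns(original_cols, reformatted_cols):
    """Report which column names appear in only one version of a table."""
    original = set(original_cols)
    reformatted = set(reformatted_cols)
    return {
        "only_in_original": sorted(original - reformatted),
        "only_in_reformatted": sorted(reformatted - original),
        "in_both": sorted(original & reformatted),
    }


# Made-up column names, for illustration only.
report = compare_columns(
    original_cols=["utility_id", "state", "short_form"],
    reformatted_cols=["utility_id", "state", "data_year"],
)
```

Running this per table and per year would give a compact list of discrepancies to share with EIA.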

Here are some issues I've run across as I've worked on the column mappings.

So far, I've changed the file mapping to reflect the changes in data availability and column position, but haven't changed the set of variables. Let me know if you all have thoughts!

Edit: made the list above more detailed, organized by table.

zaneselvans commented 3 years ago

Hmm. That seems bad. Maybe we should reach out to someone at EIA about this, since what you're describing is in direct conflict with what they're saying about the reformatted data:

Files were reformatted for the years 1990–2011. No data were changed or updated. The files were reformatted for ease of use and to match the format and titles of the current files.

I wonder how much error checking they did in the reformatting process.

zaneselvans commented 3 years ago

Wow that's a lot of differences. When you are referring to the original data, where are you getting it from? Are you comparing against the Zenodo archive of the 2001-2011 data, which is what our mapping would correspond to? Or are you looking at the currently available files from the EIA website, which could certainly be different than they were when we scraped that data previously.

karldw commented 3 years ago

I'm mainly comparing against the existing column mapping CSVs. I haven't checked whether the currently available data are different than the Zenodo archive.

aesharpe commented 3 years ago

Karl, in the delineation of differences above, do the bullets pertain to the new or old files? For example:

advanced_metering_infrastructure_eia861 table (old, new) short_form variable dropped from 2007-2011

Does this mean that the short_form column was dropped in the old or the new file?

karldw commented 3 years ago

Ah, sorry. I meant the short_form column existed in the old files, but not the reformatted ones. More specifically, it was in the column mapping for the old files, but the column is absent for those years in the new files.

aesharpe commented 3 years ago

Ah ok, thanks for clarifying. This is indeed strange! I'm hopeful that EIA can provide some clarification here, and then we can make a more informed decision about which file format to use for the 2001-2011 data.

karldw commented 3 years ago

As I'm adding the column mapping for the old utility data, I'm noting that NERC regions have changed since 1990.

  1. Is it okay to follow successor NERC regions? For example, MAPP became MRO.
  2. How do you want to handle merges or splits? MAAC, ECAR, and MAIN merged to become RFC, and FRCC split out of SERC.
  3. We don't have a variable for ASCC (Alaska).

karldw commented 3 years ago

In the revised data, Utility_Data_2006.xlsx has two columns called "Retail Marketing", and none for wholesale marketing. Nearby years have wholesale marketing and retail marketing in those same column positions, so I'm going to assume it's the same for 2006.
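The positional assumption can be written down explicitly. In this minimal sketch, any header that appears more than once is replaced by the header found at the same position in a neighboring year. `fix_duplicate_headers` is a hypothetical helper, and the column order in the example is illustrative, not copied from the spreadsheets.

```python
from collections import Counter


def fix_duplicate_headers(headers, reference_headers):
    """Replace repeated header names using a reference year, by position."""
    if len(headers) != len(reference_headers):
        raise ValueError("header lists must have the same length")
    counts = Counter(headers)
    # Keep unique names as they are; for any name that occurs more than
    # once, trust the reference year's name at that column position.
    return [
        reference if counts[name] > 1 else name
        for name, reference in zip(headers, reference_headers)
    ]


# Illustrative headers; the real column order is an assumption here.
headers_2006 = ["Utility Number", "Retail Marketing", "Retail Marketing"]
headers_nearby = ["Utility Number", "Wholesale Marketing", "Retail Marketing"]
fixed_2006 = fix_duplicate_headers(headers_2006, headers_nearby)
```

This only holds if the two years really share the same column layout, which is the assumption being made for 2006.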

aesharpe commented 3 years ago

Because the data are annual, we aren't doing any NERC region mapping/merging. Each year reports the NERC regions that existed at the time. It might be helpful, however, to create another stand-alone table that depicts the relationship between NERC regions over time.
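A stand-alone lineage table of that kind could be as small as a list of predecessor/successor pairs. The sketch below includes only the relationships mentioned in this thread. The `kind` labels and the helper names are illustrative assumptions, not an existing PUDL table, and no effective dates are included because none are given here.

```python
# (predecessor, successor, kind) rows, limited to the changes named in
# this thread. The "kind" labels are illustrative.
NERC_LINEAGE = [
    ("MAPP", "MRO", "successor"),
    ("MAAC", "RFC", "merge"),
    ("ECAR", "RFC", "merge"),
    ("MAIN", "RFC", "merge"),
    ("SERC", "FRCC", "split"),  # SERC itself continues to exist
]


def successors(region):
    """Regions that the given region became, merged into, or split off."""
    return sorted({succ for pred, succ, _ in NERC_LINEAGE if pred == region})


def predecessors(region):
    """Regions that the given region was formed from."""
    return sorted({pred for pred, succ, _ in NERC_LINEAGE if succ == region})


rfc_sources = predecessors("RFC")
mapp_targets = successors("MAPP")
```

Keeping this separate from the annual tables leaves each year's reported region untouched while still letting users trace a region through time.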

Regarding the Utility_Data_2006.xlsx file: that sounds like a reasonable assumption.

I went ahead and took a closer look at some of the issues you were having with the new formatting and found the following:

zaneselvans commented 3 years ago

@aesharpe it sounds like there were a couple of minor column mapping issues on our part, but that mostly these discrepancies between the spreadsheets which we had previously mapped and the ones that are currently available and labeled "reformatted" are due to changes in the spreadsheet contents (not just format) that were done by EIA. Is that correct?

I'm not sure what the table-specific "territory" data contains, but we have integrated the service_territory_eia861 table and cleaned it up so we can add actual county FIPS codes, and then also have routines for compiling individual utility territories into balancing authority territories.

karldw commented 3 years ago

I'm not sure what the table-specific "territory" data contains, but we have integrated the service_territory_eia861 table and cleaned it up so we can add actual county FIPS codes, and then also have routines for compiling individual utility territories into balancing authority territories.

The table-specific *_territories_eia861 data contains the same information as the non-territories equivalent, but for US territories American Samoa (AS), Guam (GU), Northern Mariana Islands (MP), Puerto Rico (PR), and U.S. Virgin Islands (VI).

zaneselvans commented 3 years ago

Ooooooh, that kind of territory. Got it. It's always seemed weird to me that those are broken out separately. If they have the same data structure, and their own state/territory code, why not put them all in the same tables?

karldw commented 3 years ago

I'm running into issues where the data values don't match up between the original and reformatted data. For instance, in the 1998 reformatted data, AEP Generating Co (EIA ID 434) has summer and winter peak demand of 3175 and 2117 MW (cols D and E of Operational_Data_1998.xlsx). In the original data (861TYP1.xls), the same utility has a summer peak load of 1,309,000 (column AE) and a winter peak load of 1,306,000 (column Z).

  1. I'm guessing this is partly a unit difference: kW in the original data, MW in the reformatted.
  2. Even with unit differences, the numbers are quite different. How do you want to handle this?

Other variables, like annual net generation, are the same (8,723,172 MWh for AEP).
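A quick arithmetic check on the AEP figures shows why a kW-to-MW conversion alone does not reconcile them. This is a minimal sketch: the kW unit for the original file is the guess made above, not a confirmed fact, and `explained_by_units` is a hypothetical helper.

```python
import math

KW_PER_MW = 1000


def explained_by_units(original_kw, reformatted_mw, rel_tol=0.01):
    """True if the two values agree once the original is converted to MW.

    Assumes the original value is in kW, which is only a guess.
    """
    return math.isclose(original_kw / KW_PER_MW, reformatted_mw, rel_tol=rel_tol)


# AEP Generating Co (EIA ID 434), 1998, values quoted in this thread.
summer_ok = explained_by_units(1_309_000, 3175)  # 1309 MW vs 3175 MW
winter_ok = explained_by_units(1_306_000, 2117)  # 1306 MW vs 2117 MW

# Implied scale factors; a pure kW/MW difference would give 1000 for both.
summer_ratio = 1_309_000 / 3175
winter_ratio = 1_306_000 / 2117
```

Neither check passes, and the two implied ratios (roughly 412 and 617) differ from each other, so no single scale factor maps one set of values onto the other.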

zaneselvans commented 3 years ago

If you look at several different years for the same utility, is it clear which numbers are right? Are the unit reporting differences uniform across the whole dataset, or is it only a few utilities that are using the wrong units? In the FERC 1 data, we handle unit reporting errors like kW vs. MW or lbs vs. tons in the transform step, identifying numbers that look like they are clearly off by 1000x or 2000x etc., but it's quite messy. Seems like another thing we need to bring up with EIA.

karldw commented 3 years ago

Just a quick update: I won't be able to work on this for a few weeks, but I'll take a look in April.