Open karldw opened 3 years ago
I think I can write a PR for this. Would that be useful?
Here's my understanding of the things that need to change:
constants.py
--
) in the docssrc/pudl/package_data/meta/xlsx_maps/eia861/
(-1
for most of the tables, since the data aren't there)@aesharpe is the one who compiled all the eia861 metadata mapping the spreadsheets to database tables / columns. She might be able to help describe what needs to happen there and if there's any weirdness to watch out for, and have opinions on this idea overall.
In general I think we're always up for including more years of the data we've already integrated, so long as it doesn't create a lot of additional maintenance overhead.
I imagine that the reformatted version of the data would be preferable since it would probably reduce the effort required to map it all into a uniform structure. But this would require some changes to the spider to make sure that it always grabs the reformatted data in the years when it's available. And then of course it would also require rejiggering the metadata that's already been compiled for 2001-2011. They just added the reformatted files after we had already set this process up.
If you're up for getting this into a PR I think that would be wonderful. @cmgosnell @aesharpe do you have thoughts?
Interestingly, the data are not directly equivalent. There are some fields that show up in the reformatted data but not the original data, and vice versa. In general, the reformatted data availability basically matches what's written on the current EIA website, but the original data sometimes doesn't.
Here are some issues I've run across as I've worked on the column mappings.
advanced_metering_infrastructure_eia861
table (old, new)
short_form
variable dropped from 2007-2011balancing_authority_eia861
table (old, new)
utility_name_eia
variable added to 2001, 2006-2011balancing_authority_name_eia
("Control Area Operator") is available in the utility data file 1995-2000demand_side_management_eia861
table (old, new)
sales_to_ultimate_consumers_mwh
, sales_for_resale_mwh
and nerc_region
variables dropped for 2001-2010entity_type
variable dropped 2001-2004other_
* variables were dropped 2001-2011transportation_
* variables were added 2001-2011demand_side_management
, reported_as_another_company
, energy_savings_independently_verified
, energy_savings_estimates_independently_verified
, major_program_changes
, price_responsive_programs
, time_responsive_programs
, and data_status
variables were removed for the years they appeared (mostly 2007-2011)green_pricing_eia861
table (old, new)
residential_rec_revenue
, commercial_rec_revenue
, industrial_rec_revenue
, transportation_rec_revenue
, total_rec_revenue
, residential_rec_sales_mwh
, commercial_rec_sales_mwh
, industrial_rec_sales_mwh
, transportation_rec_sales_mwh
, and total_rec_sales_mwh
variables were removed for 2010-2011, the only the years they appearedother_customers
variable was removed for 2003-2006distributed_generation_eia861
table: available 2004-2015 in the original (e.g. file6.xls
here), 2010-2015 in the reformatted.
mergers_eia861
table (old, new)
entity_type
and zip_code_4
variables were removed for all years they appeared (2007-2011)net_metering_eia861
table (old, new)
*_chp_cogen_*
variables were removed (previously available in 2010-2011)operational_data_eia861
table (old, new)
wholesale_power_purchases_mwh
and consumed_by_respondent_without_charge_mwh
variables filled in for 1999-2000consumed_by_facility_mwh
variable removed for 2001 and added for 2006-2011 (this seems weird; probably worth checking I didn't make a mistake)unbundled_revenue
variable removed (previously available 2006-2011)delivery_customers_revenue
variable filled in 2007-2011transmission_revenue
variable filled in 2003-2010data_observed
variable dropped 2001-2002sales_eia861
table (old, new)
report_year
variable added for 1999-2000transportation_sales_revenue
, transportation_sales_mwh
, and transportation_sales_customers
removed for 1999-2002 (possibly earlier; the old mapping only goes back to 1999)other_sales_revenue
, other_sales_mwh
, and other_sales_customers
added for 2001-2002 service_territory_eia861
table removed for 1999-2000utility_data_eia861
table (old, new)
operates_generating_plant
variable removed for 2007-2011 (all years in the existing map)advanced_metering_infrastructure_territories_eia861
for 2007-2010sales_territories_eia861
for 1999-2007 (oddly, the reformatted data has sales_territories_eia861
for every year except 1998.)utility_data_territories_eia861
for 2007-2011So far, I've changed the file mapping to reflect the changes in data availability and column position, but haven't changed the set of variables. Let me know if you all have thoughts!
Edit: made the list above more detailed, organized by table.
Hmm. That seems bad. Maybe we should reach out to someone at EIA about this, since what you're describing is in direct conflict with what they're saying about the reformatted data:
Files were reformatted for the years 1990–2011. No data were changed or updated. The files were reformatted for ease of use and to match the format and titles of the current files.
I wonder how much error checking they did in the reformatting process.
Wow that's a lot of differences. When you are referring to the original data, where are you getting it from? Are you comparing against the Zenodo archive of the 2001-2011 data, which is what our mapping would correspond to? Or are you looking at the currently available files from the EIA website, which could certainly be different than they were when we scraped that data previously.
I'm mainly comparing against the existing column mapping CSVs. I haven't checked whether the currently available data are different than the Zenodo archive.
Karl, in the delineation of differences above, do the bullets pertain to the new or old files? For example:
advanced_metering_infrastructure_eia861 table (old, new) short_form variable dropped from 2007-2011
Does this mean that the short_form column was dropped in the old or the new file?
Ah, sorry. I meant the short_form
column existed in the old files, but not the reformatted ones. More specifically, it was in the column mapping for the old files, but the columns is absent for those years in the new files.
Ah ok, thanks for clarifying. This is indeed strange! I'm hopeful that EIA can provide some clarification here, and then we can make a more informed decision about which file format to use for the 2000-2011 data.
As I'm adding the column mapping for the old utility data, I'm noting that NERC regions have changed since 1990.
In the revised data, Utility_Data_2006.xlsx has two columns called "Retail Marketing", and none for wholesale marketing. Nearby years have wholesale marketing and retail marketing in those same column positions, so I'm going to assume it's the same for 2006.
Because the data are annual, we aren't doing any NERC region mapping/merging. Each year reports the NERC regions that existed at the time. It might be helpful, however, to create another stand-alone table that depicted the relationship between NERC regions over time.
Regarding the utility_data_2006.xlsx file--that sounds like a reasonable assumption
I went ahead and took a closer look at some of the issues you were having with the new formatting and found the following:
advanced_metering_infrastructure_eia861
short_form
getting dropped from 2007-2011 was the result of a mapping error. The values for entity_type
between 2007-2011 should have been assigned to short_form
instead. balancing_authority_eia861
utility_name_eia
field to the records from 2001 and 2006-2011 to be useful I suppose. It was strange without them.demand_side_management_eia861
sales_to_ultimate_customers_mwh
and sales_for_resale_mwh
and nerc_region
were dropped for 2001-2009 (they were already gone by 2010). The only reason I can think of why is that they added a whole new sales_to_ultimate_customers
file that had a more granular breakdown of these sales. I did not run any calculations to check if the totals matched up.*_other
columns became the *_transportation
columns.demand_side_management
column is essentially meaningless (all the same), reported_as_another_company
, *_independently_verified
, and major_program_changes
are not null but are sparsely used, price_responsive_programs
and time_responsive_programs
boolean columns are unnecessary when there is a res/com/ind/trans breakdown to indicate whether they existed.green_pricing_eia861
*_rec_revenue
and *_rec_sales_mwh
were removed in 2011 because all values were either 0 or NA. 2010 was not completely NA, but it was likely removed as it would otherwise stand alone.other_customers
was removed for 2003-2006 because all values were 0 or NA.mergers_eia861
entity_type
and zip_code_4
were not completely NA, but were likely removed due to their being relatively inconsequential.net_metering_eia861
green_pricing_eia861
table.*_chp_cogen
values where all 0 or NA in 2011 and mostly 0 or NA in 2010.operational_data_eia861
wholesale_power_purchases_mwh
or consumed_by_respondent_without_charge_mwh
in the 2000 data.consumed_by_facility_mwh
in 2006-2011, but I did notice that it was removed for 2001-2005, likely due to it being all 0 or NA, except in 2001.unbundled_revenue
and delivery_customers_revenue
are the same fields and the mapping ought to change to reflect that.transmission_revenue
added from 2003-2010 is all NAdata_observed
field was dropped for 2001-2002 likely because all values were the same (observed vs. imputed).sales_eia861
report_year
was added for 1999 to 2000! I have no idea why it wasn't there in the first place.transportation_*
fields were called other_*
service_territory_eia861
utility_data_eia861
operates_generating_plant
field was removed for 2007-2011 (all years). It appears to closely, but not exactly, mirror the field generation_activity
so perhaps it was considered redundant.@aesharpe it sounds like there were a couple of minor column mapping issues on our part, but that mostly these discrepancies between the spreadsheets which we had previously mapped and the ones that are currently available and labeled "reformatted" are due to changes in the spreadsheet contents (not just format) that were done by EIA. Is that correct?
I'm not sure what the table-specific "territory" data contains, but we have integrated the service_territory_eia861
table and cleaned it up so we can add actual county FIPS codes, and then also have routines for compiling individual utility territories into balancing authority territories.
I'm not sure what the table-specific "territory" data contains, but we have integrated the service_territory_eia861 table and cleaned it up so we can add actual county FIPS codes, and then also have routines for compiling individual utility territories into balancing authority territories.
The table-specific *_territories_eia861
data contains the same information as the non-territories equivalent, but for US territories American Samoa (AS), Guam (GU), Northern Mariana Islands (MP), Puerto Rico (PR), and U.S. Virgin Islands (VI).
Ooooooh, that kind of territory. Got it. It's always seemed weird to me that those are broken out separately. If they have the same data structure, and their own state/territory code, why not put them all in the same tables?
I'm running into issues where the data values don't match up between the original and reformatted data. For instance, in the 1998 reformatted data, AEP Generating Co (EIA ID 434) has summer and winter peak demand of 3175 and 2117 MW (cols D and E of Operational_Data_1998.xlsx). In the original data (861TYP1.xls), the same utility has a summer peak load of 1,309,000 (column AE) and a winter peak load of 1,306,000 (column Z).
Other variables, like annual net generation, are the same (8,723,172 MWh for AEP).
If you look at several different years for the same plant is it clear what numbers are right? Are the unit reporting differences uniform across the whole dataset, or is it only few utilities that are using the wrong units? In the FERC 1 data, we handle unit reporting errors like kW vs. MW or lbs vs tons in the transform step, identifying numbers that look like they are clearly off by 1000x or 2000x etc. but it's quite messy. Seems like another thing we need to bring up with EIA.
Just a quick update: I won't be able to work on this for a few weeks, but I'll take a look in April.
Is your feature request related to a problem? Please describe.
I'm using EIA-861 for a research project, and it would be convenient if I could use the PUDL-compiled version. I'd like data that goes as far back as possible. Currently the 861 files from 1990 through 2000 are not included.
I realize this old dataset might be a bit niche, and therefore not a priority. As the docs mention, only operational data and sales are available that far back (tables
operational_data_eia861
andsales_eia861
).Describe the solution you'd like
Ideally, I'd like to have the full history of eia861 data included in the database.
Describe alternatives you've considered
I could also work with the data myself, outside of PUDL.
Additional context
EIA added a new, reformatted version of the old EIA 861 files (1990 through 2011; they kept the old format too). The new version of the old files more closely matches the formatting of the files from 2012 and later, so extraction might be easier. The change happened sometime between April 18 and May 9, 2020. As far as I can tell, the change did not affect the pudl-scraper code because it takes the first available ZIP. The reformatted version is listed second.