Integrate EIA930 Hourly Grid Data

zaneselvans commented 8 months ago

The EIA's Form 930 provides a good sub-regional hourly snapshot of the grid, including demand, generation, the breakdown of supply by energy source, and interchange between balancing authorities.

There are multiple open source pre-processing routines we can build upon, including:

Design Considerations

Should the full BA and subregional time series be stored in the same table, or separate tables?
The EIA-930 data is organized by the alphabetical BA Code, rather than EIA's numerical BA ID, how do we want to manage that?
To what extent are the BA Codes in EIA-930, EIA-860, and EIA-861 compatible with each other?
Given the large number of timeseries records, it's probably not appropriate to create a denormalized output table with tons of duplicated data in it.
For the moment, we are not going to do any outlier detection, imputation of missing values, or reconciliation of the interchange data, but those are obvious things which could be built upon this data as a next step.

- [ ] https://github.com/catalyst-cooperative/pudl/issues/3473
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/295
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3486
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3505
- [x] Create an EIA-930 data source documentation page
- [x] Update Kaggle dataset to include new parquet only outputs

zaneselvans commented 8 months ago

Hey @e-belfer I apologize for not peeking in on the EIA-930 PR before it got merged, but the half_years partition for EIA930 doesn’t follow the same convention as the year_months and year_quarters which we have for other sub-annual datasets, which also reflect the ordering of the parts in the filenames, and the order that would allow easy chronological sorting.

It’s also standard in financial reporting to use e.g. 2023H1 to refer to the first half of 2023, in the same way that 2023Q1 refers to the first quarter of 2023, and we’ve used the single letter Q in our quarterly archives, but are spelling out half in the EIA930 archives.

Would you mind if I went through and applied these naming conventions on my own time?

aesharpe commented 7 months ago

@zaneselvans Looking through the PowerGenome repo and I can't find any references to 930. Where else should I look?

zaneselvans commented 7 months ago

Both @jdechalendar's gridemissions and @grgmiller's OGE have routines that clean up the EIA-930 data.

aesharpe commented 7 months ago

@zaneselvans is there anywhere else in PowerGenome I should look? I was just wondering because it was included in the list above.

aesharpe commented 7 months ago

Also @e-belfer can you elaborate on what you meant by your slack message earlier that OGE and GridEmissions use reconciliation methods we don't plant to use?

zaneselvans commented 7 months ago

@e-belfer is referring to this method that Jacques developed, which is used in both OGE and his gridemissions project to ensure that the interchange between all the BAs doesn't violate conservation of energy (which is not true as the data are reported) while minimizing the changes it makes to the data. We're not doing that right now, but we might give it a shot later for other downstream users.

"Physics-informed data reconciliation framework for real-time electricity and emissions tracking", by Jacques A. de Chalendar and Sally M. Benson. Applied Energy 304 (2021): 117761; DOI; arXiv preprint.

There's some more background on his site here.

If it's not coming up in PowerGenome immediately, I might have gotten confused. It's definitely there in OGE and gridemissions, and OGE is derived from gridemissions.

grgmiller commented 7 months ago

I'll mention that a lot of our code involves some steps that just deal with converting the data back and forth between the format used by gridemissions, since we also use that as a dependency, but we don't necessarily think this should be the format used by the data.

I think the most critical step in our code is the manual adjustments to the 930 data that we make here (https://github.com/singularity-energy/open-grid-emissions/blob/83bdce8f08d6120a3f66f60bdfd79cf9ec42d748/src/oge/eia930.py#L369) to deal with inconsistencies in how the data has been reported. Sometimes EIA retroactively fixes these though so each year we always go back and make sure that these still work.

I'll also note that we are currently incorrectly using Jacques's physics reconciliation code. The physics reconciliation code is meant to reconcile small differences in the balance of power AFTER the timeseries have already been screened and imputed for anomalies, which we do not currently do. A good starting point for this would be Tyler Rugge's work on EIA-930 data screening and imputation, although that work was focused on demand data, and not all filters may be relevant to generation or interchange data.

See also:

aesharpe commented 7 months ago

@grgmiller @zaneselvans I'm thinking about how to format the tables, and I'd like your input.

The subregion and interchange tables each have one field for MW values but the balance table has 37 reported MW fields. These consist of the adjusted, imputed, and reported net generation and demand values by fuel type.

Ex:

'net_generation_all_petroleum_products_adjusted_mw',
'net_generation_all_petroleum_products_imputed_mw',
'net_generation_all_petroleum_products_mw',
'demand_adjusted_mw',
'demand_forecast_mw',
'demand_imputed_mw',
'demand_mw',
'sum_valid_directly_interconnected_balancing_authorities_mw' # rouge non demand/net_gen col

I can think of a few ways to format the data:

1) Keep as is

Pros: simple, each column will get a column description which may help with table comprehensibility.
Cons: wide vs. tall table 2) Column for net generation values, column for demand values, column to specify fuel type (including total), column to specify adjusted/imputed/reported/forecast
Pros: more organized, filterable
Cons: not all categories apply to all fields. Ex: demand has a forecast value and net_generation does not, net_generation has fuel types specified and demand is just total values. There is also a sum_valid_directly_interconnected_balancing_authorities_mw value that do not follow the above pattern. 3) Column for mw values, column containing the name of the old columns (EX: demand_forcast, net_generation_nuclear, sum_valid_directly_interconnected_balancing_authorities)

I think I'm in favor of option 2 right now even though it's not perfect!

Another consideration is compatibility with the other tables. I think the formats should be similar if we decided to keep the tables separate -- which is another question up for debate. I think I'm pro keeping them separate, but I want to hear everyone's opinion.

grgmiller commented 7 months ago

I suppose another option is to break these out into separate tables entirely - like have a net generation table, an interchange table, a demand table, etc. Or structure it similar to how the API queries/responses are structured: https://www.eia.gov/opendata/browser/electricity/rto/region-data

aesharpe commented 7 months ago

@grgmiller That's a good point! Looking at the OGE code, it seems like you combine the interchange and balance tables in a long format and leave out the subregion table. Just flagging this so we can keep in mind as we consider different formats.

Do you have feelings about combining the the balance/subregion tables into one vs. keeping them separate?

grgmiller commented 7 months ago

No real preference here, I haven't ever really worked with the subregional demand data so not sure how it's structured.

aesharpe commented 7 months ago

For context, the tables referenced by the EIA API are essentially:

The balance table without the by-fuel-type net_gen values
The by-fuel-type net gen values from the balance table
The subregion table
The interchange table
The balance table without the by-fuel-type net_gen values aggregated by day

I think we want to reduce data duplication so we probably won't want the day-level aggregation table. We also probably won't need net_gen totals in the balance table if we have the net_gen by-fuel subcomponents.

zaneselvans commented 7 months ago

In the wide format that's most like the original data, what are the primary keys? Is it just the timestamp and the Balancing Authority code?

For most of our time series (CEMS, FERC-714, the new GridPath renewables profiles) we've gone with a format that has a timestamp, along with ID columns (e.g. respondent ID, or plant and generator IDs) as the primary key, and then a number of different value columns. I think there's a case to made that the "wide" format here (option 1) also follows this pattern, even though in this case the data columns mostly have the same units (MW)

If we reshape it to create a number of categorical columns (like fuel type and adjusted/forecasted/reported/imputed) but those categorical columns don't apply to all of the data columns, then we'll end up with NA values in the primary key, which creates problems, and we might want to break it out into multiple tables in that case, such that only columns which are associated with the same set of categories are grouped together in a table, but if there are lots of different combinations of categorical values which pertain to their own set of columns that seems overly complex.

So for the moment I think I might support option 1.

IIRC a lot of the fixes were being applied by timestamp + BA code, in which case those fixes would be more directly applicable to the wide format?

zaneselvans commented 7 months ago

Agree that we probably don't want to store aggregations (daily totals, or all-fuel totals) in the core table. If those are desirable we can produce them in an output table.

aesharpe commented 7 months ago

what are the primary keys? Is it just the timestamp and the Balancing Authority code?

Yes, balancing_authority_code_eia and report_datetime_utc are the primary keys

zaneselvans commented 7 months ago

If the subregion data has the same structure as the whole balancing area data with a timestamp + region code as the primary key, and is reporting a value which is also reported in by the balancing areas, then putting them in the same table seems reasonable. We did this with the renewable generation profiles since the wind, solar, and hybrid storage profiles all had exactly the same structure, and you can easily select which series or type of series you want based on the ID columns.

aesharpe commented 7 months ago

If we do this (combined balance and subregion tables), we could hypothetically remove the BA level demand data because it would be duplicative. However, we would have to add subregion as another primary key and then all the total values would be NA which would introduce NA primary keys (an issue you mentioned above).

zaneselvans commented 7 months ago

The balancing area level data feels more "primary" to me than the totals by fuel or by day. It also covers a longer time period than the subregional data, and it would be nice to have the continuous timeseries for the balancing areas from 2015 to the present. So I would lean toward keeping all the years of balancing area data alongside the subregional data for the years in which it is available.

aesharpe commented 7 months ago

How do you recommend we combine the tables without NA primary keys? We could just have a sub region called "total"

zaneselvans commented 7 months ago

I was imagining that in the early years, there just wouldn't be any records associated with the subregions.

zaneselvans commented 7 months ago

Iteration with Ana/Elaine another users on what further cleaning to do will happen in another issue.

catalyst-cooperative / pudl

Integrate EIA930 Hourly Grid Data #3466

Design Considerations