Closed zaneselvans closed 7 months ago
Hey @e-belfer I apologize for not peeking in on the EIA-930 PR before it got merged, but the half_years
partition for EIA930 doesn’t follow the same convention as the year_months
and year_quarters
which we have for other sub-annual datasets, which also reflect the ordering of the parts in the filenames, and the order that would allow easy chronological sorting.
It’s also standard in financial reporting to use e.g. 2023H1
to refer to the first half of 2023, in the same way that 2023Q1
refers to the first quarter of 2023, and we’ve used the single letter Q in our quarterly archives, but are spelling out half in the EIA930 archives.
Would you mind if I went through and applied these naming conventions on my own time?
@zaneselvans Looking through the PowerGenome repo and I can't find any references to 930. Where else should I look?
Both @jdechalendar's gridemissions and @grgmiller's OGE have routines that clean up the EIA-930 data.
@zaneselvans is there anywhere else in PowerGenome I should look? I was just wondering because it was included in the list above.
Also @e-belfer can you elaborate on what you meant by your slack message earlier that OGE and GridEmissions use reconciliation methods we don't plant to use?
@e-belfer is referring to this method that Jacques developed, which is used in both OGE and his gridemissions project to ensure that the interchange between all the BAs doesn't violate conservation of energy (which is not true as the data are reported) while minimizing the changes it makes to the data. We're not doing that right now, but we might give it a shot later for other downstream users.
"Physics-informed data reconciliation framework for real-time electricity and emissions tracking", by Jacques A. de Chalendar and Sally M. Benson. Applied Energy 304 (2021): 117761; DOI; arXiv preprint.
There's some more background on his site here.
If it's not coming up in PowerGenome immediately, I might have gotten confused. It's definitely there in OGE and gridemissions, and OGE is derived from gridemissions.
I'll mention that a lot of our code involves some steps that just deal with converting the data back and forth between the format used by gridemissions, since we also use that as a dependency, but we don't necessarily think this should be the format used by the data.
I think the most critical step in our code is the manual adjustments to the 930 data that we make here (https://github.com/singularity-energy/open-grid-emissions/blob/83bdce8f08d6120a3f66f60bdfd79cf9ec42d748/src/oge/eia930.py#L369) to deal with inconsistencies in how the data has been reported. Sometimes EIA retroactively fixes these though so each year we always go back and make sure that these still work.
I'll also note that we are currently incorrectly using Jacques's physics reconciliation code. The physics reconciliation code is meant to reconcile small differences in the balance of power AFTER the timeseries have already been screened and imputed for anomalies, which we do not currently do. A good starting point for this would be Tyler Rugge's work on EIA-930 data screening and imputation, although that work was focused on demand data, and not all filters may be relevant to generation or interchange data.
See also:
@grgmiller @zaneselvans I'm thinking about how to format the tables, and I'd like your input.
The subregion
and interchange
tables each have one field for MW values but the balance
table has 37 reported MW fields. These consist of the adjusted, imputed, and reported net generation and demand values by fuel type.
Ex:
'net_generation_all_petroleum_products_adjusted_mw',
'net_generation_all_petroleum_products_imputed_mw',
'net_generation_all_petroleum_products_mw',
'demand_adjusted_mw',
'demand_forecast_mw',
'demand_imputed_mw',
'demand_mw',
'sum_valid_directly_interconnected_balancing_authorities_mw' # rouge non demand/net_gen col
I can think of a few ways to format the data:
1) Keep as is
demand
has a forecast value and net_generation
does not, net_generation
has fuel types specified and demand
is just total values. There is also a sum_valid_directly_interconnected_balancing_authorities_mw
value that do not follow the above pattern.
3) Column for mw values, column containing the name of the old columns (EX: demand_forcast
, net_generation_nuclear
, sum_valid_directly_interconnected_balancing_authorities
)I think I'm in favor of option 2 right now even though it's not perfect!
Another consideration is compatibility with the other tables. I think the formats should be similar if we decided to keep the tables separate -- which is another question up for debate. I think I'm pro keeping them separate, but I want to hear everyone's opinion.
I suppose another option is to break these out into separate tables entirely - like have a net generation table, an interchange table, a demand table, etc. Or structure it similar to how the API queries/responses are structured: https://www.eia.gov/opendata/browser/electricity/rto/region-data
@grgmiller That's a good point! Looking at the OGE code, it seems like you combine the interchange and balance tables in a long format and leave out the subregion table. Just flagging this so we can keep in mind as we consider different formats.
Do you have feelings about combining the the balance/subregion tables into one vs. keeping them separate?
No real preference here, I haven't ever really worked with the subregional demand data so not sure how it's structured.
For context, the tables referenced by the EIA API are essentially:
balance
table without the by-fuel-type net_gen valuesbalance
tablesubregion
tableinterchange
tablebalance
table without the by-fuel-type net_gen values aggregated by dayI think we want to reduce data duplication so we probably won't want the day-level aggregation table. We also probably won't need net_gen totals in the balance
table if we have the net_gen by-fuel subcomponents.
In the wide format that's most like the original data, what are the primary keys? Is it just the timestamp and the Balancing Authority code?
For most of our time series (CEMS, FERC-714, the new GridPath renewables profiles) we've gone with a format that has a timestamp, along with ID columns (e.g. respondent ID, or plant and generator IDs) as the primary key, and then a number of different value columns. I think there's a case to made that the "wide" format here (option 1) also follows this pattern, even though in this case the data columns mostly have the same units (MW)
If we reshape it to create a number of categorical columns (like fuel type and adjusted/forecasted/reported/imputed) but those categorical columns don't apply to all of the data columns, then we'll end up with NA values in the primary key, which creates problems, and we might want to break it out into multiple tables in that case, such that only columns which are associated with the same set of categories are grouped together in a table, but if there are lots of different combinations of categorical values which pertain to their own set of columns that seems overly complex.
So for the moment I think I might support option 1.
IIRC a lot of the fixes were being applied by timestamp + BA code, in which case those fixes would be more directly applicable to the wide format?
Agree that we probably don't want to store aggregations (daily totals, or all-fuel totals) in the core table. If those are desirable we can produce them in an output table.
what are the primary keys? Is it just the timestamp and the Balancing Authority code?
Yes, balancing_authority_code_eia
and report_datetime_utc
are the primary keys
If the subregion data has the same structure as the whole balancing area data with a timestamp + region code as the primary key, and is reporting a value which is also reported in by the balancing areas, then putting them in the same table seems reasonable. We did this with the renewable generation profiles since the wind, solar, and hybrid storage profiles all had exactly the same structure, and you can easily select which series or type of series you want based on the ID columns.
If we do this (combined balance
and subregion
tables), we could hypothetically remove the BA level demand data because it would be duplicative. However, we would have to add subregion as another primary key and then all the total values would be NA which would introduce NA primary keys (an issue you mentioned above).
The balancing area level data feels more "primary" to me than the totals by fuel or by day. It also covers a longer time period than the subregional data, and it would be nice to have the continuous timeseries for the balancing areas from 2015 to the present. So I would lean toward keeping all the years of balancing area data alongside the subregional data for the years in which it is available.
How do you recommend we combine the tables without NA primary keys? We could just have a sub region called "total"
I was imagining that in the early years, there just wouldn't be any records associated with the subregions.
Iteration with Ana/Elaine another users on what further cleaning to do will happen in another issue.
The EIA's Form 930 provides a good sub-regional hourly snapshot of the grid, including demand, generation, the breakdown of supply by energy source, and interchange between balancing authorities.
There are multiple open source pre-processing routines we can build upon, including:
Design Considerations