zaneselvans opened this issue 1 year ago
I'm in a place to pick this up with consistent attention. For starters, the "bulk" data isn't a more usable form for ingesting EIA-176 data and related data. That leaves what I've come to think of as the "bundled" data as the source for this. (At some point I'd like more insight on how developers have discovered these data endpoints.) For next steps, I plan to put together a basic outline of extraction. That will inform the data model, potentially any additional requirements, and lead into the other tasks here. It looks like I can follow preexisting patterns of form-specific extraction logic in `pudl/extract`, e.g., `eia860.py`, `eia923.py`.
The "bulk" data doesn't contain the company-level information from form EIA-176. Searching for a handful of attributes corresponding to the bundled EIA-176 company data (`all_company_176.csv`) turned up no results, e.g. 17600032KS, ABBYVILLE, 17600033IA, MOULTON, VILLAGE. Trying to parse out aggregates of EIA-176 data looks like it would be gnarly, to the extent that it's even feasible or of much value.
On the whole the bulk data is pretty disorganized. It comprises many different types of series and no single column exhibits low-cardinality values that would easily separate them. Ideally I'd expect one or more column(s) clearly indicating withdrawals vs receipts, etc. To get those clean attributes/dimensions, one could potentially parse them from the "description" field or the "series_id" field once there's a clear mapping of the components of "series_id". (The "series_id" components clearly correspond to some semantic codes.) There's also the "name" field but that appears to largely be a noisier version of "description" also including the unit of time for the series, e.g., monthly. A couple thousand entries do not have a "description" value, which might actually be a data structuring issue.
Here are some examples of the "series_id" and "description" values:
I took a quick and dirty pass at "description" keywords one might be able to use to decompose the 17,000+ series into clearer groupings. However, these conditions aren't mutually exclusive, so to get clean sets we'd need different groupings or extra logic.
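To make the quick-and-dirty keyword pass concrete, here's a minimal pandas sketch of tagging each series with the "description" keywords it matches. The keyword list and sample rows are illustrative assumptions, and because matches aren't mutually exclusive, the boolean `*_match` columns make the overlaps explicit rather than forcing clean groups:

```python
import pandas as pd

# Hypothetical keyword list; the real groupings would need iteration.
KEYWORDS = ["withdrawals", "receipts", "deliveries", "storage"]

def keyword_groups(meta: pd.DataFrame) -> pd.DataFrame:
    """Tag each series with every keyword its description mentions."""
    out = meta.copy()
    desc = out["description"].fillna("").str.lower()
    for kw in KEYWORDS:
        out[f"{kw}_match"] = desc.str.contains(kw)
    return out

meta = pd.DataFrame({
    "series_id": ["A", "B", "C"],  # placeholder IDs, not real EIA series
    "description": [
        "Natural Gas Deliveries to Residential Consumers",
        "Underground Storage Withdrawals",
        None,  # a couple thousand series lack descriptions
    ],
})
tagged = keyword_groups(meta)
```

Turning the overlaps into clean, mutually exclusive sets would then require either a keyword precedence order or extra logic on top of these flags.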
Sweet, thanks for digging into the bulk data @davidmudrauskas - hope finding all these quirks was fun in some way :)
As for next steps, following in the footsteps of EIA 860 seems like a good start! Some more confusing bits that you might have already figured out:
- The `Datastore` class, defined in `src/pudl/workspace/datastore.py`, which knows where to get the EIA 176 data from (either an online archive, or a number of caching layers). You'll need to update the `ZenodoDoiSettings` class to know what to do about EIA 176 data. It looks like we already have a production and sandbox archive, so at least you don't have to worry about generating those 😅

Thanks! Yeah, I'd tracked down the DOI 😁 Looks like you all don't have generic CSV extracting yet, so I've drafted a basic class for that, and I think I've found the entry points for the other major operations. Should have something to look at soon.
I have a decent idea going in a branch I'll push soon once I get pre-commit hooks resolved, then maybe I can get some feedback. Let me know and I can adjust the pace too.
If we do get a generic CSV extractor set up, all of the FERC-714 data from 2020 and earlier is stored as CSVs, and it could be applied there too.
Responding to this question from #3264 here since I think it's more general to our integration of the EIA's gas data:
EIA 176 zipfiles also bundle a few other forms - 191 and 757. Where do we want to extract and process these datasets? As separate modules, or as part of the EIA 176 extraction?
I wouldn't interpret the "EIA-176" label narrowly. There's nothing particularly special about that form, and as noted in #2603 initially we thought that all of the data in this bulk download zipfile was EIA-176, but it turns out there were these other associated forms related to gas production, storage, etc. IIRC in my initial digging around, it seemed like some of it looked usable, and some of it was both very messy and only had a small number of years of data included. Almost like they accidentally dumped the data in this CSV once and then never looked at it again (which means the data is somewhere else, but... where?).
Rather than focusing on the forms in particular, I'd try and identify the subset of data that we've got archived and ready to process which are actually worth cleaning up and turning into tables -- like there's a significant amount of data and it's a tractable problem without a huge investment of manual effort. Or at least prioritize them in terms of "person-hours per unit of data integrated"
~Another thing that we should look at is how the data available from this obscure zipfile compares / relates to the bulk natural gas API data available from EIA~ *Edit:* I see @davidmudrauskas already took a stab at this above and no dice!
My guess is that like the EIA-860 and EIA-923 spreadsheets, the data in this zipfile is (part of) what gets fed into the bulk data and more polished monthly/annual gas reports they publish, but that there's other data coming from other forms too, and the bulk API data and glossy reports probably do not reflect the full detail of the data that's in the original submitted form responses (whatever form they might take). I think we're probably looking for the long historical record of inputs that go into making their natural gas data products. Some of which looks like it goes back (from somewhere) as far as the 1970s.
The EIA Natural Gas Annual Report refers to forms EIA-176, EIA-895, and EIA-910 in Appendix A: "summary of data collection operations and report methodology" though weirdly EIA-895 doesn't show up on their big list of forms.
I assume that the EIA-757 (Natural Gas Processing Plant Survey) will probably relate to some of the same pipeline infrastructure that's reported in PHMSA. Compressor stations, NGL precipitators, H2S removal facilities. I assume linking the facilities and their owners/operators between the EIA and PHMSA data will be another entity matching circus that we probably don't want to get into now, but since we're also working on the PHMSA pipelines extractions now, being able to compare that data and EIA-757 at the same time might help us understand how they relate and map out a plan for future integrations.
Similarly I'd guess that the EIA-191 (monthly underground natural gas storage report) and EIA-191L (monthly LNG storage report) will have some relationship to the gas storage facility data that comes out of PHMSA, so those would be good to look at in tandem too, but I don't think we've gotten to the gas storage facility data in PHMSA yet, so maybe we don't prioritize EIA-191 yet if we have to choose.
One alternative method to access 176/191/757 data that may or may not work:
Excel: It appears from inspecting this site for 176 downloads that if you `POST` to the URL https://www.eia.gov/naturalgas/ngqs/data/export/xls you're able to export an Excel spreadsheet of 176 data for the years requested for one of the subsections of 176 data. It's unclear exactly what needs to be POSTed.
CSV: Similarly but more simply, `GET`ing this URL returns all EIA 176 company data in JSON format from 1997-2022 (www.eia.gov/naturalgas/ngqs/data/report/RP6/data/1997/2022/ICA/ID_Name). This is also true for the 191 and 757 forms on the website. Data is available for some forms through 2023. This returns both ID and Name for the years and dataset selected.
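The observed URL pattern can be captured in a tiny helper. `RP6` is the report code seen for EIA-176 company data; the meaning of the trailing `ICA` and `ID_Name` path segments is an assumption still to be verified against the site:

```python
# Base of the NGQS report endpoint observed above.
BASE = "https://www.eia.gov/naturalgas/ngqs/data/report"

def ngqs_url(report: str, start_year: int, end_year: int,
             segment: str = "ICA", columns: str = "ID_Name") -> str:
    """Build a GET URL like the one that returns EIA-176 company data.

    The segment/columns defaults mirror the example URL; their semantics
    are unconfirmed.
    """
    return f"{BASE}/{report}/data/{start_year}/{end_year}/{segment}/{columns}"

url = ngqs_url("RP6", 1997, 2022)
# requests.get(url).json() would then fetch the data (not executed here).
```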
Observations about this GET endpoint behavior:
In short, I think there are hidden endpoints here that don't involve hand mapping LINE columns and would produce hopefully more usable CSVs for raw data integration, and I'd like to explore these a bit more before we commit to the bundled data. The CSV extractor built to handle the bundled data should still be usable here.
@davidmudrauskas I know you had mentioned maybe starting to map LINE columns as a next step, so I'd suggest just pausing on that if you've started already since there may be a possibility this isn't necessary.
EIA-895 does show up on the ancient form page: https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s895
Notes on reviewing the freshly extracted data from #3402:
- The `area` field should be made into a categorical that maps to our political jurisdictions table and the standard state abbreviations. It looks like there will be a couple of new values like "Federal Gulf of Mexico" that need to be created.
- `atype`: What do those codes stand for? Looks like it indicates what type of quantity is being reported in the record...
  - `VL` => Volume (unit is ???)
  - `CT` => Count (e.g. number of customers)
  - `CS` => Sales Revenue (USD presumably?)
  - `YA` => year end storage capacity (some kind of volume?)
- `company` => `company_name`
- The `company` field also includes lots of "total" values. We'll probably want to remove these to avoid duplicate reporting of various quantities. If we can identify what groups of records they correspond to, they may be useful in validating the data for internal self-consistency. However, this kind of validation can be very tedious if it's hard to identify the row groups programmatically.
- What does the `id` column refer to, and can it actually be used with those foreign IDs?
- The relationship between `company` and `id` is loose -- almost 2/3 of the time, 2 `company` values are associated with a single `id` value. In the other direction it's much less unique. The same company name is frequently associated with many different IDs.
- When an `id` maps to 2 different `company` values, it seems to be that one of them is a record describing an individual company, and the other is a record for "total of all companies". So maybe it's just what happens when there's only a single company operating in the state? So the same ID ends up associated with both the company's record and the state / area total record?

```python
n_co = eia176.groupby("id").company.transform("nunique")
eia176.loc[n_co == 2, ["company", "id"]].drop_duplicates().sort_values("id")
```

- `item` appears to be a description that indicates the meaning of the `line` field, but there are 85 different `item` values, and only 58 different `line` values, so there can't be a one-to-one mapping. Need to figure out how we can infer the variable being described in each record for reshaping.
- What does the `itemsort` field mean and do we need it? It seems closely related to the `line` field.
- The `value` column is extremely non-homogeneous, and no indication of units is given in the data. Seems like the `atype` column is the strongest indicator we have.
- `year` and `report_year` appear to be redundant, but `report_year` shows up as a `string` for some reason.
- `base_case`
- Do the `id` values in this table match the `id` values in the EIA-176? That would be amazing.
- We'll want a `report_date` column for this table with monthly resolution using our existing tooling.
- Unclear columns: `field_type`, `reservoir_code`, `status` (of what? the field?), `region`.
- What do the `region` values correspond to? Are they composed of states? Is it census regions? Do we already have these compiled in the political subdivisions table? Do we need to create a new column that corresponds to these region codes?

Several kinds of entities are being referenced in these tables, and could potentially be pulled out into their own separate tables and linked to the data via FKs:
> What does the `itemsort` field mean and do we need it? It seems closely related to the `line` field.
This is a short code for the line numbers on the form, with some lines referring to a combination of two rows (e.g. `[10.1 + 11.1]`). See p.16 of the NGQS guide.
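Assuming the codes follow the bracketed pattern quoted above, pulling the individual form line numbers out of an `itemsort` code is a one-line regex (this is a sketch, not the agreed parsing logic):

```python
import re

def itemsort_lines(code: str) -> list[str]:
    """Extract the individual form line numbers from an itemsort code,
    e.g. "[10.1 + 11.1]" -> ["10.1", "11.1"]."""
    return re.findall(r"\d+(?:\.\d+)?", code)
```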
> `year` and `report_year` appear to be redundant, but `report_year` shows up as a `string` for some reason.
We added `report_year` manually in the extraction because `year` wasn't reported in EIA 176 data but was determined by the URL call made during extraction - we should drop it for this dataset.
In terms of the lines and definitions, we'll want to refer to the NGQS guide - see Appendix A for definitions of all the `atype` and `line` and `item` mappings, though I was hoping we wouldn't have to do this manually with all the fields.
Some observations on EIA 176 IDs:
This is out of date given current data availability, so I'm moving it out of the PR description and archiving it below:
Priority: High. Source of raw data: Bundled (hopefully)
This table will include data on the company filling out the form, including their address, company characteristics and distribution territory. This will include Parts 1, 3, and 7 from EIA Form 176. The primary key should be EIA ID number, year, and state. Some of these characteristics (e.g., company address) could get harvested across states or years, but this seems somewhat low-value for now.
Priority: High. Source of raw data: Bundled (hopefully)
This table will include data on a company's natural gas sources and dispositions, with each row representing one state's data. This will include Part 4 and Part 6 of Form EIA 176. This will give us information on international and cross-state gas transfers, numbers of residential, commercial, industrial etc end-use consumers receiving natural gas, and volumes and revenues associated with these dispositions. The primary key will be EIA ID number, year and state.
Ideally if a footnote in Part 7B is included, it should be attached in a column to the relevant data it refers to.
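Whatever primary key we settle on, a quick pandas check can confirm it's actually unique. The column names and sample rows here are hypothetical:

```python
import pandas as pd

def pk_violations(df: pd.DataFrame, pk: list[str]) -> pd.DataFrame:
    """Return the rows that collide on the proposed primary key."""
    return df[df.duplicated(subset=pk, keep=False)]

# Hypothetical rows: two filings collide on (id, year, state).
sample = pd.DataFrame({
    "id": ["176001", "176001", "176002"],
    "year": [2022, 2022, 2022],
    "state": ["KS", "KS", "IA"],
    "volume": [1.0, 2.0, 3.0],
})
dupes = pk_violations(sample, ["id", "year", "state"])
```

Running this against the real extracted tables would tell us quickly whether (EIA ID, year, state) holds up as a key or needs another column.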
Priority: medium. Source of data: ?? + Bundled
This table will include EIA 191L data (see above) and Part 5 of the EIA 176 data. The primary key should be EIA ID number, year, and facility. The data included from EIA 191L should be from the end of December, and the volume and capacity should be directly comparable in the table so they can be validated against one another. 191L data does not seem to be included in all_data_191.csv so it would need to be obtained through a separate source. As a first step, we could just include EIA 176 data for now.
Ideally if a footnote in 176 Part 7B is included, it should be attached in a column to the relevant data it refers to.
Priority: medium. Source of data: Bundled
This table will include EIA 191 data. The primary key should be EIA ID number, year-month, and facility (field_name), with each row representing one facility's report in a given month. The EIA 191 form is released monthly. all_data_191.csv looks like it has actual column names that correspond to the form, so we won't have to deal with the frustrating LINE renames here.
Priority: medium. Source of data: ??
This table will include EIA 191L data, which isn't part of the bulk natural gas zipfile as far as I can tell. The primary key should be EIA ID number, year-month, and facility, with each row representing one facility's report in a given month.
Priority: medium. Source of data: Bundled (incomplete) or ??
This table will include data from EIA Form 757 Schedule A. Schedule A, the Baseline Report is filled out no more often than every 3 years, and includes data on capacity, status, operations, and connecting infrastructure of natural gas processing plants. all_data_757.csv seems to only contain Part 1 and Part 5 of the report and does not include EIA ID, so we'd need to track down a downloadable form of the remaining table if we wanted to include it, which would probably involve using the API data. The primary key should be EIA ID, date of filing, and some identifying combination of plant fields (name and address, e.g.), with each row representing one plant's report.
Priority: medium. Source of data: ??
This table will include data from EIA Form 757 Schedule B. Schedule B monitors the post-emergency operational status of natural gas processing plants. all_data_757.csv does not contain this data so we'd need to track down a downloadable form of the data if we wanted to include it, which would probably involve using the API data. Primary key should be EIA ID, date of filing, and some combination of plant identifiers as required, with each row representing one plant's report.
What tables are we sure we'll need?
- `core_eia176__yearly_natural_gas_sources_dispositions`: A tidy / normalized version of everything in `raw_eia176__data` that seems to be reported on an (ID, year, area) basis. I can imagine this table being broken out into separate thematic tables if the reshaping / normalization is simpler that way, because the categorical values (e.g. customer type) don't apply to all the columns we've got after tidying. I think it'll be easier to identify the natural normalization of this data once it's been reshaped, so we should check in again at that point.
- `core_eia191__monthly_natural_gas_storage`: `raw_eia191__data` seems like it only contains data for this one table.
- `core_eia757a__natural_gas_processing_plant`: `raw_eia757a__data` seems like it only contains data for this one table. (Not really yearly... irregular? But it's always a year.)

For most of our other datasets, we've called respondents "utilities" but it's not clear that applies here. What entity name do we want to use? `respondent`? `company`? "Form respondent" seems like the most generic and generally applicable, but isn't very descriptive.
If there isn't much information associated with a respondent, and it only appears in a single table, it's probably not worth breaking an entity table out. E.g. in the case of the `eia176` table, it seems like all we have is a respondent ID and name, and the IDs don't show up in the 191 data or anywhere else that we know of yet, so leaving the IDs & names in that table even if there's some duplication doesn't seem so terrible.
- `core_eia176__{entity|yearly}_respondents`
- `core_eia191__{entity|yearly}_respondents`
- `core_eia191__{entity|yearly}_geologic_reservoirs` (just saying `_reservoirs` feels pretty vague)
- `core_eia191__{entity|yearly}_gas_storage_fields` (just having `_fields` feels too vague)
- `core_eia757a__{entity|yearly}_operators` (maybe operators and owners are actually both just respondents?)
- `core_eia757a__{entity|yearly}_owners`
I've been looking at the EIA-191 data and the additions to it mentioned here and feel I can take on that work. Looks like the PK is defined already, and the columns that may require an ENUM or FK only have a few unique values there. One open question here for me is which, if any, fields need to be normalized and how (saw some mention of this in a previous comment).
I haven't looked at it in a bit, but it seemed like there was probably only one real data table to be made from the EIA-191, and that as of yet there wasn't much benefit to stripping out entity (respondent) fields to make a separate more normalized table. So I'd probably try making a single core table for now.
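For the low-cardinality columns mentioned above, encoding them as pandas categoricals makes out-of-enum values surface early. The column name and allowed values below are illustrative assumptions, not the confirmed EIA-191 enumerations:

```python
import pandas as pd

# Assumed enumeration for a low-cardinality EIA-191 column; the actual
# allowed values would be confirmed from the extracted data.
STATUS_VALUES = ["Active", "Inactive"]

df = pd.DataFrame({"status": ["Active", "Inactive", "Active"]})
df["status"] = pd.Categorical(df["status"], categories=STATUS_VALUES)

# Any value outside the enumeration becomes NaN, so this count is a
# cheap validation check.
n_bad = int(df["status"].isna().sum())
```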
Description
Integrate the EIA-176 natural gas sources & dispositions data into PUDL.
What we've been calling the EIA-176 is actually 3 different related forms that are bundled together for bulk distribution:
Each of these forms uses the same IDs to refer to the reporting companies, and that shared company ID information is provided as a separate standalone table with the company ID, name, and activity status. Note that there is no date information associated with the company information, so the activity status probably just pertains to when the reporting was done, and there are no historical archives, so this field is pretty useless. Presumably we'll be able to guess which companies are active based on whether they're reporting data in the other tables?
There's also a lot more natural gas data available from EIA that we might be able to download in bulk from their API or other hidden endpoints.
Motivation
This gas source, disposition, storage, and processing plant data should help us target existing natural gas utilities and the capital locked up in existing infrastructure for early retirement, and may help advocates prevent new investments in natural gas facilities that would need to be decommissioned well before end of life to maintain a stable climate.
In Scope
We'll know we're done when:
Known data issues
~The LINE column is a sort of `row_id` in that the number corresponds to a particular variable reported on the form, and the mapping of number to variable evolves over time, so we'll need to do some kind of pivot of the data, and an alignment of those LINE number meanings across years. The complete list of LINE numbers can be found on p. 32 here, but has also certainly changed over time.~ This is out of date, see below for discussion of `line` and `itemsort` in newly extracted data.