catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Integrate EIA 176, 191 and 757A into PUDL #2603

Open zaneselvans opened 1 year ago

zaneselvans commented 1 year ago

Description

Integrate the EIA-176 natural gas sources & dispositions data into PUDL.

What we've been calling the EIA-176 is actually 3 different related forms that are bundled together for bulk distribution:

Each of these forms uses the same IDs to refer to the reporting companies, and that shared company ID information is provided as a separate standalone table with the company ID, name, and activity status. Note that there is no date information associated with the company information, so the activity status probably just pertains to when the reporting was done, and there are no historical archives, so this field is pretty useless. Presumably we'll be able to guess which companies are active based on whether they're reporting data in the other tables?

There's also a lot more natural gas data available from EIA that we might be able to download in bulk from their API or other hidden endpoints.

Motivation

This gas source, disposition, storage, and processing plant data should help us target existing natural gas utilities and the capital locked up in existing infrastructure for early retirement, and may help advocates prevent new investments in natural gas facilities that would need to be decommissioned well before end of life to maintain a stable climate.

In Scope

We'll know we're done when:

Known data issues

### Initial data selection and archiving
- [x] Check whether bulk EIA Natural Gas downloads contain all of this data in a more usable form.
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/262
### Extraction
- [x] Create an EIA-176 extractor module that can load these CSV files into dataframes.
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3339
- [ ] #562
- [x] Double check whether EIA 757A data can be exported with IDs - it can't
### EIA-176 Data Transformation
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3501
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3555
- [ ] Meet: Once the table is in a relatively tidy format with a PK of `(report_year, area, id)`, decide if there are other repeated categorical attributes of the columns that should be pulled out into a new element of the PK, e.g. a `customer_type` column that includes residential, commercial, industrial, and transport categories, in which the same measured variable (e.g. number of customers) is being reported for each category. Decide whether to split the one table into multiple tables with different PKs.
- [ ] Standardize units to match other PUDL tables (e.g. convert thousands of dollars to dollars), standardize NA values (we use `pd.NA` and the nullable pandas types for everything but floating point data, which uses `np.nan`)
- [ ] Define/Describe all new fields in `pudl.metadata.fields` and define a datatype for each field. Each field should have units, and these should be found in the PDF form if not specified in the raw data.
- [ ] Should we retain the `atype` column and clarify what the short codes mean, or does it become superfluous once the table has been reshaped into a tidy format?
- [ ] Convert the `area` column to use standardized US state abbreviations (or other standard short codes for the other non-US states). Integrate with our existing political subdivisions table.
- [ ] Use the `Total of All Companies` entries to verify sums by (area, year) during transform, and then drop the total records.
- [ ] Define data validations as `asset_checks` in dagster - e.g., expected row counts, reasonable values, no entirely null columns
- [ ] Are there any items in the original filings that aren't in our exported data? Are any of them vital? Is there any way to get that data programmatically?
### EIA-191 Data Transformation
- [ ] Define the meaning, units, and types of all the new columns in `pudl.metadata.fields`
- [ ] Construct a `report_date` column with monthly granularity based on the `report_year` and `report_month` columns (we have a helper function for this)
- [ ] Identify the natural primary key of this table.
- [ ] Define `snake_case` categorical values for `field_type` and `status` and enforce either with an ENUM constraint, or a FK relationship to a new coding table.
- [ ] Define a new coding table for the `reservoir_code` values explaining what they mean and use a FK relationship to constrain the values.
- [ ] Infer FIPS code for state and county using `pudl.helpers.add_fips_ids()` See `pudl.transform.eia861` for example usage.
- [ ] Figure out the meaning of `region` and whether we can/should integrate it with our existing political subdivisions table.
- [ ] Identify what other set of codes the `gas_field_code` refers to and check that it can be used with this data meaningfully.
- [ ] Define data validations as asset_checks in dagster - e.g., expected row counts, reasonable values, no entirely null columns
### Cleaning and integration - EIA 757A
- [ ] Define the meaning, units, and types of all the new columns in `pudl.metadata.fields`
- [ ] Infer FIPS code for state and county using `pudl.helpers.add_fips_ids()` See `pudl.transform.eia861` for example usage.
### Integrate data from across EIA datasets
- [ ] Ask EIA how their codes work, and if we can combine entities across these tables
- [ ] Create an entity table harvesting company data from EIA 176 and 191 (at least `company_id` and `company_name`). Verify that the same IDs are being used to identify the same companies in both Forms 176 and 191. Are the names consistent across all years of data? If so then it's an entity table. If not, then it's an annually varying table.
- [ ] Identify duplicated entity information being reported in the EIA-191, and determine whether it's worth doing further normalization to create separate `field` or `reservoir` tables
- [ ] Split zipcode into 5 digit column + optional 4 digit column (as strings with leading zeroes)
davidmudrauskas commented 9 months ago

I'm in a place to pick this up with consistent attention. For starters, the "bulk" data isn't a more usable form for ingesting EIA-176 data and related data. That leaves what I've come to think of as the "bundled" data as the source for this. (At some point I'd like more insight on how developers have discovered these data endpoints.) For next steps, I plan to put together a basic outline of extraction. That will inform the data model, potentially any additional requirements, and lead into the other tasks here. It looks like I can follow preexisting patterns of form-specific extraction logic in pudl/extract, e.g., eia860.py, eia923.py.

Elaboration on bulk data

The "bulk" data doesn't contain the company-level information from form EIA-176. Searching for a handful of attributes corresponding to the bundled EIA-176 company data (all_company_176.csv) turned up no results, e.g. 17600032KS, ABBYVILLE, 17600033IA, MOULTON, VILLAGE. Trying to parse out aggregates of EIA-176 data looks like it would be gnarly, to the extent that it's even feasible or of much value.

On the whole the bulk data is pretty disorganized. It comprises many different types of series and no single column exhibits low-cardinality values that would easily separate them. Ideally I'd expect one or more column(s) clearly indicating withdrawals vs receipts, etc. To get those clean attributes/dimensions, one could potentially parse them from the "description" field or the "series_id" field once there's a clear mapping of the components of "series_id". (The "series_id" components clearly correspond to some semantic codes.) There's also the "name" field but that appears to largely be a noisier version of "description" also including the unit of time for the series, e.g., monthly. A couple thousand entries do not have a "description" value, which might actually be a data structuring issue.

Here are a some examples of the "series_id" and "description" values:

I took a quick and dirty pass at "description" keywords one might be able to use to decompose the 17,000+ series into clearer groupings. However, these conditions aren't mutually exclusive, so to get clean sets we'd need different groupings or extra logic.

jdangerx commented 9 months ago

Sweet, thanks for digging into the bulk data @davidmudrauskas - hope finding all these quirks was fun in some way :)

As for next steps, following in the footsteps of EIA 860 seems like a good start! Some more confusing bits that you might have already figured out:

davidmudrauskas commented 9 months ago

Thanks! Yeah, I'd tracked down the DOI 😁 Looks like you all don't have generic CSV extracting yet so I've drafted a basic class for that, and I think I've found the entry points for the other major operations. Should have something to look at soon.

davidmudrauskas commented 9 months ago

I have a decent idea going in a branch I'll push soon once I get pre-commit hooks resolved, then maybe I can get some feedback. Let me know and I can adjust the pace too.

zaneselvans commented 7 months ago

If we do get a generic CSV extractor set up, all of the FERC-714 data from 2020 and earlier is stored as CSVs, and it could be applied there too.

zaneselvans commented 5 months ago

Responding to this question from #3264 here since I think it's more general to our integration of the EIA's gas data:

EIA 176 zipfiles also bundle a few other forms - 191 and 757. Where do we want to extract and process these datasets? As separate modules, or as part of the EIA 176 extraction?

I wouldn't interpret the "EIA-176" label narrowly. There's nothing particularly special about that form, and as noted in #2603 initially we thought that all of the data in this bulk download zipfile was EIA-176, but it turns out there were these other associated forms related to gas production, storage, etc. IIRC in my initial digging around, it seemed like some of it looked usable, and some of it was both very messy and only had small number of years of data included. Almost like they accidentally dumped the data in this CSV once and then never looked at it again (which means the data is somewhere else, but... where?).

Rather than focusing on the forms in particular, I'd try and identify the subset of data that we've got archived and ready to process which are actually worth cleaning up and turning into tables -- like there's a significant amount of data and it's a tractable problem without a huge investment of manual effort. Or at least prioritize them in terms of "person-hours per unit of data integrated"

~Another thing that we should look at is how the data available from this obscure zipfile compares / relates to the bulk natural gas API data available from EIA~ Edit I see @davidmudrauskas already took a stab at this above and no dice!

My guess is that like the EIA-860 and EIA-923 spreadsheets, the data in this zipfile is (part of) what gets fed into the bulk data and more polished monthly/annual gas reports they publish, but that there's other data coming from other forms too, and the bulk API data and glossy reports probably do not reflect the full detail of the data that's in the original submitted form responses (whatever form they might take). I think we're probably looking for the long historical record of inputs that go into making their natural gas data products. Some of which looks like it goes back (from somewhere) as far as the 1970s.

The EIA Natural Gas Annual Report refers to forms EIA-176, EIA-895, and EIA-910 in Appendix A: "summary of data collection operations and report methodology" though weirdly EIA-895 doesn't show up on their big list of forms.

I assume that the the EIA-757 (Natural Gas Processing Plant Survey) will probably relate to some of the same pipeline infrastructure that's reported in PHMSA. Compressor stations, NGL precipitators, H2S removal facilities. I assume linking the facilities and their owners/operators between the EIA and PHMSA data will be another entity matching circus that we probably don't want to get into now, but since we're also working on the PHMSA pipelines extractions now, being able to care that data and EIA-757 at the same time might help us understand how they relate and map out a plan for future integrations.

Similarly I'd guess that the EIA-191 (monthly underground natural gas storage report) and EIA-191L (monthly LNG storage report) will have some relationship to the gas storage facility data that comes out of PHMSA, so those would be good to look at in tandem too, but I don't think we've gotten to the gas storage facility data in PHMSA yet, so maybe we don't prioritize EIA-191 yet if we have to choose.

e-belfer commented 5 months ago

One alternative method to access 176/191/757 data that may or may not work:

Excel: It appears from inspecting this site for 176 downloads that if you POST to the URL https://www.eia.gov/naturalgas/ngqs/data/export/xls you're able to export an Excel spreadsheet of 176 data for the years requested for one of the subsections of 176 data. It's unclear exactly what needs to be POSTed.

CSV: Similarly but more simply, GETing this URL returns all EIA 176 company data in a JSON format from 1997-2022 (www.eia.gov/naturalgas/ngqs/data/report/RP6/data/1997/2022/ICA/ID_Name). This is also true for the 191 and 757 forms on the website. Data is available for some forms through 2023. This returns both ID and Name for the years and dataset selected.

Observations about this GET endpoint behavior:

In short, I think there are hidden endpoints here that don't involve hand mapping LINE columns and would produce hopefully more usable CSVs for raw data integration, and I'd like to explore these a bit more before we commit to the bundled data. The CSV extractor built to handle the bundled data should still be usable here.

@davidmudrauskas I know you had mentioned maybe starting to map LINE columns as a next step, so I'd suggest just pausing on that if you've started already since there may be a possibility this isn't necessary.

zaneselvans commented 5 months ago

EIA-895 does show up on the ancient form page: https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s895

zaneselvans commented 3 months ago

Notes on reviewing the freshly extracted data from #3402:

EIA-176

n_co = eia176.groupby("id").company.transform("nunique")
eia176.loc[n_co == 2, ["company", "id"]].drop_duplicates().sort_values("id")

EIA-191

EIA-757A

Entities

Several kinds of entities are being referenced in these tables, and could potentially be pulled out into their own separate tables and linked to the data via FKs:

e-belfer commented 3 months ago

What does the itemsort field mean and do we need it? It seems closely related to the line field.

This is a short code for the line numbers on the form, with some lines referring to a combination of two rows (e.g. [10.1 + 11.1]). See p.16 of the the NGQS guide.

year and report_year appear to be redundant, but report_year shows up as a string for some reason.

We added report_year manually in the extraction because year wasn't reported in EIA 176 data but was determined by the URL call made during extraction - we should drop it for this dataset.

In terms of the lines and definitions, we'll want to refer to the NGQS guide - see Appendix A for definitions of all the atype and line and item mappings, though I was hoping we wouldn't have to do this manually with all the fields.

e-belfer commented 3 months ago

Some observations on EIA 176 IDs:

e-belfer commented 3 months ago

This is out of date given current data availability, so I'm moving it out of the PR description and archiving it below:

Proposed table design

core_eia176__yearly_companies

Priority: High Source of raw data: Bundled (hopefully)

This table will include data on the company filling out the form, including their address, company characteristics and distribution territory. This will include Parts 1, 3, and 7 from EIA Form 176. The primary key should be EIA ID number, year, and state. Some of these characteristics could get harvested across states or years (e.g., company address) and characteristics, but this seems somewhat low-value add for now.

core_eia176__yearly_sources_dispositions

Priority: High Source of raw data: Bundled (hopefully)

This table will include data on a company's natural gas sources and dispositions, with each row representing one state's data. This will include Part 4 and Part 6 of Form EIA 176. This will give us information on international and cross-state gas transfers, numbers of residential, commercial, industrial etc end-use consumers receiving natural gas, and volumes and revenues associated with these dispositions. The primary key will be EIA ID number, year and state.

Ideally if a footnote in Part 7B is included, it should be attached in a column to the relevant data it refers to.

core_eia__yearly_liquefied_natural_gas_storage

Priority: medium Source of data: ?? + Bundled

This table will include EIA 191L data (see above) and Part 5 of the EIA 176 data. The primary key should be EIA ID number, year, and facility. The data included from EIA 191L should be from the end of December, and the volume and capacity should be directly comparable in the table so they can be validated against one another. 191L data does not seem to be included in all_data_191.csv so it would need to be obtained through a separate source. As a first step, we could just include EIA 176 data for now.

Ideally if a footnote in 176 Part 7B is included, it should be attached in a column to the relevant data it refers to.

core_eia191__monthly_underground_storage

Priority: medium Source of data: Bundled

This table will include EIA 191 data. The primary key should be EIA ID number, year-month, and facility (field_name), with each row representing one facility's report in a given month. The EIA 191 form is released monthly. all_data_191.csv looks like it has actual column names that correspond to the form, so we won't have to deal with the frustrating LINE renames here.

core_eia191__monthly_liquefied_natural_gas_storage

Priority: medium Source of data: ??

This table will include EIA 191L data, which isn't part of the bulk electricity zipfile as far as I can tell. The primary key should be EIA ID number, year-month, and facility, with each row representing one facility's report in a given month.

core_eia757__{freq}_processing_plants

Priority: medium Source of data: Bundled (incomplete) or ??

This table will include data from EIA Form 757 Schedule A. Schedule A, the Baseline Report is filled out no more often than every 3 years, and includes data on capacity, status, operations, and connecting infrastructure of natural gas processing plants. all_data_757.csv seems to only contain Part 1 and Part 5 of the report and does not include EIA ID, so we'd need to track down a downloadable form of the remaining table if we wanted to include it, which would probably involve using the API data. The primary key should be EIA ID, date of filing, and some identifying combination of plant fields (name and address, e.g.), with each row representing one plant's report.

core_eia757__{freq}_processing_disruptions

Priority: medium Source of data: ??

This table will include data from EIA Form 757 Schedule B. Schedule B monitors post-emergency natural gas processing plans operational statuses. all_data_757.csv does not contain this data so we'd need to track down a downloadable form of the data if we wanted to include it, which would probably involve using the API data. Primary key should be EIA ID, date of filing, and some combination of plant identifiers as required, with each row representing one plant's report.

zaneselvans commented 3 months ago

What tables are we sure we'll need?

Data Tables

Entity Tables

For most of our other datasets, we've called respondents "utilities" but it's not clear that applies here. What entity name do we want to use? respondent? company? Form respondent seems like the most generic and generally applicable but isn't very descriptive.

If there isn't much information associated with a respondent, and it only appears in a single table, it's probably not worth breaking an entity table out. E.g. in the case of the eia176 table, it seems like all we have is a respondent ID and name, and the IDs don't show up in the 191 data or anywhere else that we know of yet, so leaving the IDs & names in that table even if there's some duplication doesn't seem so terrible.

denimalpaca commented 1 month ago

I've been looking at the EIA-191 data and the additions to it mentioned here and feel I can take on that work. Looks like the PK is defined already, and the columns that may require an ENUM or FK only have a few unique values there. One open question here for me is which, if any, fields need to be normalized and how (saw some mention of this in a previous comment).

zaneselvans commented 1 month ago

I haven't looked at it in a bit, but it seemed like there was probably one one real data table to be made in the EIA-191, and that as of yet there wasn't much benefit to stripping out entity (respondent) fields to make a separate more normalized table. So I'd probably try making a single core table for now.