catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.

https://catalyst.coop/pudl

MIT License

456 stars 105 forks source link

Integrate EIA 176, 191 and 757A into PUDL #2603

Open zaneselvans opened 1 year ago

zaneselvans commented 1 year ago

Description

Integrate the EIA-176 natural gas sources & dispositions data into PUDL.

What we've been calling the EIA-176 is actually 3 different related forms that are bundled together for bulk distribution:

EIA-176: Annual Report of Natural and Supplemental Gas Supply & Disposition. Data collected include the origin of natural gas supplies and the disposition of natural gas on a State basis. Respondents include natural and synthetic gas producers, processors, distributors, storage.
EIA-191: Monthly Underground Natural Gas Storage Report. Working and base gas in reservoirs, injections, withdrawals, and location of reservoirs are reported by operators of all underground natural gas storage fields on a monthly basis.
EIA-757: Natural Gas Processing Plant Survey. Detailed plant-level information is collected on the capacity, status, and operations of natural gas processing plants to monitor supply constraints during periods of disruption due to a natural disaster, such as a hurricane.

Each of these forms uses the same IDs to refer to the reporting companies, and that shared company ID information is provided as a separate standalone table with the company ID, name, and activity status. Note that there is no date information associated with the company information, so the activity status probably just pertains to when the reporting was done, and there are no historical archives, so this field is pretty useless. Presumably we'll be able to guess which companies are active based on whether they're reporting data in the other tables?

There's also a lot more natural gas data available from EIA that we might be able to download in bulk from their API or other hidden endpoints.

Motivation

This gas source, disposition, storage, and processing plant data should help us target existing natural gas utilities and the capital locked up in existing infrastructure for early retirement, and may help advocates prevent new investments in natural gas facilities that would need to be decommissioned well before end of life to maintain a stable climate.

In Scope

We'll know we're done when:

EIA-176, 191, and 757 data are loaded into the PUDL DB by the main DAG. It looks like this will probably result in at least 3 core data tables: sources & dispositions (176), underground storage (191), and gas processing facilities (757A), and may also include some number of entity tables describing the companies/respondents, fields, reservoirs, facilities, etc. if warranted -- i.e. if there's enough data duplication or enough attributes we'd like to separate out in association with one of these entities.
Tables have well-defined schemas, including data types and descriptions for all columns.
Tables are in a tidy format, with each column representing a single variable, and each row representing a single observation.
Updates to data should be easy, and as automated as practicable, alongside our other existing data.
Tables should be fully normalized and cleaned, and should be easily accessed within our existing frameworks for incremental improvement and integration with other data sources, or use in future projects.

Known data issues

~The EIA-176 and EIA-176 "other" tables have some cryptic codes and "LINE" numbers in them. Based on #562 it seems that LINE is similar to FERC's row_id in that the number corresponds to a particular variable reported on the form, and the mapping of number to variable evolves over time, so we'll need to do some kind of pivot of the data, and an alignment of those LINE number meanings across years. The complete list of LINE numbers can be found on p. 32 here, but has also certainly changed over time.~ This is out of date, see below for discussion of line and itemsort in newly extracted data.

### Initial data selection and archiving
- [x] Check whether bulk EIA Natural Gas downloads contain all of this data in a more usable form.
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/262

### Extraction
- [x] Create an EIA-176 extractor module that can load these CSV files into dataframes.
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3339
- [ ] #562
- [x] Double check whether EIA 757A data can be exported with IDs - it can't

### EIA-176 Data Transformation
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3501
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3555
- [ ] Meet: Once the table is in a relatively tidy format with a PK of `(report_year, area, id)`, decide if there are other repeated categorical attributes of the columns that should be pulled out into a new element of the PK, e.g. a `customer_type` column that includes residential, commercial, industrial, and transport categories, in which the same measured variable (e.g. number of customers) is being reported for each category. Decide whether to split the one table into multiple tables with different PKs.
- [ ] Standardize units to match other PUDL tables (e.g. convert thousands of dollars to dollars), standardize NA values (we use `pd.NA` and the nullable pandas types for everything but floating point data, which uses `np.nan`)
- [ ] Define/Describe all new fields in `pudl.metadata.fields` and define a datatype for each field. Each field should have units, and these should be found in the PDF form if not specified in the raw data.
- [ ] Should we retain the `atype` column and clarify what the short codes mean, or does it become superfluous once the table has been reshaped into a tidy format?
- [ ] Convert the `area` column to use standardized US state abbreviations (or other standard short codes for the other non-US states). Integrate with our existing political subdivisions table.
- [ ] Use the `Total of All Companies` entries to verify sums by (area, year) during transform, and then drop the total records.
- [ ] Define data validations as `asset_checks` in dagster - e.g., expected row counts, reasonable values, no entirely null columns
- [ ] Are there any items in the original filings that aren't in our exported data? Are any of them vital? Is there any way to get that data programmatically?

### EIA-191 Data Transformation
- [ ] Define the meaning, units, and types of all the new columns in `pudl.metadata.fields`
- [ ] Construct a `report_date` column with monthly granularity based on the `report_year` and `report_month` columns (we have a helper function for this)
- [ ] Identify the natural primary key of this table.
- [ ] Define `snake_case` categorical values for `field_type` and `status` and enforce either with an ENUM constraint, or a FK relationship to a new coding table.
- [ ] Define a new coding table for the `reservoir_code` values explaining what they mean and use a FK relationship to constrain the values.
- [ ] Infer FIPS code for state and county using `pudl.helpers.add_fips_ids()` See `pudl.transform.eia861` for example usage.
- [ ] Figure out the meaning of `region` and whether we can/should integrate it with our existing political subdivisions table.
- [ ] Identify what other set of codes the `gas_field_code` refers to and check that it can be used with this data meaningfully.
- [ ] Define data validations as asset_checks in dagster - e.g., expected row counts, reasonable values, no entirely null columns

### Cleaning and integration - EIA 757A
- [ ] Define the meaning, units, and types of all the new columns in `pudl.metadata.fields`
- [ ] Infer FIPS code for state and county using `pudl.helpers.add_fips_ids()` See `pudl.transform.eia861` for example usage.

### Integrate data from across EIA datasets
- [ ] Ask EIA how their codes work, and if we can combine entities across these tables
- [ ] Create an entity table harvesting company data from EIA 176 and 191 (at least `company_id` and `company_name`). Verify that the same IDs are being used to identify the same companies in both Forms 176 and 191. Are the names consistent across all years of data? If so then it's an entity table. If not, then it's an annually varying table.
- [ ] Identify duplicated entity information being reported in the EIA-191, and determine whether it's worth doing further normalization to create separate `field` or `reservoir` tables
- [ ] Split zipcode into 5 digit column + optional 4 digit column (as strings with leading zeroes)

davidmudrauskas commented 9 months ago

I'm in a place to pick this up with consistent attention. For starters, the "bulk" data isn't a more usable form for ingesting EIA-176 data and related data. That leaves what I've come to think of as the "bundled" data as the source for this. (At some point I'd like more insight on how developers have discovered these data endpoints.) For next steps, I plan to put together a basic outline of extraction. That will inform the data model, potentially any additional requirements, and lead into the other tasks here. It looks like I can follow preexisting patterns of form-specific extraction logic in pudl/extract, e.g., eia860.py, eia923.py.

Elaboration on bulk data

The "bulk" data doesn't contain the company-level information from form EIA-176. Searching for a handful of attributes corresponding to the bundled EIA-176 company data (all_company_176.csv) turned up no results, e.g. 17600032KS, ABBYVILLE, 17600033IA, MOULTON, VILLAGE. Trying to parse out aggregates of EIA-176 data looks like it would be gnarly, to the extent that it's even feasible or of much value.

On the whole the bulk data is pretty disorganized. It comprises many different types of series and no single column exhibits low-cardinality values that would easily separate them. Ideally I'd expect one or more column(s) clearly indicating withdrawals vs receipts, etc. To get those clean attributes/dimensions, one could potentially parse them from the "description" field or the "series_id" field once there's a clear mapping of the components of "series_id". (The "series_id" components clearly correspond to some semantic codes.) There's also the "name" field but that appears to largely be a noisier version of "description" also including the unit of time for the series, e.g., monthly. A couple thousand entries do not have a "description" value, which might actually be a data structuring issue.

Here are a some examples of the "series_id" and "description" values:

NG.NGM_EPG0_IML_NUS-NYE_MMCF.M U.S. Liquefied Natural Gas Imports From Yemen
NG.NGM_EPG0_EVE_NUS-NKS_DMCF.M Price of Liquefied U.S. Natural Gas Exports by Vessels to South Korea
NG.RNGC2.D Natural Gas Futures Contract 2
NG.NGM_EPG0_SAH_SAK_PER.M Alaska Natural Gas in Underground Storage - Change in Working Gas from Same Month Previous Year
NG.N5050832.MM East Region Natural Gas Injections into Underground Storage
NG.NGM_EPG0_MIR_SUT-SNNGM_MMCF.AM Utah Natural Gas Interstate Receipts From New Mexico
NG.E_ERTW0_XWWR_NUS_MDW.AM U.S. Real Cost per Crude Oil, Natural Gas, and Dry Well Drilled

I took a quick and dirty pass at "description" keywords one might be able to use to decompose the 17,000+ series into clearer groupings. However, these conditions aren't mutually exclusive, so to get clean sets we'd need different groupings or extra logic.

imports -- select count(*) from eia176_bulk where description like "%imports%"; -- 616
exports
contract
storage
deliveries
injections
receipts
drilled
surveying
supplemental
consumption
reserves
withdrawals
discoveries
production
value
price
capacity
processed
sales
vented

jdangerx commented 9 months ago

Sweet, thanks for digging into the bulk data @davidmudrauskas - hope finding all these quirks was fun in some way :)

As for next steps, following in the footsteps of EIA 860 seems like a good start! Some more confusing bits that you might have already figured out:

the extractor needs to know about a Datastore, defined in esrc/pudl/workspace/datastore.py, which knows where to get the EIA 176 data from (either an online archive, or a number of caching layers). You'll need to update the ZenodoDoiSettings class to know what to do about EIA 176 data. It looks like we already have a production and sandbox archive, so at least you don't have to worry about generating those 😅
I think that was actually the main bit I could think of off the top of my head, but I'm sure there's more. Let me know if you run into any issues!

davidmudrauskas commented 9 months ago

Thanks! Yeah, I'd tracked down the DOI 😁 Looks like you all don't have generic CSV extracting yet so I've drafted a basic class for that, and I think I've found the entry points for the other major operations. Should have something to look at soon.

davidmudrauskas commented 9 months ago

I have a decent idea going in a branch I'll push soon once I get pre-commit hooks resolved, then maybe I can get some feedback. Let me know and I can adjust the pace too.

zaneselvans commented 7 months ago

If we do get a generic CSV extractor set up, all of the FERC-714 data from 2020 and earlier is stored as CSVs, and it could be applied there too.

zaneselvans commented 5 months ago

Responding to this question from #3264 here since I think it's more general to our integration of the EIA's gas data:

EIA 176 zipfiles also bundle a few other forms - 191 and 757. Where do we want to extract and process these datasets? As separate modules, or as part of the EIA 176 extraction?

I wouldn't interpret the "EIA-176" label narrowly. There's nothing particularly special about that form, and as noted in #2603 initially we thought that all of the data in this bulk download zipfile was EIA-176, but it turns out there were these other associated forms related to gas production, storage, etc. IIRC in my initial digging around, it seemed like some of it looked usable, and some of it was both very messy and only had small number of years of data included. Almost like they accidentally dumped the data in this CSV once and then never looked at it again (which means the data is somewhere else, but... where?).

Rather than focusing on the forms in particular, I'd try and identify the subset of data that we've got archived and ready to process which are actually worth cleaning up and turning into tables -- like there's a significant amount of data and it's a tractable problem without a huge investment of manual effort. Or at least prioritize them in terms of "person-hours per unit of data integrated"

~Another thing that we should look at is how the data available from this obscure zipfile compares / relates to the bulk natural gas API data available from EIA~ Edit I see @davidmudrauskas already took a stab at this above and no dice!

My guess is that like the EIA-860 and EIA-923 spreadsheets, the data in this zipfile is (part of) what gets fed into the bulk data and more polished monthly/annual gas reports they publish, but that there's other data coming from other forms too, and the bulk API data and glossy reports probably do not reflect the full detail of the data that's in the original submitted form responses (whatever form they might take). I think we're probably looking for the long historical record of inputs that go into making their natural gas data products. Some of which looks like it goes back (from somewhere) as far as the 1970s.

The EIA Natural Gas Annual Report refers to forms EIA-176, EIA-895, and EIA-910 in Appendix A: "summary of data collection operations and report methodology" though weirdly EIA-895 doesn't show up on their big list of forms.

I assume that the the EIA-757 (Natural Gas Processing Plant Survey) will probably relate to some of the same pipeline infrastructure that's reported in PHMSA. Compressor stations, NGL precipitators, H2S removal facilities. I assume linking the facilities and their owners/operators between the EIA and PHMSA data will be another entity matching circus that we probably don't want to get into now, but since we're also working on the PHMSA pipelines extractions now, being able to care that data and EIA-757 at the same time might help us understand how they relate and map out a plan for future integrations.

Similarly I'd guess that the EIA-191 (monthly underground natural gas storage report) and EIA-191L (monthly LNG storage report) will have some relationship to the gas storage facility data that comes out of PHMSA, so those would be good to look at in tandem too, but I don't think we've gotten to the gas storage facility data in PHMSA yet, so maybe we don't prioritize EIA-191 yet if we have to choose.

e-belfer commented 5 months ago

One alternative method to access 176/191/757 data that may or may not work:

Excel: It appears from inspecting this site for 176 downloads that if you POST to the URL https://www.eia.gov/naturalgas/ngqs/data/export/xls you're able to export an Excel spreadsheet of 176 data for the years requested for one of the subsections of 176 data. It's unclear exactly what needs to be POSTed.

CSV: Similarly but more simply, GETing this URL returns all EIA 176 company data in a JSON format from 1997-2022 (www.eia.gov/naturalgas/ngqs/data/report/RP6/data/1997/2022/ICA/ID_Name). This is also true for the 191 and 757 forms on the website. Data is available for some forms through 2023. This returns both ID and Name for the years and dataset selected.

Observations about this GET endpoint behavior:

Changing RP# changes the section of data provided, corresponding to the number of the item in the list on the Natural Gas Respondent Query system download page. (RP1-9).
Changing the two years changes the data range, and each form will have different combinations (e.g., 757 is released every 3 years). We could make annual/tri-annual CSVs, or just get one file for all years available.

In short, I think there are hidden endpoints here that don't involve hand mapping LINE columns and would produce hopefully more usable CSVs for raw data integration, and I'd like to explore these a bit more before we commit to the bundled data. The CSV extractor built to handle the bundled data should still be usable here.

@davidmudrauskas I know you had mentioned maybe starting to map LINE columns as a next step, so I'd suggest just pausing on that if you've started already since there may be a possibility this isn't necessary.

zaneselvans commented 5 months ago

EIA-895 does show up on the ancient form page: https://www.eia.gov/dnav/ng/TblDefs/NG_DataSources.html#s895

zaneselvans commented 3 months ago

Notes on reviewing the freshly extracted data from #3402:

EIA-176

Across these 3 forms, EIA-176 has about 900k or 95% of the records. By far the largest of these gas datasets.
It's also the least tidy in its extracted form. Each record is associated with a line number, analogous to the way the old FERC DBF files are organized, so there will need to be a bunch of reshaping to bring the table into a tidy / normalized form, with each record representing a single observation, and each column a separate variable.
area field should be made into a categorical that maps to our political jurisdictions table and the standard state abbreviations. It looks like there will be a couple of new values like "Federal Gulf of Mexico" that need to be created.
Need a coding table for atype. What do those codes stand for? Looks like it indicates what type of quantity is being reported in the record...
- VL => Volume (unit is ???)
- CT => Count (e.g. number of customers)
- CS => Sales Revenue (USD presumably?)
- YA => year end storage capacity (some kind of volume?)
company => company_name
Note that the company field also includes lots of "total" values. We'll probably want to remove these to avoid duplicate reporting of various quantities. If we can identify what groups of records they correspond to, they may be useful in validating the data for internal self-consistency. However, this kind of validation can be very tedious if it's hard to identify the row groups programmatically.
What other universe of IDs does the id column refer to, and can it actually be used with those foreign IDs?
Note that there is not a one-to-one correspondence between company and id -- almost 2/3 of the time, 2 company values are associated with a single id value. In the other direction it's much less unique. The same company name is frequently associated with many different IDs.
In the cases where a single id maps to 2 different company values, it seems to be that one of them is a record describing an individual company, and the other is a record for "total of all companies" So maybe it's just what happens when there's only a single company operating in the state? So the same ID ends up associated with both the company's record and the state / area total record?

n_co = eia176.groupby("id").company.transform("nunique")
eia176.loc[n_co == 2, ["company", "id"]].drop_duplicates().sort_values("id")

item appears to be a description that indicates the meaning of the line field, but there are 85 different item values, and only 58 different line values, so there can't be a one-to-one mapping. Need to figure out how we can infer the variable being described in each record for reshaping.
What does the itemsort field mean and do we need it? It seems closely related to the line field.
What are the units of all the different quantities being reported? The value column is extremely non-homogeneous, and no indication of units are given in the data. Seems like the atype column is the strongest indicator we have.

EIA-191

Only 45k records here.
Appears to include LNG facilities as well as gaseous storage, so maybe the EIA-191L is included here.
year and report_year appear to be redundant, but report_year shows up as a string for some reason.
Need some column definitions and units. What is base_case
OMFG is it possible that the id values in this table match the id values in the EIA-176? That would be amazing.
How consistent are the field names across time? Should we normalize these strings and the company names?
Need to construct a report_date column for this table with monthly resolution using our existing tooling.
Need coding tables (or possibly ENUM constraints) for: field_type, reservoir_code, status (of what? the field?), region.
Use state and county name to infer the FIPS code for use with geospatial data.
What do the region values correspond to? Are they composed of states? Is it census regions? Do we already have these compiled in the political subdivisions table? Do we need to create a new column that corresponids to these region codes?

EIA-757A

Only 1500 records, another order of magnitude smaller than the above.
Sadly there are no IDs here, only company names, and two different kinds of companies: owners and operators.
State and county should be used to infer FIPS / GEOID for spatial use.
Totally unclear what the units are on these columns... BTU Content, dry storage capacity, liquid storage capacity, plant capacity, plant flow
What does the zipcode refer to? Is that of the plant location? The company location?
Can we link the owner + operator companies to their analogs in the PHMSA data?
Why are there so few records in this dataset? Are there really only a couple hundred gas processing plants? Or are there a bunch more which aren't subject to EIA's reporting requirements.
The BTU content varies over a huge range. Too far for it to be simple natural gas heat content per uniform unit. So either there are very different kinds of gasses being reported, or different units being used.
Plant Capacity also varies over a couple of orders of magnitude, with most records clustered at the bottom of the spectrum. Might be another units issue? Or is there really a factor of 100x difference between the smallest and largest plants?
Structurally this table doesn't seem like it needs much if any reshaping. It's mostly going to be figuring out what the units are, what the meanings of the columns are, and massaging it to be linkable to other datasets (FIPS, company & plant names, etc)
For some reason there is only data from 2012, 2014, and 2017. Does this form not have a regular reporting schedule?
Significant dropoff in number of records in the 2017 data. Why no data for the last 7 years?
Looking at the form instructions it seems that the data is updated at most every 3 years, with no minimum required reporting frequency? Seems wacked: "Complete Schedule A if we request for this information. We will collect Schedule A reports no more than once every three years ." Here's the announcement of the form when it was first released in 2012.

Entities

Several kinds of entities are being referenced in these tables, and could potentially be pulled out into their own separate tables and linked to the data via FKs:

Companies of several types, maybe with overlapping membership in the different categories: Gas suppliers, storage facility operators, processing plant owners and operators.
Storage facilities
Reservoirs (how are these related to / distinct from storage facilities?)
Processing Plants

e-belfer commented 3 months ago

What does the itemsort field mean and do we need it? It seems closely related to the line field.

This is a short code for the line numbers on the form, with some lines referring to a combination of two rows (e.g. [10.1 + 11.1]). See p.16 of the the NGQS guide.

year and report_year appear to be redundant, but report_year shows up as a string for some reason.

We added report_year manually in the extraction because year wasn't reported in EIA 176 data but was determined by the URL call made during extraction - we should drop it for this dataset.

In terms of the lines and definitions, we'll want to refer to the NGQS guide - see Appendix A for definitions of all the atype and line and item mappings, though I was hoping we wouldn't have to do this manually with all the fields.

e-belfer commented 3 months ago

Some observations on EIA 176 IDs:

All start with "176"
All end with the state of the report - we can probably split this string to separate the company's EIA ID and the state of the report, which is already included in the state column.
When looking at the ID without the state code, there are 1924 companies with one ID of 2145. Some company names, such as "ADJUSTMENT COMPANY" turn up several or many IDs. Need to investigate what might explain this (common company name, changing ID over time, different states of operation, etc.).
Do not appear to simply overlap with EIA 191 IDs

e-belfer commented 3 months ago

This is out of date given current data availability, so I'm moving it out of the PR description and archiving it below:

Proposed table design

core_eia176__yearly_companies

Priority: High Source of raw data: Bundled (hopefully)

This table will include data on the company filling out the form, including their address, company characteristics and distribution territory. This will include Parts 1, 3, and 7 from EIA Form 176. The primary key should be EIA ID number, year, and state. Some of these characteristics could get harvested across states or years (e.g., company address) and characteristics, but this seems somewhat low-value add for now.

core_eia176__yearly_sources_dispositions

Priority: High Source of raw data: Bundled (hopefully)

This table will include data on a company's natural gas sources and dispositions, with each row representing one state's data. This will include Part 4 and Part 6 of Form EIA 176. This will give us information on international and cross-state gas transfers, numbers of residential, commercial, industrial etc end-use consumers receiving natural gas, and volumes and revenues associated with these dispositions. The primary key will be EIA ID number, year and state.

Ideally if a footnote in Part 7B is included, it should be attached in a column to the relevant data it refers to.

core_eia__yearly_liquefied_natural_gas_storage

Priority: medium Source of data: ?? + Bundled

This table will include EIA 191L data (see above) and Part 5 of the EIA 176 data. The primary key should be EIA ID number, year, and facility. The data included from EIA 191L should be from the end of December, and the volume and capacity should be directly comparable in the table so they can be validated against one another. 191L data does not seem to be included in all_data_191.csv so it would need to be obtained through a separate source. As a first step, we could just include EIA 176 data for now.

Ideally if a footnote in 176 Part 7B is included, it should be attached in a column to the relevant data it refers to.

core_eia191__monthly_underground_storage

Priority: medium Source of data: Bundled

This table will include EIA 191 data. The primary key should be EIA ID number, year-month, and facility (field_name), with each row representing one facility's report in a given month. The EIA 191 form is released monthly. all_data_191.csv looks like it has actual column names that correspond to the form, so we won't have to deal with the frustrating LINE renames here.

core_eia191__monthly_liquefied_natural_gas_storage

Priority: medium Source of data: ??

This table will include EIA 191L data, which isn't part of the bulk electricity zipfile as far as I can tell. The primary key should be EIA ID number, year-month, and facility, with each row representing one facility's report in a given month.

core_eia757__{freq}_processing_plants

Priority: medium Source of data: Bundled (incomplete) or ??

This table will include data from EIA Form 757 Schedule A. Schedule A, the Baseline Report is filled out no more often than every 3 years, and includes data on capacity, status, operations, and connecting infrastructure of natural gas processing plants. all_data_757.csv seems to only contain Part 1 and Part 5 of the report and does not include EIA ID, so we'd need to track down a downloadable form of the remaining table if we wanted to include it, which would probably involve using the API data. The primary key should be EIA ID, date of filing, and some identifying combination of plant fields (name and address, e.g.), with each row representing one plant's report.

core_eia757__{freq}_processing_disruptions

Priority: medium Source of data: ??

This table will include data from EIA Form 757 Schedule B. Schedule B monitors post-emergency natural gas processing plans operational statuses. all_data_757.csv does not contain this data so we'd need to track down a downloadable form of the data if we wanted to include it, which would probably involve using the API data. Primary key should be EIA ID, date of filing, and some combination of plant identifiers as required, with each row representing one plant's report.

zaneselvans commented 3 months ago

What tables are we sure we'll need?

Data Tables

core_eia176__yearly_natural_gas_sources_dispositions A tidy / normalized version of everything in raw_eia176__data that seems to be reported on an (ID, year, area) basis. I can imagine this table being broken out into separate thematic tables if the reshaping / normalization is simpler that way because the categorical values (e.g. customer type) don't apply to all the columns we've got after tidying. I think it'll be easier to identify the natural normalization of this data once it's been reshaped, so we should check in again at that point.
raw_eia191__data seems like it only contains core_eia191__monthly_natural_gas_storage
raw_eia757a__data seems like it only contains core_eia757a__natural_gas_processing_plant (not really yearly... irregular? But it's always a year)

Entity Tables

For most of our other datasets, we've called respondents "utilities" but it's not clear that applies here. What entity name do we want to use? respondent? company? Form respondent seems like the most generic and generally applicable but isn't very descriptive.

If there isn't much information associated with a respondent, and it only appears in a single table, it's probably not worth breaking an entity table out. E.g. in the case of the eia176 table, it seems like all we have is a respondent ID and name, and the IDs don't show up in the 191 data or anywhere else that we know of yet, so leaving the IDs & names in that table even if there's some duplication doesn't seem so terrible.

core_eia176__{entity|yearly}_respondents
core_eia191__{entity|yearly}_respondents
core_eia191__{entity|yearly}_geologic_reservoirs (Just saying _reservoirs feels pretty vague)
core_eia191__{entity_yearly}_gas_storage_fields (Just having _fields feels too vague)
core_eia757a__{entity|yearly}_operators (maybe operators and owners are actually both just respondents?
core_eia757a__{entity|yearly}_owners

denimalpaca commented 1 month ago

I've been looking at the EIA-191 data and the additions to it mentioned here and feel I can take on that work. Looks like the PK is defined already, and the columns that may require an ENUM or FK only have a few unique values there. One open question here for me is which, if any, fields need to be normalized and how (saw some mention of this in a previous comment).

zaneselvans commented 1 month ago

I haven't looked at it in a bit, but it seemed like there was probably one one real data table to be made in the EIA-191, and that as of yet there wasn't much benefit to stripping out entity (respondent) fields to make a separate more normalized table. So I'd probably try making a single core table for now.