catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
478 stars 110 forks source link

Integrate PHMSA distribution data into PUDL #3770

Open e-belfer opened 3 months ago

e-belfer commented 3 months ago

PHMSA distribution data (1990-present)

PHMSA distribution data explains the extent, safety record, and characteristics of each operator's distribution system by state and commodity group. The first goal for PHMSA data integration should be to create cleaned and normalized tables from the PHMSA distribution data (likely 3-4 separate tables). Unlike transmission, distribution is all published in one form. The raw data is archived here, and each zip file contains the PDF form that is filled out, which will be helpful for visualizing data structure.

Prep

Before getting into the table transforms, let's update the data itself.

### Tasks
- [x] Update DOI in `pudl.workspace.datastore` to latest archive
- [x] In `pudl.metadata.sources.SOURCES`, update the working_partitions to reflect the years, months, or quarters of data that are available
- [x] Update the partitions of data to be processed in the `etl_full.yml` and `etl_fast.yml` settings files stored under `src/pudl/package_data/settings`
- [x] Use the `pudl_datastore` script to download the new raw data
- [x] Run the extraction on the new data by generating `raw_phmsa` assets and debug any issues
- [x] Create new module: `transform.phmsa` to handle transforms

Table design

core_phmsagas__yearly_distribution_operators

Part A: Operator Information

Part D-I

### Operator Table Tasks
- [x] Standardize NAs
- [x] Standardize usage of office vs HQ address over time.
- [x] Standardize case for city, county, operator name, etc.
- [x] Standardize telephone and fax number format and drop (000)-000-0000
- [x] Make into dagster asset
- [ ] Add `phmsagas` module to `transform.init`
- [ ] Add `core_phmsagas` asset group to `etl.init`
- [ ] Check whether new columns exist in `pudl.metadata.fields.FIELD_METADATA`. If they do, make sure the descriptions and data types match. If the descriptions don’t match, you may need to define that column by source: `pudl.metadata.fields.FIELD_METADATA_BY_GROUP` or by table: `pudl.metadata.fields.FIELD_METADATA_BY_RESOURCE`. If the column is not in `pudl.metadata.fields.FIELD_METADATA`, add it. Each field needs a data type and description, and may also need units or constraints.
- [ ] Update any `pudl.metadata.codes`, `pudl.metadata.labels`, or `pudl.metadata.enums` pertaining to new or existing columns. For instance, commodity group should be encoded as a categorical with a corresponding encoding table, and state columns should be constrained by the existing `pudl.metadata.enums.US_STATES_TERRITORIES`.
- [ ] Define PK (probably operator ID, state, commodity group, report year and report submission type). Check for uniqueness.
- [ ] Add new columns and tables to the `RESOURCE_METADATA` dictionaries in the appropriate `pudl.metadata.resources` modules.
- [ ] Create alembic migration
- [ ] Validate: do sub-components sum to totals?

core_phmsagas__yearly_distribution_main_and_services

Part B: System Description

Each piece of this has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.

### Tasks
- [ ] Collapse the materials and sizes into categorical columns
- [ ] Follow the steps above

core_phmsa__yearly_distribution_leaks_and_repairs

Part C: Leaks and Repairs Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"

### Tasks
- [ ] Collapse cause of leak and type of pipe into categoricals
- [ ] Follow the steps above

General Known PHMSA Cleaning Steps

Out of scope

seeess1 commented 1 month ago

@e-belfer can you assign me to this ticket?

seeess1 commented 1 month ago

Working on this branch in my forked repo.