The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
PHMSA distribution data explains the extent, safety record, and characteristics of each operator's distribution system by state and commodity group. The first goal for PHMSA data integration should be to create cleaned and normalized tables from the PHMSA distribution data (likely 3-4 separate tables). Unlike transmission, distribution is all published in one form. The raw data is archived here, and each zip file contains the PDF form that is filled out, which will be helpful for visualizing data structure.
Prep
Before getting into the table transforms, let's update the data itself.
### Tasks
- [x] Update DOI in `pudl.workspace.datastore` to latest archive
- [x] In `pudl.metadata.sources.SOURCES`, update the working_partitions to reflect the years, months, or quarters of data that are available
- [x] Update the partitions of data to be processed in the `etl_full.yml` and `etl_fast.yml` settings files stored under `src/pudl/package_data/settings`
- [x] Use the `pudl_datastore` script to download the new raw data
- [x] Run the extraction on the new data by generating `raw_phmsa` assets and debug any issues
- [x] Create new module: `transform.phmsa` to handle transforms
Table design
All form parts below are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables. The original files (CSV and Excel) are not split by form part.
Each report corresponds to 1 state and one commodity group. Further parts of the table have dozens of columns that we should tidy into one categorical column, creating different PKs within the parts of the form. We should try to structure our tables to match these PKs as much as possible.
core_phmsagas__yearly_distribution_operators
Part A: Operator Information
Essentially just the operator ID, name, address, and info on the state and commodity pertinent to the report.
Part D-I
Each field occurs once per report, these should probably be kept in one table with Part A.
Part D, Excavation Damage: 6 fields of summary stats
Part E, EFV and Service Valve Data: 4 fields of summary stats
Part F: Total number of leaks on federal lands scheduled for repair or repaired.
Part G: %age of unaccounted gas.
Part H: Additional Info: a Notes field, discusses corrections, changes in calculations, and other ambiguities.
Part I: Preparer's info: Email, initial or supplemental report, contact info for preparer.
### Operator Table Tasks
- [x] Standardize NAs
- [x] Standardize usage of office vs HQ address over time.
- [x] Standardize case for city, county, operator name, etc.
- [x] Standardize telephone and fax number format and drop (000)-000-0000
- [x] Make into dagster asset
- [ ] Add `phmsagas` module to `transform.init`
- [ ] Add `core_phmsagas` asset group to `etl.init`
- [ ] Check whether new columns exist in `pudl.metadata.fields.FIELD_METADATA`. If they do, make sure the descriptions and data types match. If the descriptions don’t match, you may need to define that column by source: `pudl.metadata.fields.FIELD_METADATA_BY_GROUP` or by table: `pudl.metadata.fields.FIELD_METADATA_BY_RESOURCE`. If the column is not in `pudl.metadata.fields.FIELD_METADATA`, add it. Each field needs a data type and description, and may also need units or constraints.
- [ ] Update any `pudl.metadata.codes`, `pudl.metadata.labels`, or `pudl.metadata.enums` pertaining to new or existing columns. For instance, commodity group should be encoded as a categorical with a corresponding encoding table, and state columns should be constrained by the existing `pudl.metadata.enums.US_STATES_TERRITORIES`.
- [ ] Define PK (probably operator ID, state, commodity group, report year and report submission type). Check for uniqueness.
- [ ] Add new columns and tables to the `RESOURCE_METADATA` dictionaries in the appropriate `pudl.metadata.resources` modules.
- [ ] Create alembic migration
- [ ] Validate: do sub-components sum to totals?
Miles of main and services by decade of installation. (This has a different categorical and might want to be its own table)
Each piece of this has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.
### Tasks
- [ ] Collapse the materials and sizes into categorical columns
- [ ] Follow the steps above
core_phmsa__yearly_distribution_leaks_and_repairs
Part C: Leaks and Repairs
Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"
### Tasks
- [ ] Collapse cause of leak and type of pipe into categoricals
- [ ] Follow the steps above
General Known PHMSA Cleaning Steps
Convert 2-digit report_year into 4-digit years (pre 2000).
Standardize report_state to use either shorthand or full state name
Ideally, adapt existing wide_to_tidy infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).
Standardize the multiple treatments of time noted in #3277 (filing date, data date, revision date)
Handle different aggregations of reporting over time for each form section (e.g., 1 form per state, one form per system)
Where granularity increases over time (e.g. onshore becomes onshore types A, B, C), aggregate these increasingly disaggregated columns back to have comparable totals over time.
Deal with extremely varying telephone formats
Standardize use of office vs. HQ addresses over time, and do general address cleaning
PHMSA distribution data (1990-present)
PHMSA distribution data explains the extent, safety record, and characteristics of each operator's distribution system by state and commodity group. The first goal for PHMSA data integration should be to create cleaned and normalized tables from the PHMSA distribution data (likely 3-4 separate tables). Unlike transmission, distribution is all published in one form. The raw data is archived here, and each zip file contains the PDF form that is filled out, which will be helpful for visualizing data structure.
Prep
Before getting into the table transforms, let's update the data itself.
Table design
core_phmsagas__yearly_distribution_operators
Part A: Operator Information
Part D-I
core_phmsagas__yearly_distribution_main_and_services
Part B: System Description
Each piece of this has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.
core_phmsa__yearly_distribution_leaks_and_repairs
Part C: Leaks and Repairs Table is organized by cause of leak and type of pipe. There are two additional fields at the bottom for "known leaks scheduled for repairs" and "leaks involving mechanical joint failure"
General Known PHMSA Cleaning Steps
report_year
into 4-digit years (pre 2000).report_state
to use either shorthand or full state namewide_to_tidy
infrastructure to drastically collapse tables using categoricals (e.g., a column for "location" that includes onshore, offshore, total rather than 3x the columns).Out of scope