catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars 107 forks source link

Integrate PHMSA transmission and distribution data #2848

Open e-belfer opened 1 year ago

e-belfer commented 1 year ago

Scope of PR:

Produce PHMSA assets in dagster from the transmission and distribution tables, mirroring the format of the most recent year's form.

Design Notes

Transmission Data (1970-present)

Table design

All form parts are based on the 2021 form. Form part letters change over time, so this will require pairing older fields from .xlsx files to their correct tables.

Parts A-D: core_phmsa__yearly_transmission_summary_by_commodity

Parts F-G: core_phmsa__yearly_inspections_and_assessments

Part H: core_phmsa__yearly_miles_of_transmission_pipe_by_nps

Part I: core_phmsa__yearly_miles_of_gathering_pipe_by_nps

Part J: core_phmsa__yearly_miles_of_pipe_by_decade_installed

Part K: core_phmsa__yearly_miles_of_transmission_pipe_by_specified_minimum_yield_strength

Part L: core_phmsa__yearly_miles_of_pipe_by_class_location

Part M: core_phmsa__yearly_failures_leaks_repairs

Part P: core_phmsa__yearly_miles_of_pipe_by_material

Part Q: core_phmsa__yearly_gas_transmission_miles_by_maop_determination_method

Part R: core_phmsa__yearly_gas_transmission_miles_by_pt_range_and_internal_inspection

Part S: core_phmsa__yearly_transmission_materials_verification

Part T: core_phmsa__yearly_transmission_hca_miles_by_determination_method_and_risk_model

Misc.

Distribution Data (1970-present)

Table design

Part A: Operator Information

core_phmsagas__yearly_distribution_main_and_services

Part B: System Description

  1. Summary stats on miles and services by material.
  2. Miles of main by material and size of pipe.
  3. Miles of services by material and size of pipe.
  4. Miles of main and services by decade of installation.
    • Each part is a table that has a bunch of different categoricals (e.g. pipe size) that will want to get tidied, changing the PK of the table.

core_phmsa__yearly_distribution_leaks_and_repairs

Part C: Leaks and Repairs

Part D-I

### Adapt infrastructure to handle PHMSA partitions
- [ ] https://github.com/catalyst-cooperative/pudl/issues/2850
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/issues/199
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/pull/252
- [ ] https://github.com/catalyst-cooperative/pudl-archiver/pull/253
### Extraction into raw assets
- [ ] https://github.com/catalyst-cooperative/pudl/pull/2932
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3241
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3243
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3248
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3253
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3263
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3259
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3278
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3275
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3268
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3261
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3265
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3277
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3290

After raw assets are extracted, we will have to define a core set of transformations for the PHMSA data. For each table, this could include: defining all columns with a datatype, transforming columns into categoricals to reduce the width of the table, defining primary keys, standardizing NAs.

Known cleaning steps:

### First round of cleaning into core assets
- [ ] https://github.com/catalyst-cooperative/pudl/issues/3770
- [ ] Transmission A-D
- [ ] Transmission F-G
- [ ] Transmission H
- [ ] Transmission I
- [ ] Transmission J
- [ ] Transmission K
- [ ] Transmission L
- [ ] Transmission M
- [ ] Transmission N-O
- [ ] Transmission P
- [ ] Transmission Q
- [ ] Transmission R
- [ ] Transmission S
- [ ] Transmission T
e-belfer commented 12 months ago

2017-2021 Excel and CSV comparison: