catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Fix EIA boiler IDs that have inconsistenly reported leading zeros #2366

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

This issue is analogous to the problem we found and addressed in #964 and #968, but applied to boiler IDs rather than generator IDs.

With the integration of the extensive EIA-860 boiler attributes in #2305 and #2364 we now have many more examples of boiler ID values, and some of them have inconsistently reported leading zeroes. Boiler IDs are strings, but many of them are purely numerical strings, and all this data goes through Excel, so this kind of thing doesn't get treated uniformly across all the years of data.

bids = pd.read_sql("SELECT plant_id_eia, boiler_id, report_date FROM boilers_eia860", pudl_engine)
bids["boiler_id_orig"] = bids.boiler_id.copy()
fixed_bids = pudl.helpers.remove_leading_zeros_from_numeric_strings(bids, col_name="boiler_id")
fixed_bids["n_ids"] = fixed_bids.groupby(["plant_id_eia", "boiler_id"]).boiler_id_orig.transform("nunique")
fixed_bids = fixed_bids[fixed_bids.n_ids>1].sort_values(["plant_id_eia", "boiler_id"])
fixed_bids = fixed_bids.groupby(["plant_id_eia", "boiler_id"]).boiler_id_orig.apply(set).to_frame()
fixed_bids
All Inconsistent Boiler IDs | (plant_id_eia, boiler_id) | boiler_id_orig | |:---------------|:-----------------| | (202, '1') | {'1', '01'} | | (203, '1') | {'1', '01'} | | (861, '1') | {'1', '01'} | | (861, '2') | {'2', '02'} | | (863, '5') | {'05', '5'} | | (863, '6') | {'06', '6'} | | (864, '1') | {'1', '01'} | | (864, '2') | {'2', '02'} | | (864, '3') | {'03', '3'} | | (864, '4') | {'4', '04'} | | (864, '5') | {'05', '5'} | | (864, '6') | {'06', '6'} | | (2491, '1') | {'001', '1'} | | (6013, '1') | {'1', '01'} | | (6013, '2') | {'2', '02'} | | (10502, '2') | {'2', '02'} | | (10603, '31') | {'031', '31'} | | (50561, '1') | {'1', '0001'} | | (50561, '2') | {'2', '0002'} | | (50611, '31') | {'031', '31'} | | (50739, '100') | {'0100', '100'} | | (54526, '1') | {'1', '00001'} | | (54529, '1') | {'1', '0001'} | | (55210, '1') | {'1', '0001'} | | (55596, '1') | {'1', '0001'} | | (55596, '2') | {'2', '0002'} | | (55596, '3') | {'0003', '3'} | | (55596, '5') | {'5', '0005'} | | (56188, '1') | {'001', '1'} | | (56221, '31') | {'031', '31'} | | (56453, '31') | {'031', '31'} | | (57705, '1') | {'1', '000001'} | | (58574, '1') | {'1', '01'} | | (60903, '1') | {'1', '0001'} | | (60903, '2') | {'2', '0002'} |

Hopefully addressing this issue is as simple as treating Boiler IDs the same way we treat Generator IDs in the EIA spreadsheet extractions, adding a call to pudl.helpers.remove_leading_zeros_from_numeric_strings() to each of the extractor classes. Based on where we're doing this for the generator_id values, it looks like this fix may impact the following datasets:

I think most of those will go pretty smoothly, but I'm concerned about how this might interact with the EPA CAMD to EIA crosswalk.

zaneselvans commented 1 year ago

As feared, this does have an interaction with the EPA CAMD crosswalk, rendering these boiler_id values invalid, so we'll need to modify them as well.

 [
    {'resource': 'epacamd_eia',
    'foreign_key': ForeignKey(fields=['plant_id_eia', 'boiler_id'],
    reference=ForeignKeyReference(resource='boilers_entity_eia', fields=['plant_id_eia', 'boiler_id'])),
    'invalid':
plant_id_eia boiler_id
133            202        01
134            203        01
703            861        01
704            861        02
1775          2491       001
2888          6013        01
2889          6013        02
4014         10603       031
4399         50561      0001
4402         50561      0002
4409         50611       031
6355         56188       001
6820         60903      0001
6822         60903      0002}]