Closed zaneselvans closed 1 year ago
As feared, this does have an interaction with the EPA CAMD crosswalk, rendering these boiler_id
values invalid, so we'll need to modify them as well.
[
{'resource': 'epacamd_eia',
'foreign_key': ForeignKey(fields=['plant_id_eia', 'boiler_id'],
reference=ForeignKeyReference(resource='boilers_entity_eia', fields=['plant_id_eia', 'boiler_id'])),
'invalid':
plant_id_eia boiler_id
133 202 01
134 203 01
703 861 01
704 861 02
1775 2491 001
2888 6013 01
2889 6013 02
4014 10603 031
4399 50561 0001
4402 50561 0002
4409 50611 031
6355 56188 001
6820 60903 0001
6822 60903 0002}]
This issue is analogous to the problem we found and addressed in #964 and #968, but applied to boiler IDs rather than generator IDs.
With the integration of the extensive EIA-860 boiler attributes in #2305 and #2364 we now have many more examples of boiler ID values, and some of them have inconsistently reported leading zeroes. Boiler IDs are strings, but many of them are purely numerical strings, and all this data goes through Excel, so this kind of thing doesn't get treated uniformly across all the years of data.
boilers_eia860
table with inconsistent leading zeros in their IDs across all years of data.boiler_id
values which are entirely numeric (as is the case withgenerator_id
)boiler_id
values correspond to which original values, after fixing them with the same technique that we're using on thegenerator_id
values:All Inconsistent Boiler IDs
| (plant_id_eia, boiler_id) | boiler_id_orig | |:---------------|:-----------------| | (202, '1') | {'1', '01'} | | (203, '1') | {'1', '01'} | | (861, '1') | {'1', '01'} | | (861, '2') | {'2', '02'} | | (863, '5') | {'05', '5'} | | (863, '6') | {'06', '6'} | | (864, '1') | {'1', '01'} | | (864, '2') | {'2', '02'} | | (864, '3') | {'03', '3'} | | (864, '4') | {'4', '04'} | | (864, '5') | {'05', '5'} | | (864, '6') | {'06', '6'} | | (2491, '1') | {'001', '1'} | | (6013, '1') | {'1', '01'} | | (6013, '2') | {'2', '02'} | | (10502, '2') | {'2', '02'} | | (10603, '31') | {'031', '31'} | | (50561, '1') | {'1', '0001'} | | (50561, '2') | {'2', '0002'} | | (50611, '31') | {'031', '31'} | | (50739, '100') | {'0100', '100'} | | (54526, '1') | {'1', '00001'} | | (54529, '1') | {'1', '0001'} | | (55210, '1') | {'1', '0001'} | | (55596, '1') | {'1', '0001'} | | (55596, '2') | {'2', '0002'} | | (55596, '3') | {'0003', '3'} | | (55596, '5') | {'5', '0005'} | | (56188, '1') | {'001', '1'} | | (56221, '31') | {'031', '31'} | | (56453, '31') | {'031', '31'} | | (57705, '1') | {'1', '000001'} | | (58574, '1') | {'1', '01'} | | (60903, '1') | {'1', '0001'} | | (60903, '2') | {'2', '0002'} |Hopefully addressing this issue is as simple as treating Boiler IDs the same way we treat Generator IDs in the EIA spreadsheet extractions, adding a call to
pudl.helpers.remove_leading_zeros_from_numeric_strings()
to each of the extractor classes. Based on where we're doing this for thegenerator_id
values, it looks like this fix may impact the following datasets:eia860
eia860m
eia861
eia923
epacems
epacamd_eia
I think most of those will go pretty smoothly, but I'm concerned about how this might interact with the EPA CAMD to EIA crosswalk.