catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Leading zeros in generator IDs #964

Closed cmgosnell closed 3 years ago

cmgosnell commented 3 years ago

Describe the bug

Ahhh. There are leading zeros in generator IDs. They appear for some generators for some years, from some tables.

image

I expect this is happening because of the data types in the excel sheets and/or on import. Excel will strip leading zeros of cells formatted as numbers, but will not do this when they are not formatted as numbers. pd.read_excel assumes the column is a mix of data types.

Bug Severity

How badly is this bug affecting you?

There are a few thousand generator records with this issue out of 300 thousand. So this is effecting a small % of overall generators.

Expected behavior

Generators should have consistent IDs regardless of what table the original data came from (or what year the data came from). We should fix this within the transform step. And it should be applied uniformly across all EIA tables before the harvesting/normalization happens. I'm thinking it should be applied within the dtype helper function... because all tables get run through that process. The only issue there is that this would be adding another special column cleaning to that function.

Sample fix from @zaneselvans :

# Remove ANY leading zeroes, even when there are letters in the generator_id
#gens["fixed_id"] = gens.generator_id.apply(lambda x: re.sub("^0+", "", x))
# Remove ONLY leading zeroes followed exclusively by digits in the generator_id
gens["fixed_id"] = gens.generator_id.apply(lambda x: re.sub(r'^0+(\d+$)', r'\1', x))

# Dataframe with all records we fixed:
fixed_gens = gens.loc[gens.generator_id != gens.fixed_id]
logger.info(f"Fixed leading-zero generator_id in {len(fixed_gens)} records.")

Some things to safeguard against:

zaneselvans commented 3 years ago

I think we might want to add an _early_transform() function to the eia860 and eia923 transform modules like we created for eia861 which applies transformations to all the EIA dataframes. Or I guess it would be an EIA level helper function that's shared across all of those modules. It could run fix_eia_na and convert_to_date and this new fix_generator_id function, before any of the other transform functions happen on a table by table basis. This would also get rid of some duplication in the existing transform functions.