Leading zeros in generator IDs

Describe the bug

Ahhh. There are leading zeros in generator IDs. They appear for some generators for some years, from some tables.

I expect this is happening because of the data types in the Excel sheets and/or on import. Excel strips leading zeros from cells formatted as numbers, but keeps them in cells that are not formatted as numbers. pd.read_excel then treats the column as a mix of data types.
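A quick way to check whether a column really did come in with mixed types is to count the Python types of its values. This is only a sketch on made-up data: the sample IDs below are hypothetical, not real EIA records. Reading the column as text at import time (for example with the `dtype` argument of `pd.read_excel`) may avoid the mixed types, but it cannot restore zeros that Excel already dropped from numeric cells.

```python
import pandas as pd

# Hypothetical sample mimicking a mixed-format Excel column after import:
# some cells arrive as integers, others as text.
gens = pd.DataFrame({"generator_id": [1, "001", "GT1", 2, "02"]})

# Count how many values of each Python type the column holds.
type_counts = gens["generator_id"].map(lambda v: type(v).__name__).value_counts()
print(type_counts)

# Casting to string makes the IDs comparable, but "1" and "001" still differ.
gens["generator_id"] = gens["generator_id"].astype(str)
print(gens["generator_id"].tolist())
```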

Bug Severity

How badly is this bug affecting you?

Medium: With some effort, I can work around the bug.

There are a few thousand generator records with this issue out of 300 thousand, so this is affecting a small percentage of generators overall.

Expected behavior

Generators should have consistent IDs regardless of which table (or which year) the original data came from. We should fix this within the transform step, and the fix should be applied uniformly across all EIA tables before the harvesting/normalization happens. I'm thinking it should be applied within the dtype helper function... because all tables get run through that process. The only issue there is that this would add another special-case column cleaning step to that function.

Sample fix from @zaneselvans :

import re

# Remove ANY leading zeroes, even when there are letters in the generator_id:
# gens["fixed_id"] = gens.generator_id.apply(lambda x: re.sub(r"^0+", "", x))
# Remove ONLY leading zeroes followed exclusively by digits in the generator_id:
gens["fixed_id"] = gens.generator_id.apply(lambda x: re.sub(r"^0+(\d+$)", r"\1", x))

# Dataframe with all records we fixed:
fixed_gens = gens.loc[gens.generator_id != gens.fixed_id]
logger.info(f"Fixed leading-zero generator_id in {len(fixed_gens)} records.")
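For what it's worth, the same regex could be wrapped in a small function so it can be called from whichever shared cleaning step ends up owning it. This is a rough sketch, not existing PUDL code: `strip_leading_zeros` is a hypothetical name and the sample IDs are made up. It uses the vectorized `.str.replace` instead of `apply` so that missing values pass through rather than raising inside `re.sub`.

```python
import pandas as pd


def strip_leading_zeros(ids: pd.Series) -> pd.Series:
    """Strip leading zeros from all-digit IDs and leave other IDs untouched.

    Hypothetical helper, named here only for illustration.
    """
    # The nullable string dtype keeps missing values missing.
    ids = ids.astype("string")
    # Same pattern as the sample fix: zeros followed only by digits.
    return ids.str.replace(r"^0+(\d+$)", r"\1", regex=True)


# Made-up IDs covering the interesting cases.
gens = pd.DataFrame(
    {"generator_id": ["001", "1", "0A1", "GT1", "000", "10", None]}
)
gens["fixed_id"] = strip_leading_zeros(gens["generator_id"])
print(gens)
```

Because the pattern requires at least one digit after the stripped zeros, an ID made up entirely of zeros collapses to a single "0" rather than an empty string, and IDs containing letters (like "0A1") are not modified.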

Some things to safeguard against:

- Plants which truly have both a generator ID 1 and a generator ID 001. (whyyy? but why not ya know)
- Are there leading-zero issues only with numeric-like IDs, or also with IDs that contain letters?
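Both safeguards could be checked on the data before committing to the fix. The sketch below is an assumption-heavy illustration: the `plant_id_eia` and `report_year` column names and all sample rows are hypothetical. It groups by plant and year because the same generator showing up as 1 in one year and 001 in another is the bug itself, while both appearing for the same plant in the same year would suggest two genuinely distinct generators.

```python
import pandas as pd

# Made-up records; column names other than generator_id are assumptions.
gens = pd.DataFrame(
    {
        "plant_id_eia": [10, 10, 20, 20, 30],
        "report_year": [2019, 2019, 2018, 2019, 2019],
        "generator_id": ["1", "001", "02", "2", "0A1"],
    }
)
gens["fixed_id"] = gens["generator_id"].str.replace(
    r"^0+(\d+$)", r"\1", regex=True
)

# Safeguard 1: within one plant and year, do several distinct raw IDs
# collapse onto the same fixed ID? Those would be merged by the fix.
n_raw = gens.groupby(["plant_id_eia", "report_year", "fixed_id"])[
    "generator_id"
].nunique()
collisions = n_raw[n_raw > 1]
print(collisions)

# Safeguard 2: which IDs start with a zero but were left alone because
# they contain letters?
has_leading_zero = gens["generator_id"].str.startswith("0")
unchanged = gens["generator_id"] == gens["fixed_id"]
untouched = gens.loc[has_leading_zero & unchanged, "generator_id"]
print(untouched.tolist())
```

If `collisions` comes back non-empty on the real data, those plant-years probably need a manual look before the IDs are rewritten.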

catalyst-cooperative / pudl