Closed cmgosnell closed 3 years ago
I think we might want to add an _early_transform()
function to the eia860
and eia923
transform modules like we created for eia861
which applies transformations to all the EIA dataframes. Or I guess it would be an EIA level helper function that's shared across all of those modules. It could run fix_eia_na
and convert_to_date
and this new fix_generator_id
function, before any of the other transform functions happen on a table by table basis. This would also get rid of some duplication in the existing transform functions.
Describe the bug
Ahhh. There are leading zeros in generator IDs. They appear for some generators for some years, from some tables.
I expect this is happening because of the data types in the excel sheets and/or on import. Excel will strip leading zeros of cells formatted as numbers, but will not do this when they are not formatted as numbers.
pd.read_excel
assumes the column is a mix of data types.Bug Severity
How badly is this bug affecting you?
There are a few thousand generator records with this issue out of 300 thousand. So this is effecting a small % of overall generators.
Expected behavior
Generators should have consistent IDs regardless of what table the original data came from (or what year the data came from). We should fix this within the transform step. And it should be applied uniformly across all EIA tables before the harvesting/normalization happens. I'm thinking it should be applied within the dtype helper function... because all tables get run through that process. The only issue there is that this would be adding another special column cleaning to that function.
Sample fix from @zaneselvans :
Some things to safeguard against:
1
and001
. (whyyy? but why not ya know)