Open zaneselvans opened 3 years ago
I'm new around here, but my vote would be to have PUDL use the normalized field in the nicely named table, and to have an OG table that has the OG field that nobody joins to unless they really want to. If EIA fix their data, the OG table quietly disappears.
A portion of the
generator_id
values used across the EIA 923 and EIA 860 datasets are inconsistent with respect to their use of UPPER and lower case letters, meaning that some generators in different years (and potentially different tables) should be associated with each other, but currently are not. It may also be the case that non-alphanumeric characters are having the same impact.For example, if the following values of
generator_id
are found in association with the sameplant_id_eia
should they actually be treated differently? Or would we assume they are all referring to the same generator?It turns out that there are a small number of generator ID values that have all of these problems (differentiated only by whitespace, by other non-alphanumeric characters, or by capitalization). To see what these look like...
The obvious fix (akin to things we've done in other columns with inconsistently reported categorical values) is to apply some standardized formatting to the value in the columns. E.g. stripping out all non-alphanumeric characters and translating everything to UPPER CASE.
HOWEVER doing so in this case would make our IDs generally incompatible with the IDs being used by others to refer to the EIA data, even though it would be internally more consistent and easier to deal with. This would be a major inconvenience to users who are trying to merge the PUDL data with other data containing EIA generator IDs.
A less invasive approach here would be to fix only the IDs which have been reported inconsistently, of which there are a few dozen. These IDs are already inconsistent / broken, and if we strip non-alphanumeric characters and UPPER CASE the letters, in most cases we'll be changing an inconsistent value to match another value that was reported. Unfortunately, identifying all of the inconsistently reported values can't be done table by table, it has to happen after the harvesting process has gathered all of the different IDs together, and then the ID translation would need to be applied to any table containing the
generator_id
column.But this in turn might introduce duplicate rows, breaking the primary key constraints. So it's not clear what the right solution is.
This may be something we try to get EIA to fix.