Closed aesharpe closed 2 months ago
Hello, I would like to work on this issue. Do I need to work directly on the test-eia-transform-encoders branch or on the main branch?
Hi @Nancy9ice I think @zaneselvans is taking a look at this, but feel free to make a suggestion. I don't think we're ready to dive in and fix anything yet - still figuring out the best design for a solution.
If you want to help out more broadly, I recommend attending our office hours. Right now it's a little tricky to jump in blind!
Okay, I understand. I just got on the interface to attend the office hours and I noticed that some questions asked are for those that want to use the PUDL probably for work purposes. If I'm just interested in contributing to the PUDL open source project, am I still eligible to attend the office hours? @aesharpe
@Nancy9ice yes, office hours are for anyone. The questions are just intended to get to know you and your intentions so we can pick the best person on our team to join the call.
I think this is potentially a pretty serious issue, since it means we aren't always feeding good categorical values into the harvesting process, and I suspect there are dozens of columns affected right now.
It seems like this problem grew out of our prior practice of not including all of the available columns for harvesting, if they were going to get dropped eventually anyway.
If every column name is encoded by at most one coding table, then we could just look at the global list of all foreign keys in the database schema, and look up the encoder (if any exists) for each column in the dataframe being processed, rather than linking it to a particular table/resource definition. But if it's ambiguous which coding table you should be looking at then this will break down (e.g. if there were a FERC `fuel_type_code` column and an EIA `fuel_type_code` column, and they were linked to different coding tables, you wouldn't know a priori which one to use based on just the column name).
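A minimal sketch of the global lookup idea above, assuming the foreign keys can be listed as (column, coding table) pairs. The function name `build_encoder_lookup` and the table names are hypothetical, not PUDL's actual API or schema. It shows how a column name that points at more than one coding table ends up ambiguous.

```python
from collections import defaultdict


def build_encoder_lookup(foreign_keys):
    """Map each column name to its coding table and collect ambiguous columns.

    foreign_keys: iterable of (column_name, coding_table_name) pairs.
    Returns (lookup, ambiguous), where lookup holds columns with exactly one
    coding table and ambiguous holds columns linked to several.
    """
    targets = defaultdict(set)
    for column, coding_table in foreign_keys:
        targets[column].add(coding_table)
    lookup = {
        col: next(iter(tables)) for col, tables in targets.items() if len(tables) == 1
    }
    ambiguous = {
        col: sorted(tables) for col, tables in targets.items() if len(tables) > 1
    }
    return lookup, ambiguous


# Hypothetical foreign keys, for illustration only.
foreign_keys = [
    ("prime_mover_code", "eia_prime_mover_codes"),
    ("ba_code", "eia_balancing_authority_codes"),
    ("fuel_type_code", "eia_fuel_type_codes"),
    ("fuel_type_code", "ferc_fuel_type_codes"),
]
lookup, ambiguous = build_encoder_lookup(foreign_keys)
```

Here `prime_mover_code` and `ba_code` resolve by name alone, while `fuel_type_code` lands in `ambiguous` because the column name does not say which coding table applies.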
a few musings:
transform.eia.finished_eia_asset_factory
... or is the encoding happening automatically to every asset that gets written to the db? so this step is unnecessary?
We should also probably develop a check that makes sure encodable columns are getting encoded.
if i understand correctly, all of the encoded columns feed into enum data types in the SQL schema and do get checked.
@zaneselvans in order to address your "which set of codes to apply" problem: because this is primarily a pre-harvesting issue (i believe) couldn't we restrict the encode-able columns to just eia? via either the `etl_group` or the `sources` or the `field_namespace`?
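One way to picture the namespace restriction suggested above, as a rough sketch: key each encoder by a (namespace, column) pair and filter to a single namespace before encoding. The registry layout, the function name `encoders_for_namespace`, and the code values are all assumptions made for illustration, not how PUDL stores its encoders.

```python
def encoders_for_namespace(encoders, namespace):
    """Keep only the encoders registered under the given namespace.

    encoders: {(namespace, column_name): set of valid codes}
    Returns {column_name: set of valid codes} for that namespace.
    """
    return {
        column: codes
        for (ns, column), codes in encoders.items()
        if ns == namespace
    }


# Hypothetical registry: the same column name appears under two namespaces.
encoders = {
    ("eia", "fuel_type_code"): {"NG", "BIT"},
    ("ferc1", "fuel_type_code"): {"gas", "coal"},
    ("eia", "prime_mover_code"): {"CT", "ST"},
}
eia_encoders = encoders_for_namespace(encoders, "eia")
```

Once the set is narrowed to one namespace, each column name maps to a single set of codes, so the lookup by name is unambiguous within EIA transforms.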
Okay, thank you. I'll look at it and fix a time. Meanwhile, I just added an item to the 'Discussion' section on a problem that I'm experiencing in setting up my development environment. I tagged you to it. Please help me check it as I need feedback as soon as possible🙏. Thank you
Describe the bug
Some of the transform functions for EIA tables (specifically in EIA923, but potentially 860 as well) are running the `encode` step using resource metadata for normalized, post-harvest tables that don't have key encodable columns in their schema. This means that encodable columns in the denormalized tables are not getting encoded before being passed on to the harvesting process. The main culprit columns getting missed by this encoding error are `prime_mover_code` and `ba_code`.
See branch `test-eia-transform-encoders` for notebook explanation / code example.
Bug Severity
Low - it's not a blocking error and the chances of there being bad prime mover or balancing authority codes are relatively low. Moreover, the harvesting process itself will probably weed them out.
However, if there's an encodable column in a table we should be making sure that it gets encoded via some sort of check rather than just relying on the person writing the transform function to pick the right table to pass to the encoder step. There is a world where the harvesting process gets messed up due to bad data bypassing the encoding step.
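The kind of check described above could look roughly like this. It is a sketch under assumptions: rows are plain dicts, each encodable column has a known set of valid codes, and `None` means missing. The function name `find_unencoded_columns` and the sample codes are hypothetical.

```python
def find_unencoded_columns(records, encoders):
    """Report encodable columns that still hold values outside their code set.

    records: list of dict rows.
    encoders: {column_name: set of valid codes}.
    Returns {column_name: sorted list of unexpected values}. None is treated
    as a missing value and is not reported.
    """
    problems = {}
    for column, valid_codes in encoders.items():
        bad = {
            row[column]
            for row in records
            if row.get(column) is not None and row[column] not in valid_codes
        }
        if bad:
            problems[column] = sorted(bad)
    return problems


# Hypothetical rows: the second one carries a lowercase, un-encoded code.
records = [
    {"plant_id_eia": 1, "prime_mover_code": "ST", "ba_code": "PJM"},
    {"plant_id_eia": 2, "prime_mover_code": "st", "ba_code": None},
]
encoders = {"prime_mover_code": {"ST", "CT"}, "ba_code": {"PJM", "MISO"}}
problems = find_unencoded_columns(records, encoders)
```

Running a check like this on every table just before harvesting would flag a skipped encoder regardless of which resource the transform function passed to the encode step.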
To Reproduce
`test-eia-transform-encoders` branch
`test_encoder_in_eia923_transform.ipynb` and read the description
`_core_eia923__generation` function encoder step to see how it changes depending on which table you pass in.
Expected behavior
Passing a table to the encoder that does not have an encodable column (such as `prime_mover_code`) in its schema means that that column will not get encoded in the `core` table.
In its current state, I expect the following encodable columns to bypass the encoding step because the table being passed into the encode function does not have them in its schema (there could be more, this is just a cursory check of eia923 -- should check eia860 too):
`_core_eia923__fuel_receipts_costs`: `ba_code`
`_core_eia923__cooling_system_information`: `ba_code`
`_core_eia923__generation_fuel`: `ba_code`
`_core_eia923__generation_fuel_nuclear`: `ba_code`
`_core_eia923__fgd_operation_maintenance`: `ba_code`
`_core_eia923__generation`: `ba_code`, `prime_mover_code`
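To make the failure mode concrete, here is a toy stand-in for a schema-driven encode step. It is not PUDL's actual encoder; the function `encode`, the schemas, and the code mapping are hypothetical. It only shows that a column missing from the schema passed in is silently left alone.

```python
def encode(records, schema_columns, encoders):
    """Apply encoders only to columns that appear in schema_columns.

    records: list of dict rows (not modified in place).
    schema_columns: set of column names in the table schema passed in.
    encoders: {column_name: {raw_value: standardized_code}}.
    """
    encoded = []
    for row in records:
        new_row = dict(row)
        for column, code_fixes in encoders.items():
            # A column absent from the schema is skipped without any warning.
            if column in schema_columns and column in new_row:
                value = new_row[column]
                new_row[column] = code_fixes.get(value, value)
        encoded.append(new_row)
    return encoded


encoders = {"prime_mover_code": {"st": "ST"}}
rows = [{"generator_id": "1", "prime_mover_code": "st"}]

# Hypothetical schemas: the normalized table lacks prime_mover_code.
normalized_schema = {"generator_id"}
wide_schema = {"generator_id", "prime_mover_code"}

skipped = encode(rows, normalized_schema, encoders)
fixed = encode(rows, wide_schema, encoders)
```

With the normalized schema the raw value `st` passes through untouched, while the wider schema standardizes it to `ST`, which mirrors the difference described for the tables listed above.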
Suggested Fix