catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Faulty encoding in the EIA transformers #3542

Closed aesharpe closed 2 months ago

aesharpe commented 2 months ago

Describe the bug

Some of the transform functions for EIA tables (specifically in EIA923, but potentially 860 as well) are running the encode step using resource metadata for normalized, post-harvest tables that don't have key encodable columns in their schema. This means that encodable columns in the denormalized tables are not getting encoded before being passed on to the harvesting process. The main culprit columns getting missed by this encoding error are prime_mover_code and ba_code.

See branch test-eia-transform-encoders for notebook explanation / code example

Bug Severity

Low - it's not a blocking error and the changes of there being bad prime mover or balancing authority codes is relatively low. Moreover, the harvesting process itself will probably weed them out.

However, if there's an encodable column in a table we should be making sure that it gets encoded via some sort of check rather than just relying on the person writing the transform function to pick the right table to pass to the encoder step. There is a world where the harvesting process gets messed up due to bad data bypassing the encoding step.

To Reproduce

Expected behavior

Passing a table to the encoder that does not have an encodable column (such as prime_mover_code) in its schema means that that column will not get encoded in the core table.

In it's current state, I expect the following encodable columns to bypass the encoding step because the table being passed into the encode function does not have them in their schema (there could be more, this is just a cursory check of eia923 -- should check eia860 too):

Suggested Fix

Nancy9ice commented 2 months ago

Hello, I would like to work on this issue. Do I need to work directly on the test-eia-transform-encoders branch or on the main branch?

aesharpe commented 2 months ago

Hi @Nancy9ice I think @zaneselvans is taking a look at this, but feel free to make a suggestion. I don't think we're ready to dive in and fix anything yet - still figuring out the best design for a solution.

If you want to help out more broadly, I recommend attending our office hours. Right now it's a little tricky to jump in blind!

Nancy9ice commented 2 months ago

Okay, I understand. I just got on the interface to attend the office hours and I noticed that some questions asked are for those that want to use the PUDL probably for work purposes. If I'm just interested in contributing to the PUDL open source project, am I still eligible to attend the office hours? @aesharpe

aesharpe commented 2 months ago

@Nancy9ice yes, office hours are for anyone. The questions are just intended to get to know you and your intentions so we can pick the best person on our team to join the call.

zaneselvans commented 2 months ago

I think is potentially a pretty serious issue, since it means we aren't always feeding good categorical values into the harvesting process, and I suspect there are dozens of columns affected right now.

It seems like this problem grew out of our prior practice of not including all of the available columns for harvesting, if they were going to get dropped eventually anyway.

If every column name is encoded by at most one coding table, then we could just look at the global list of all foreign key in the database schema, and look up the encoder (if any exists) for each column in the dataframe being processed, rather than linking it to a particular table/resource definition. But if it's ambiguous which coding table you should be looking at then this will break down (e.g. if there were a FERC fuel_type_code column and an EIA fuel_type_code column, and they were linked to different coding tables, you wouldn't know a priori which one to use based on just the column name)

cmgosnell commented 2 months ago

a few musings:

We should also probably develop a check that makes sure encodable columns are getting encoded.

if i understand correctly, all of the encoded columns feed into enum data types in the SQL schema and do get checked.

cmgosnell commented 2 months ago

@zaneselvans in order to address your which set of codes to apply problem: because this is primarily a pre-harvesting issue (i believe) couldn't we restrict the encode-able columns to just eia? via either the etl_group or the sources or the field_namespace?

Nancy9ice commented 2 months ago

@Nancy9ice yes, office hours are for anyone. The questions are just intended to get to know you and your intentions so we can pick the best person on our team to join the call.

Okay, thank you. I'll look at it and fix a time. Meanwhile, I just added an item to the 'Discussion' section on a problem that I'm experiencing in setting up my development environment. I tagged you to it. Please help me check it as I need feedback as soon as possible🙏. Thank you