Faulty encoding in the EIA transformers

aesharpe commented 2 months ago

Describe the bug

Some of the transform functions for EIA tables (specifically in EIA923, but potentially 860 as well) are running the encode step using resource metadata for normalized, post-harvest tables that don't have key encodable columns in their schema. This means that encodable columns in the denormalized tables are not getting encoded before being passed on to the harvesting process. The main culprit columns getting missed by this encoding error are prime_mover_code and ba_code.

See branch test-eia-transform-encoders for notebook explanation / code example

Bug Severity

Low - it's not a blocking error and the changes of there being bad prime mover or balancing authority codes is relatively low. Moreover, the harvesting process itself will probably weed them out.

However, if there's an encodable column in a table we should be making sure that it gets encoded via some sort of check rather than just relying on the person writing the transform function to pick the right table to pass to the encoder step. There is a world where the harvesting process gets messed up due to bad data bypassing the encoding step.

To Reproduce

Checkout the test-eia-transform-encoders branch
Open the notebook test_encoder_in_eia923_transform.ipynb and read the description
Play with the inputs to the _core_eia923__generation function encoder step to see how it changes depending on which table you pass in.

Expected behavior

Passing a table to the encoder that does not have an encodable column (such as prime_mover_code) in its schema means that that column will not get encoded in the core table.

In it's current state, I expect the following encodable columns to bypass the encoding step because the table being passed into the encode function does not have them in their schema (there could be more, this is just a cursory check of eia923 -- should check eia860 too):

_core_eia923__fuel_receipts_costs
- ba_code
_core_eia923__cooling_system_information
- ba_code
_core_eia923__generation_fuel
- ba_code
_core_eia923__generation_fuel_nuclear
- ba_code
_core_eia923__fgd_operation_maintenance
- ba_code
_core_eia923__generation
- ba_code
- prime_mover_code

Suggested Fix

We should make sure that all encodable columns are getting encoded in the transform step by passing appropriate tables to the encoder in each transform.
We should also probably develop a check that makes sure encodable columns are getting encoded.

Nancy9ice commented 2 months ago

Hello, I would like to work on this issue. Do I need to work directly on the test-eia-transform-encoders branch or on the main branch?

aesharpe commented 2 months ago

Hi @Nancy9ice I think @zaneselvans is taking a look at this, but feel free to make a suggestion. I don't think we're ready to dive in and fix anything yet - still figuring out the best design for a solution.

If you want to help out more broadly, I recommend attending our office hours. Right now it's a little tricky to jump in blind!

Nancy9ice commented 2 months ago

Okay, I understand. I just got on the interface to attend the office hours and I noticed that some questions asked are for those that want to use the PUDL probably for work purposes. If I'm just interested in contributing to the PUDL open source project, am I still eligible to attend the office hours? @aesharpe

aesharpe commented 2 months ago

@Nancy9ice yes, office hours are for anyone. The questions are just intended to get to know you and your intentions so we can pick the best person on our team to join the call.

zaneselvans commented 2 months ago

I think is potentially a pretty serious issue, since it means we aren't always feeding good categorical values into the harvesting process, and I suspect there are dozens of columns affected right now.

It seems like this problem grew out of our prior practice of not including all of the available columns for harvesting, if they were going to get dropped eventually anyway.

If every column name is encoded by at most one coding table, then we could just look at the global list of all foreign key in the database schema, and look up the encoder (if any exists) for each column in the dataframe being processed, rather than linking it to a particular table/resource definition. But if it's ambiguous which coding table you should be looking at then this will break down (e.g. if there were a FERC fuel_type_code column and an EIA fuel_type_code column, and they were linked to different coding tables, you wouldn't know a priori which one to use based on just the column name)

cmgosnell commented 2 months ago

a few musings:

In the harvesting process: could we apply an encoding step in the beginning of the harvesting step that just searched for any column in the df that was "encodable" and applied the encoding? The current encoding process is tied to a given resources, but it is really the fields themselves that get cleaned. We could just compile all of the encode-able fields from the metadata and encode any field that shows up in a df. That way all of the inputs to the harvesting process would be clean and we'd probably get the maximum amount to harvested fields out
for the non-harvested core eia tables: we could just always encode the tables with the recource name in transform.eia.finished_eia_asset_factory... or is the encoding happening automatically to ever asset that gets written to the db? so this step is unnecessary?

We should also probably develop a check that makes sure encodable columns are getting encoded.

if i understand correctly, all of the encoded columns feed into enum data types in the SQL schema and do get checked.

cmgosnell commented 2 months ago

@zaneselvans in order to address your which set of codes to apply problem: because this is primarily a pre-harvesting issue (i believe) couldn't we restrict the encode-able columns to just eia? via either the etl_group or the sources or the field_namespace?

Nancy9ice commented 2 months ago

@Nancy9ice yes, office hours are for anyone. The questions are just intended to get to know you and your intentions so we can pick the best person on our team to join the call.

Okay, thank you. I'll look at it and fix a time. Meanwhile, I just added an item to the 'Discussion' section on a problem that I'm experiencing in setting up my development environment. I tagged you to it. Please help me check it as I need feedback as soon as possible🙏. Thank you

catalyst-cooperative / pudl