catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 106 forks source link

Move slowly varying columns from plants_entity_eia to annual plants_eia860 #1748

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

In working on the EIA fuel price imputation (See #1708) it became clear that the EIA Sector ID assigned to a plant does change occasionally (e.g. when a plant is sold by a regulated utility to an IPP / merchant generator). This is one of the fields used in reporting aggregated fuel prices in the EIA API, which contains information about redacted prices.

To more accurately reflect this slow-changing variable and enable reproduction of the EIA aggregations, move sector_id_eia and sector_name_eia from the plants_entity_eia table (which contains fixed attributes) to the plants_eia860 table, which contains annual attributes.

While I'm in here shuffling these fields around.... here are the fixed plant attributes:

        "static_cols": [
            "balancing_authority_code_eia",  # moved to plants_eia860
            "balancing_authority_name_eia",  # moved to plants_eia860
            "city",
            "county",
            "ferc_cogen_status",  # moved to plants_eia860
            "ferc_exempt_wholesale_generator",  # moved to plants_eia860
            "ferc_small_power_producer",  # moved to plants_eia860
            "grid_voltage_2_kv",  # moved to plants_eia860
            "grid_voltage_3_kv",  # moved to plants_eia860
            "grid_voltage_kv",  # moved to plants_eia860
            "iso_rto_code",  # moved to plants_eia860
            "latitude",
            "longitude",
            "plant_name_eia",
            "primary_purpose_id_naics",  # moved to plants_eia860
            "state",
            "street_address",
            "zip_code",
        ],

Others that I think should probably be allowed to vary annually, due to the fact that their territories shift over time:

Other possibilities...

@TrentonBush do you have thoughts?

zaneselvans commented 1 year ago

From @TrentonBush:

Without knowing the historical values of these fields, I don't actually know which can change. I can imagine that plant names, BA & ISO membership, FERC statuses and street address can change. But there are established ways to model slowly changing dimensions without having to choose between static vs high frequency representation: add columns start_date/end_date or valid_from/valid_to and add a new row when something changes. See "SCD Type 2" in this article

zaneselvans commented 1 year ago

@TrentonBush I don't think there's a lot of harm in having the high frequency representation for now, and it'll be more accurate than the current homogenization that we get from forcing them to be static. But maybe it would be appropriate to migrate to these other kinds of slowly changing representations once we're actually populating a data warehouse and have more in-DB representations of the information? Integrating those schema changes now sounds like it would be painful.

It's also sometimes hard to differentiate between "changes slowly" and data entry errors.

zaneselvans commented 1 year ago

I think all the location-based columns should probably be treated as static. If there are changes in the address across years, I think they're almost all going to be just different representations of the same address, which should be consolidated into a single value. And state, county, city, zip code, timezone, latitude and longitude should be pretty permanent.

I'm torn on plant_name_eia. It's mostly for human readability / recognition. I guess it does very occasionally change though, and we use it for matching FERC and EIA plants to each other programmatically, and in that context I guess having the updated names could be helpful in the rare instances where they change.

TrentonBush commented 1 year ago

I started writing out a comment about where we draw the boundary between "small-t transforms" vs "big-T Transforms" as they relate to de-noising quasi-static attributes and reconciling conflicting data sources (aka "harvesting"), but then realized that we don't keep anything except the final outputs, so that boundary is currently irrelevant.

As it stands now, these modeling decisions feel high-stakes because all assumptions are final, data loss is permanent, and a differently-opinioned user's only recourse is to start from scratch with cumbersome raw data files (cost prohibitive). Ultimately I think a lot of this problem can be avoided if we store light-touch intermediate data in the database, probably in a separate schema namespace.

I guess this comment is more about the harvesting refactor than this particular issue, but essentially I think we should keep the full time resolution, noisy data somewhere in the database and offer a simplified version for convenience.

But that is in the future. For today's column assignment problem, I agree with your allocations. I'm tempted to put plant name and address in the time varying table, but can see how that extra information probably comes at a cost of annoying most users who don't really care about the greatest accuracy there.

zaneselvans commented 1 year ago

Ultimately I think a lot of this problem can be avoided if we store light-touch intermediate data in the database, probably in a separate schema namespace.

This is definitely the longer term plan! But we have to get the Dagster stuff implemented and start building a real data warehouse.

I'm tempted to put plant name and address in the time varying table

I was very tempted to put plant_name_eia in the annual table, and I think we should ultimately because we use it for matching with FERC so tracking real name changes will be helpful. But this will definitely have downstream impacts that I don't think are within the scope of the current changes.