Closed zaneselvans closed 1 year ago
From @TrentonBush:
Without knowing the historical values of these fields, I don't actually know which can change. I can imagine that plant names, BA & ISO membership, FERC statuses and street address can change. But there are established ways to model slowly changing dimensions without having to choose between static vs high frequency representation: add columns start_date/end_date or valid_from/valid_to and add a new row when something changes. See "SCD Type 2" in this article
@TrentonBush I don't think there's a lot of harm in having the high frequency representation for now, and it'll be more accurate than the current homogenization that we get from forcing them to be static. But maybe it would be appropriate to migrate to these other kinds of slowly changing representations once we're actually populating a data warehouse and have more in-DB representations of the information? Integrating those schema changes now sounds like it would be painful.
It's also sometimes hard to differentiate between "changes slowly" and data entry errors.
I think all the location-based columns should probably be treated as static. If there are changes in the address across years, I think they're almost all going to be just different representations of the same address, which should be consolidated into a single value. And state, county, city, zip code, timezone, latitude and longitude should be pretty permanent.
I'm torn on plant_name_eia
. It's mostly for human readability / recognition. I guess it does very occasionally change though, and we use it for matching FERC and EIA plants to each other programmatically, and in that context I guess having the updated names could be helpful in the rare instances where they change.
I started writing out a comment about where we draw the boundary between "small-t transforms" vs "big-T Transforms" as they relate to de-noising quasi-static attributes and reconciling conflicting data sources (aka "harvesting"), but then realized that we don't keep anything except the final outputs, so that boundary is currently irrelevant.
As it stands now, these modeling decisions feel high-stakes because all assumptions are final, data loss is permanent, and a differently-opinioned user's only recourse is to start from scratch with cumbersome raw data files (cost prohibitive). Ultimately I think a lot of this problem can be avoided if we store light-touch intermediate data in the database, probably in a separate schema namespace.
I guess this comment is more about the harvesting refactor than this particular issue, but essentially I think we should keep the full time resolution, noisy data somewhere in the database and offer a simplified version for convenience.
But that is in the future. For today's column assignment problem, I agree with your allocations. I'm tempted to put plant name and address in the time varying table, but can see how that extra information probably comes at a cost of annoying most users who don't really care about the greatest accuracy there.
Ultimately I think a lot of this problem can be avoided if we store light-touch intermediate data in the database, probably in a separate schema namespace.
This is definitely the longer term plan! But we have to get the Dagster stuff implemented and start building a real data warehouse.
I'm tempted to put plant name and address in the time varying table
I was very tempted to put plant_name_eia
in the annual table, and I think we should ultimately because we use it for matching with FERC so tracking real name changes will be helpful. But this will definitely have downstream impacts that I don't think are within the scope of the current changes.
In working on the EIA fuel price imputation (See #1708) it became clear that the EIA Sector ID assigned to a plant does change occasionally (e.g. when a plant is sold by a regulated utility to an IPP / merchant generator). This is one of the fields used in reporting aggregated fuel prices in the EIA API, which contains information about redacted prices.
To more accurately reflect this slow-changing variable and enable reproduction of the EIA aggregations, move
sector_id_eia
andsector_name_eia
from theplants_entity_eia
table (which contains fixed attributes) to theplants_eia860
table, which contains annual attributes.While I'm in here shuffling these fields around.... here are the fixed plant attributes:
Others that I think should probably be allowed to vary annually, due to the fact that their territories shift over time:
balancing_authority_code_eia
balancing_authority_name_eia
iso_rto_code
Other possibilities...
primary_purpose_id_naics
-- as withsector_id_eia
it seems like this could change when plants are bought & sold. Seems like this should definitely be annual, if the high level EIA sector code is, since the EIA sector is derived from this ID.ferc_exempt_wholesale_generator
-- probably doesn't change often, but seems like it can change occasionally.ferc_small_power_producer
,ferc_cogen_status
-- Determined by FERC Form 556, probably wouldn't ever go back once the status has been assigned, but might not have always been a qualifying facility.grid_voltage_*
-- do grid ties change / get upgraded over time?@TrentonBush do you have thoughts?