catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

utility "owners" are not harvested into normalized EIA utility table #1393

Closed cmgosnell closed 5 months ago

cmgosnell commented 2 years ago

There are about 1000 utilities that are currently not captured in the utilities_eia table through the harvesting process because they only show up as "owners" in the ownership_eia860 table.

# with a pudl_out object
own = pudl_out.own_eia860()
utils = pudl_out.utils_eia860()

len(own[~own.owner_utility_id_eia.isin(utils.utility_id_eia.unique())].owner_utility_id_eia.unique())
1009

This is causing me trouble because I am using the ownership table in the EIA plant-part list to generate plant-part records for each owner. This results in ~3% of the plant-part records not having utility_id_pudl because I was assuming that all of the utilities were in utilities_eia. I think these owner records do have utility_id_pudl because we are capturing them in the PUDL id mapping process - but they are not actually linked to them because this link happens through the utilities_eia table.

zaneselvans commented 2 years ago

I feel like we've come across something like this before.

This seems like a specific case of a more general issue that I suspect we'll run into other places -- that it's possible to have multiple columns in a table which have foreign keys referring to the same other table -- because the same entities can have multiple different kinds of relationships with each other. So somehow in the normalization process we need to be able to indicate which columns are associated with what entities, even when they don't have the canonical column names. But then we'll also have to deal with the other attribute columns in those tables, and identifying which of the multiple PK columns they're supposed to be associated with.

Also, doesn't the original utilities_eia860 table come from a spreadsheet tab that lists all of the utilities and indicate whether they are reporting as owners or operators? I would have thought that would catch all of these. Maybe that's a recent development? Looking at the distribution of missing IDs across years, it seems like it starts being a problem around 2011, and ramps up from there. I wonder if these utilities are mostly part owners of renewable plants?

cmgosnell commented 2 years ago

In this case I actually don't think it is that messy of a PK/FK relationship

utilities_pudl
  * utility_id_pudl
      |
      V
  utilities_eia
    * utility_id_eia
    * utility_id_pudl (FK from utilities_pudl)
       |
       V
    ownership_eia860
      * utility_id_eia (FK from utilities_eia.utility_id_eia)
      * owner_utility_id_eia (FK from utilities_eia.utility_id_eia)

Codifing the need to harvest both the utility_id_eia and the owner_utility_id_eia seems potentially challenging.

There is a owner and operator question in the main utilities table from EIA 860, but it appears this still does not capture all of the owners.

On your question about how this has changed over time:

(
    own[~own.owner_utility_id_eia.isin(utils.utility_id_eia.unique())]
    .groupby(['report_date'])[['owner_utility_id_eia']].nunique()
)
           | owner_utility_id_eia
report_date 
2003-01-01 | 5
2010-01-01 | 1
2011-01-01 | 80
2012-01-01 | 158
2013-01-01 | 238
2014-01-01 | 326
2015-01-01 | 396
2016-01-01 | 502
2017-01-01 | 538
2018-01-01 | 607
2019-01-01 | 721
2020-01-01 | 820

Looking at the issue within the plant-part list, it looks like the majority of them are renewables, but there is a smattering across the board: image

Regardless of the plant type, we should probably come up with a general purpose solution for this.

zaneselvans commented 2 years ago

I wonder if a simple renaming of the columns would fix this in the near term. The plant-operator relationship should already be captured in the plants_eia860 table. If we renamed owner_utility_id_eia to utility_id_eia and utility_id_eia to operator_utility_id_eia then the ownership relationship would be captured / harvested, and we could drop the operator_utility_id_eia column from the table altogether, allowing us to partially address #1266 as well.

Like if we're going to fail to harvest some data from the table, we should probably choose to fail on data we know we should be getting somewhere else already.

cmgosnell commented 2 years ago

I think this would be a fine - if not somewhat janky - solution. It definitely would be outside of our current convention of what a utility_id_eia is in most tables with any plant-level information.

I'm personally less concerned about this a short term issue... it is annoying to work around right now but not the end of the world.

What I think would need to happen for a more durable solution in the new harvesting process: for every column that is to be harvested, the default behavior is that is collects columns with the exact same name from any table that shares the harvested table's primary keys. But a possibility is to collect columns with any number of names from tables that share the harvested table's primary key.

zaneselvans commented 2 years ago

Maybe this is what you're suggesting, but it should be possible to harvest selectively based on name and the foreign key relationships. Like any column that is already identified as referring to an entity table column (like utility_id_eia) even if it doesn't have exactly the same name as the column it refers to (e.g. owner_utility_id_eia) should still be considered a source for harvested values. Would that be a good idea?

I thought the renaming was janky initially but now I'm not so sure. Does it make sense to have the operator utility_id_eia (which isn't part of the PK in this table) be the value that gets harvested -- or that determines which other values are associated with that ID based on the other columns in the ownership_eia860 table?

cmgosnell commented 2 years ago

Yes, that is what I was thinking. If the entity table column can be harvested from more than one column name, in this case, we could grab both the utility_id_eia and the owner_utility_id_eia from the same table. Hm, I'm just realizing that this column in particular is a part of the primary key for this entity (the utility!)

On the renaming.. i mostly feels janky because this would go against the naming convention for the utility that we have in all other tables. In any other table where a utility id and a plant are present, the utility id is the operator. In this table, that would be different.

Right now, this table is only plant/generator IDs, utility (operator) ID/info and utility (owner) ID/info. I guess I could imagine just fully dropping the utility (operator) ID/info.... but this goes against our desire to STOP dropping columns before they get harvested. Really both the operator and the owner are utilities and should be collected as such.

zaneselvans commented 1 year ago

@knordback This issue is adjacent to the changes you made recently on the #509 branch.

cmgosnell commented 5 months ago

@katie-lamb is this one donezo now? i think yes