catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

attempt a record linkage between EIA utilities and CorpsWatch's SEC filers #2337

Closed cmgosnell closed 1 year ago

cmgosnell commented 1 year ago

This is a next step out of the preliminary investigation from #2225

In scope

Out-of Scope

katie-lamb commented 1 year ago

I did an initial pass at record linkage with just 2005 data and here are some questions I'm left with. Pretty much zero tuning went into this model so results are, as expected, nowhere near perfect.

I'm using splink and it provides some helpful charts to understand the modeling process.

This first chart shows the match weight for each of the "comparison levels". For example, if the utility names of two records have a Levenshtein edit distance < 2, then it will be given a positive match weight of ~8 for the utility name comparison. If the city names are very different, it will be given a very negative match weight.

Screen Shot 2023-03-06 at 2 18 34 PM

This next chart shows how records that the model deems to be a match are distributed amongst the comparison levels. Of note here is that of the matching records, ~97% have a utility name Levenshtein edit distance > 5. Additionally, of the matching records, 98% have an exact match on city. I'll note here that the "match set" that the model is using here isn't actually its predictions. It's based on a set of conditions that I gave it that I think should be true for matching records. I estimated the recall with these rules to be 70%. These rules definitely need some tuning to make that "match set" better.

Screen Shot 2023-03-06 at 2 18 52 PM

Takeaways:

With just the 2005 data, the model matches about 60% of SEC companies and 10% of EIA companies (assuming that the threshold for a correct match is a score of .5 or greater). There are far fewer SEC companies so this is sort of expected.

I think with some pretty basic tuning, results will get much better.

zaneselvans commented 1 year ago

I'm not sure what you mean by this:

Maybe I misunderstood the context of the problem, but I thought we were trying to match parent and owner companies, and 90% of the time it's one to one.

I may not have enough context here but IIRC, the CorpWatch database has a table that's all of the companies, covering both the parents and subsidiaries, and I thought that the record linkage problem we wanted to do was to find matches between that big SEC list of companies and the EIA list of utilities, without (initially) concerning ourselves about which role they're playing.

With that linkage, we'd hopefully be able to join the SEC/CorpWatch company ID into the utilities_entity_eia or utilities_eia860 table, and use the relationships between parents & subsidiaries reported in the Corpwatch DB (based on their SEC/CorpWatch company ID) to aggregate up EIA ownership or operator data to the parent company level.

Why did you choose 2005 to start with? I could imagine the addresses associated with utilities changing a fair bit from year to year, and it might be that they change at different times in the two datasets (or even that they report entirely different addresses to SEC vs. EIA -- incorporation location in Delaware vs. operational HQ?). Might it make sense to try doing the record linkage without considering year to start with? Just get a deduplicated list of all the companies in both datasets with all the names and addresses they've ever reported?

IIRC, we are not currently harvesting utilities that only show up as owners (See #1393) which means you'll need to compile your own mall-encompassing list of potential utility names and addresses, based on both the ownership_eia860 and utilities_eia860 tables.

On the Zip codes, I would think that it would be common for one city to have many zip codes, and rare for one zip code to have multiple cities (probably only in really rural places), but I'm sure both of them happen. But I would expect each full address to have an almost perfectly 1:1 relationship with zip code (unless the zip code boundaries got moved, which happens occasionally I think).

I bet there are some standard address normalization libraries out there we could use. Or I think USPS has an API for standardizing addresses too, but it might be annoying to use / rate-limited.

I agree with your intuition that the company names should be highly weighted. You could have lots of companies registered to the same (or almost the same) address, as often happens with PO Boxes in Delaware.

But at the same time, I think a lot of parents and subsidiaries will have similar names, Like look at how many utility names contain the word "Duke" in the EIA data. I think the exact match + Levenshein distance <= 2 criteria might be too stringent, or not the right way to do it. I'm sure there are misspellings, but I think differences will more frequently arise from things lik Inc vs. Incorporated vs. Corp. (or lack of Inc at all), LLC vs. Limited or Ltd. Maybe some address-style normalization of these words makes sense?

zaneselvans commented 1 year ago

@katie-lamb @cmgosnell @jrea-rmi Some work from Climate Trace that might be interesting in the context of this issue:

Seems like it could be more granular and complete, at least in the US:

Electricity Generation: Assets were defined at the level of individual power plants. Ownership was defined in terms of percent financial interest in the asset as a piece of real property, a business concern, state-owned enterprise, or government agency. Ownership data for 552 plants was derived from the GEM Wiki’s Global Coal Plant Tracker (GCPT) and Global Gas Plant Tracker (GGPT) (GEM, 2022). To ensure ownership coverage for the top 500 emitting assets, ownership information for 13 plants was derived through desk research. In some cases, there were different owners for specific units within an asset. In these cases, the unit level ownership data was aggregated to the asset level by summing the ownership of each unit in the asset weighted by its capacity. See the Climate TRACE electricity methodology for more information on plant capacity. The ‘owner name’ in the Climate TRACE dataset was derived from the ‘owner’ field on the GEM wiki where available.

jrea-rmi commented 1 year ago

That's probably a significant fraction of electricity sector emissions, but doesn't get us owned generation mwh and will miss a lot of utilities that we'd do analysis on. So unfortunately probably not a new complete resource we could use

katie-lamb commented 1 year ago

I may not have enough context here but IIRC, the CorpWatch database has a table that's all of the companies, covering both the parents and subsidiaries, and I thought that the record linkage problem we wanted to do was to find matches between that big SEC list of companies and the EIA list of utilities, without (initially) concerning ourselves about which role they're playing.

Got it. This makes sense. The names actually will match up then.

Why did you choose 2005 to start with? I could imagine the addresses associated with utilities changing a fair bit from year to year, and it might be that they change at different times in the two datasets (or even that they report entirely different addresses to SEC vs. EIA -- incorporation location in Delaware vs. operational HQ?). Might it make sense to try doing the record linkage without considering year to start with? Just get a deduplicated list of all the companies in both datasets with all the names and addresses they've ever reported?

This was mostly because if you ignore year as a blocking rule, there are a ton of essentially duplicate records on both sides that I thought would potentially mess with matching. And I wanted instantaneously fast results lol. Agree that a good next step is going to be deduplicating both datasets. In the SEC data there’s a column for min year and max year, which I assume represent when there was a change in address, and for what years that address is applicable. There’s also a year column that falls within that that range. I thought it might work to just match this year column with the EIA records, but agree that some potentially bad address reporting makes this not a great strategy.

IIRC, we are not currently harvesting utilities that only show up as owners (See #1393) which means you'll need to compile your own mall-encompassing list of potential utility names and addresses, based on both the ownership_eia860 and utilities_eia860 tables.

Ah interesting, I didn’t realize this.

I agree with your intuition that the company names should be highly weighted. You could have lots of companies registered to the same (or almost the same) address, as often happens with PO Boxes in Delaware.

But at the same time, I think a lot of parents and subsidiaries will have similar names, Like look at how many utility names contain the word "Duke" in the EIA data. I think the exact match + Levenshein distance <= 2 criteria might be too stringent, or not the right way to do it. I'm sure there are misspellings, but I think differences will more frequently arise from things lik Inc vs. Incorporated vs. Corp. (or lack of Inc at all), LLC vs. Limited or Ltd. Maybe some address-style normalization of these words makes sense?

Ya I agree that Levenshtein maybe isn’t the right metric here. I did some very basic string cleaning and normalization on the addresses, but definitely more normalization is an easy next step.

jrea-rmi commented 1 year ago

@katie-lamb if you haven't seen it, the OS-Climate CompanyNameCleaner has some useful tools for cleaning company names

https://github.com/os-climate/financial-entity-cleaner

katie-lamb commented 1 year ago

I made a few quick changes to the preprocessing and reran the model:

Results were slightly better, but still too much weight is put on city and not enough on utility_name. This is something that could be tuned in the blocking/parameter training step by playing with the comparison levels and metrics that are made between records. Additionally, I think I need to use a string distance approach that looks at the corpus of names as a whole, and recognizes that words like "limited", "liability", "company", "corporation", etc. are very common and does a term frequency adjustment for this.

Here are the charts comparing match weight. As you can see, still too much weight is put on city and not enough on utility name.

Screen Shot 2023-03-14 at 8 48 27 PM Screen Shot 2023-03-14 at 8 48 34 PM
zaneselvans commented 1 year ago

Could we use TF-IDF to vectorize the utility names (or other text fields) and cosine similarity to compare them? Or is the menu of similarity metrics hard coded as part of Splink?

katie-lamb commented 1 year ago

Ya that's a good question and something I'm trying to figure out with the CCAI work. The hardcoded similarity metrics are Jaccard, Jaro-Winkler, and Levenshtein, but they're subclasses of a generic Distance Metric class that I can use to implement a cosine similarity. I'm not sure how it would actually perform with the model. I think it could work.

Here's a recent issue about this in the splink repo where the maintainer writes out an outline for how to implement this.

zaneselvans commented 1 year ago

The Inverse Document Frequency (IDF) part of TF-IDF deals with the presence of common (and thus not very important) words like "Limited" and "Corp" nicely. If the matching is being impacted more by word-level differences than myriad misspellings, maybe word-level tokenization rather than length-N substrings would be good enough, in which case it wouldn't blow up memory in the same way that it does in the FERC Plant ID assignments?

cmgosnell commented 1 year ago

I'm closing this as it was a good first pass to get us to understand what a full integration and linkage would entail. Next steps coming soon & will link back to this PR