attempt a record linkage between EIA utilities and CorpsWatch's SEC filers

cmgosnell commented 1 year ago

This is a next step out of the preliminary investigation from #2225

In scope

[x] use one-off downloaded & extracted CorpsWatch EX-21 data to make a utility table to compare to EIA
- [x] probably have multiple records per utility for each address-type
[x] prep an EIA utility table with cooresponding columns to match based on
[x] setup a simple unsupervised rl process
[x] key results needed:
- [x] how many SEC filers are matched to EIA utils?
- [x] how many EIA utils are matched to SEC filers?
- [x] Spot check a few dozen for accuracy

Out-of Scope

[ ] archiving CorpsWatch's Exhibit 21 data
[ ] integrating the EX-21 data into pudl
[ ] generating training data

katie-lamb commented 1 year ago

I did an initial pass at record linkage with just 2005 data and here are some questions I'm left with. Pretty much zero tuning went into this model so results are, as expected, nowhere near perfect.

I'm using splink and it provides some helpful charts to understand the modeling process.

This first chart shows the match weight for each of the "comparison levels". For example, if the utility names of two records have a Levenshtein edit distance < 2, then it will be given a positive match weight of ~8 for the utility name comparison. If the city names are very different, it will be given a very negative match weight.

This next chart shows how records that the model deems to be a match are distributed amongst the comparison levels. Of note here is that of the matching records, ~97% have a utility name Levenshtein edit distance > 5. Additionally, of the matching records, 98% have an exact match on city. I'll note here that the "match set" that the model is using here isn't actually its predictions. It's based on a set of conditions that I gave it that I think should be true for matching records. I estimated the recall with these rules to be 70%. These rules definitely need some tuning to make that "match set" better.

Takeaways:

utility name isn't being weighted high enough?
- My intuition was that utility name should likely be within close edit distance for matching records. But maybe I'm wrong. There are a lot of situations it seems where two records have the same street address, but the name is very different. Maybe I misunderstood the context of the problem, but I thought we were trying to match parent and owner companies, and 90% of the time it's one to one. I guess the matching companies might have very different names?
- My intuition of placing a high weight on utility name matching is also incorporated in my blocking rules. So the blocking currently doesn't reflect the match weights, which means the model isn't performing very well.
city and zip code are dependent variables
- I think this is actually kind of a big issue with this first baseline model. As long as the city and zip aren't filled in erroneously, then this is often a one to one mapping. I believe this could mean that a correct city and zip match are being weighted too high, when street address is actually more of the distinguishing location variable.

With just the 2005 data, the model matches about 60% of SEC companies and 10% of EIA companies (assuming that the threshold for a correct match is a score of .5 or greater). There are far fewer SEC companies so this is sort of expected.

I think with some pretty basic tuning, results will get much better.

zaneselvans commented 1 year ago

I'm not sure what you mean by this:

Maybe I misunderstood the context of the problem, but I thought we were trying to match parent and owner companies, and 90% of the time it's one to one.

I may not have enough context here but IIRC, the CorpWatch database has a table that's all of the companies, covering both the parents and subsidiaries, and I thought that the record linkage problem we wanted to do was to find matches between that big SEC list of companies and the EIA list of utilities, without (initially) concerning ourselves about which role they're playing.

With that linkage, we'd hopefully be able to join the SEC/CorpWatch company ID into the utilities_entity_eia or utilities_eia860 table, and use the relationships between parents & subsidiaries reported in the Corpwatch DB (based on their SEC/CorpWatch company ID) to aggregate up EIA ownership or operator data to the parent company level.

Why did you choose 2005 to start with? I could imagine the addresses associated with utilities changing a fair bit from year to year, and it might be that they change at different times in the two datasets (or even that they report entirely different addresses to SEC vs. EIA -- incorporation location in Delaware vs. operational HQ?). Might it make sense to try doing the record linkage without considering year to start with? Just get a deduplicated list of all the companies in both datasets with all the names and addresses they've ever reported?

IIRC, we are not currently harvesting utilities that only show up as owners (See #1393) which means you'll need to compile your own mall-encompassing list of potential utility names and addresses, based on both the ownership_eia860 and utilities_eia860 tables.

On the Zip codes, I would think that it would be common for one city to have many zip codes, and rare for one zip code to have multiple cities (probably only in really rural places), but I'm sure both of them happen. But I would expect each full address to have an almost perfectly 1:1 relationship with zip code (unless the zip code boundaries got moved, which happens occasionally I think).

I bet there are some standard address normalization libraries out there we could use. Or I think USPS has an API for standardizing addresses too, but it might be annoying to use / rate-limited.

I agree with your intuition that the company names should be highly weighted. You could have lots of companies registered to the same (or almost the same) address, as often happens with PO Boxes in Delaware.

But at the same time, I think a lot of parents and subsidiaries will have similar names, Like look at how many utility names contain the word "Duke" in the EIA data. I think the exact match + Levenshein distance <= 2 criteria might be too stringent, or not the right way to do it. I'm sure there are misspellings, but I think differences will more frequently arise from things lik Inc vs. Incorporated vs. Corp. (or lack of Inc at all), LLC vs. Limited or Ltd. Maybe some address-style normalization of these words makes sense?

zaneselvans commented 1 year ago

@katie-lamb @cmgosnell @jrea-rmi Some work from Climate Trace that might be interesting in the context of this issue:

Write-up on their attempt to figure out who owns which xenoforming stations
Methodology (PDF)
Data downloads as CSV (view by country, and then download the US non-forestry data to get electricity system data, it's behind a registration wall).
More details on the electricity system methodology

Seems like it could be more granular and complete, at least in the US:

Electricity Generation: Assets were defined at the level of individual power plants. Ownership was defined in terms of percent financial interest in the asset as a piece of real property, a business concern, state-owned enterprise, or government agency. Ownership data for 552 plants was derived from the GEM Wiki’s Global Coal Plant Tracker (GCPT) and Global Gas Plant Tracker (GGPT) (GEM, 2022). To ensure ownership coverage for the top 500 emitting assets, ownership information for 13 plants was derived through desk research. In some cases, there were different owners for specific units within an asset. In these cases, the unit level ownership data was aggregated to the asset level by summing the ownership of each unit in the asset weighted by its capacity. See the Climate TRACE electricity methodology for more information on plant capacity. The ‘owner name’ in the Climate TRACE dataset was derived from the ‘owner’ field on the GEM wiki where available.

jrea-rmi commented 1 year ago

That's probably a significant fraction of electricity sector emissions, but doesn't get us owned generation mwh and will miss a lot of utilities that we'd do analysis on. So unfortunately probably not a new complete resource we could use

katie-lamb commented 1 year ago

I may not have enough context here but IIRC, the CorpWatch database has a table that's all of the companies, covering both the parents and subsidiaries, and I thought that the record linkage problem we wanted to do was to find matches between that big SEC list of companies and the EIA list of utilities, without (initially) concerning ourselves about which role they're playing.

Got it. This makes sense. The names actually will match up then.

Why did you choose 2005 to start with? I could imagine the addresses associated with utilities changing a fair bit from year to year, and it might be that they change at different times in the two datasets (or even that they report entirely different addresses to SEC vs. EIA -- incorporation location in Delaware vs. operational HQ?). Might it make sense to try doing the record linkage without considering year to start with? Just get a deduplicated list of all the companies in both datasets with all the names and addresses they've ever reported?

This was mostly because if you ignore year as a blocking rule, there are a ton of essentially duplicate records on both sides that I thought would potentially mess with matching. And I wanted instantaneously fast results lol. Agree that a good next step is going to be deduplicating both datasets. In the SEC data there’s a column for min year and max year, which I assume represent when there was a change in address, and for what years that address is applicable. There’s also a year column that falls within that that range. I thought it might work to just match this year column with the EIA records, but agree that some potentially bad address reporting makes this not a great strategy.

IIRC, we are not currently harvesting utilities that only show up as owners (See #1393) which means you'll need to compile your own mall-encompassing list of potential utility names and addresses, based on both the ownership_eia860 and utilities_eia860 tables.

Ah interesting, I didn’t realize this.

I agree with your intuition that the company names should be highly weighted. You could have lots of companies registered to the same (or almost the same) address, as often happens with PO Boxes in Delaware.

But at the same time, I think a lot of parents and subsidiaries will have similar names, Like look at how many utility names contain the word "Duke" in the EIA data. I think the exact match + Levenshein distance <= 2 criteria might be too stringent, or not the right way to do it. I'm sure there are misspellings, but I think differences will more frequently arise from things lik Inc vs. Incorporated vs. Corp. (or lack of Inc at all), LLC vs. Limited or Ltd. Maybe some address-style normalization of these words makes sense?

Ya I agree that Levenshtein maybe isn’t the right metric here. I did some very basic string cleaning and normalization on the addresses, but definitely more normalization is an easy next step.

jrea-rmi commented 1 year ago

@katie-lamb if you haven't seen it, the OS-Climate CompanyNameCleaner has some useful tools for cleaning company names

https://github.com/os-climate/financial-entity-cleaner

katie-lamb commented 1 year ago

I made a few quick changes to the preprocessing and reran the model:

Included the owner names from the PUDLownership_eia860 table that aren't in the utility_eia860 table.
Used the OS-Climate CompanyNameCleaner to clean the company names (LLC -> Limited Liability Company), did another round of cleaning on addresses (St. -> Street).
Deduplicated the records on utility name, address, city, and zip instead of filtering by year. After deduplication, the year attribute is ignored during matching.
Tried to improve blocking and model m and u parameter training to create a higher match weight for utility name.
Switched to using Jaccard similarity instead of Levenshtein edit distance to compare strings.

Results were slightly better, but still too much weight is put on city and not enough on utility_name. This is something that could be tuned in the blocking/parameter training step by playing with the comparison levels and metrics that are made between records. Additionally, I think I need to use a string distance approach that looks at the corpus of names as a whole, and recognizes that words like "limited", "liability", "company", "corporation", etc. are very common and does a term frequency adjustment for this.

Here are the charts comparing match weight. As you can see, still too much weight is put on city and not enough on utility name.

zaneselvans commented 1 year ago

Could we use TF-IDF to vectorize the utility names (or other text fields) and cosine similarity to compare them? Or is the menu of similarity metrics hard coded as part of Splink?

katie-lamb commented 1 year ago

Ya that's a good question and something I'm trying to figure out with the CCAI work. The hardcoded similarity metrics are Jaccard, Jaro-Winkler, and Levenshtein, but they're subclasses of a generic Distance Metric class that I can use to implement a cosine similarity. I'm not sure how it would actually perform with the model. I think it could work.

Here's a recent issue about this in the splink repo where the maintainer writes out an outline for how to implement this.

zaneselvans commented 1 year ago

The Inverse Document Frequency (IDF) part of TF-IDF deals with the presence of common (and thus not very important) words like "Limited" and "Corp" nicely. If the matching is being impacted more by word-level differences than myriad misspellings, maybe word-level tokenization rather than length-N substrings would be good enough, in which case it wouldn't blow up memory in the same way that it does in the FERC Plant ID assignments?

cmgosnell commented 1 year ago

I'm closing this as it was a good first pass to get us to understand what a full integration and linkage would entail. Next steps coming soon & will link back to this PR

catalyst-cooperative / pudl

attempt a record linkage between EIA utilities and CorpsWatch's SEC filers #2337

In scope

Out-of Scope