catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
471 stars 108 forks source link

Investigate integrating SEC 10-K EX-21 subsidiary owner info with PUDL #2225

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Many utilities identified in FERC, EIA, and EPA data are subsidiaries owned by some larger utility holding company. Understanding these ownership relationships can be helpful in understanding the economic incentives of utilities, e.g. which electric utilities are intimately linked to natural gas companies through being owned by the same parent.

@MichaelTiemannOSC pointed me at CorpWatch which has scraped data describing these relationships from the SEC 10-K filings, where it shows up in Exhibit 21, aka EX-21. They make the data freely available for download in bulk or via a RESTful API. The 10-K is reported in XBRL, so it might be possible to download the bulk data directly from SEC and create our own database.

MichaelTiemannOSC commented 1 year ago

Do let me know when you attempt to process Constellation Energy Corp's 2022 report from whatever sources you can find. I'm looking at their 2022 Sustainability Report (https://www.constellationenergy.com/content/dam/constellationenergy/pdfs/Constellation-2022-Sustainability-Report.pdf), where I find both a claim of delivering over 1.6 dekatherms of natural gas in 2021 (page 21), as well as a Scope 3 category 11 emissions disclosure totaling 69,126 kt CO2e out of a total reported Scope 3 of 93,647 kt CO2e (page 59). The problem?

ureg('1.6e9 dekatherm').to('kt CO2e')
<Quantity(84653.60969816365, 'CO2e * kt')>

Either they know how to deliver gas that burns 22% more efficiently than EPA knows about, or ??? Will be very interested to see how this ties to other disclosures and/or mass/energy/volume identities.

cmgosnell commented 1 year ago

Hokay. here is some learnings from a preliminary investigation:

the original PDF forms

From what i can tell, the 10k filer report all of their subsidiaries and the percentage ownership for each of those subsidiaries. seems pretty good!

Corpwatch Access

it's pretty simple! they have bulk downloads of both CSV and MySQL versions. I've only looked into the CSVs, but it seems like they both have the same structure. There is also api access! Which i assume would be nice for more one-off questions.

Columns shared with EIA

This is actually a fair amount of information for a record linkage since location is so specific.

Data Coverage

It looks like some tables go back to 2001 but most go back to 2003 👍🏻

The one bummer in here is that it doesn't seem like the ownership % is reported in this corpwatch bulk download :-(

zaneselvans commented 1 year ago

That is really too bad about the percentage ownership not being available. 😢

Looking at the CorpWatch API GitHub repo, it seems like most of the codebase is like 13+ years old, and written in Perl. I wonder how far automated PDF parsing has come since then, and how hard it would be to re-write this system to use e.g. Layout Parser (GitHub repo) which @TrentonBush and @katie-lamb have been talking about for that other project. Especially if there's a well defined document structure that's shared across all of the EX-21 forms?

Edit: of course there is no standard document layout. That would be too easy. A few examples:

zaneselvans commented 1 year ago

There's also TabulaPDF, but it hasn't made a release since 2018.

zaneselvans commented 1 year ago

The Corpwatch DB looks like it tracks all the companies -- both parents and subsidiaries -- and counts up how many parents each subsidiary has, and how many subsidiaries each parent has, so even without the actual percentages there's some measure of how concentrated ownership is. It might be that in most cases the utility subsidiaries are wholly owned, so you could assume it's just 100%.

Big caveats on the data quality though -- it might be that misspellings or name changes mean that the counts of parents / subsidiaries are off, since they aren't using unique universal IDs.

Depending on what fraction of these relationships are simple for the electric utilities, it might be tractable to try and fill in the missing ones by hand. Or just the number of owners of a subsidiary could be taken as a gross estimate of how distributed the ownership is (assume each of 2 owners owns 50%, each of 3 owners owns 33% etc.) Wouldn't be great, but would be a lot easier than trying to parse out the wide variety of tables with nesting relationships that respondents are using to report this information.

zaneselvans commented 1 year ago

@jrea-rmi do you want to talk about what we might do next?

MichaelTiemannOSC commented 1 year ago

FWIW, there are PRs for Pint (https://github.com/hgrecco/pint/pull/1615) and a Pint-Pandas branch (https://github.com/MichaelTiemannOSC/pint-pandas/tree/ducks-unlimited) that make it possible to use uncertainties, so you can make them 50/50 +/- 50%.

jrea-rmi commented 1 year ago

In our reverse engineering of ERM with EIA860 ownership information, we see that ~90% of power plant owners are wholly owned by a single parent. About a third of the multiple owner cases are 50-50 ownership.

image

utility_id_eia_to_parent_name.csv

In our use of these relationships, we do need the fraction owned so we can calculate the amount of generation and emissions attributable to a ticker. 90% is pretty good, but I definitely don't want to go backwards with data quality from what we already have.

The main benefits I see in CorpWatch are:

Overall at the moment, it'd be amazing to have a source I'm not convinced that the data quality or coverage will be good enough that we'll use it. It's hard to guess what that will be until record linkages have been attempted. So if you're feeling optimistic then I'm interested in seeing what comes of your attempt, but if not then maybe this stays on the backburner...

jrea-rmi commented 1 year ago

A summary comment: I'm very curious, but the lack of % ownership is an issue and I'd want to understand the coverage after record linkage better (measured by the fraction of capacity/generation/emissions covered) and if there are gaps in coverage, would Catalyst be willing to maintain overrides/additions to sub-parent ownership information?

zaneselvans commented 1 year ago

If the data you currently have indicates that 90% of the owners that show up in EIA have a single parent, and we can confirm that the Corpwatch data agrees (based on linking SEC companies to EIA companies), it seems like checking the remaining 10% with multiple owners manually might be tractable. It would be several hundred records that need to be updated once a year right?

So it seems like the potential weak point is the record linkage between the EIA-860 utilities and the SEC companies. The location data is an extremely specific discriminator, and so potentially very helpful in doing the linkage, but it seems like it's probably a legal address (HQ, incorporation address of record, mailing address, etc), rather than a permanent physical location like we have with the plants, so we might get a significant number of false negatives -- where different addresses have been reported to SEC vs. EIA for some reason. Many record linkage systems are designed with deduplicating addresses in mind, so they should work well when the reported addresses represent the same place even if they're not totally identical. I don't think we'll know to what extent the addresses (or names) being reported to the two agencies are just flat out different without trying to do the linkage. But @katie-lamb and @cmgosnell have gotten good at doing initial drafts of the record linkages.

jrea-rmi commented 1 year ago

yes, around 500 records to check per year