catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Scope XBRL FERC Format Conversion #1440

Closed cmgosnell closed 2 years ago

cmgosnell commented 2 years ago

Research the XBRL ecosystem and FERC's usage of XBRL, and develop a forward plan for integrating XBRL-based FERC filings into PUDL.

zschira commented 2 years ago

XBRL Background

I'm far from an expert on XBRL at this point, but I think I understand the basic concepts well enough to work towards extracting relevant data. Importantly, an XBRL instance is composed of facts. A fact is an atomic piece of data: it contains a value and all the information needed to interpret that value (concept, unit, time period). A taxonomy then defines the concepts that facts report against, describes the relationships between them, and provides some structure to the data.
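To make the fact structure concrete, here is a minimal sketch that pulls facts out of a tiny hand-written instance using only the Python standard library. This is not how Arelle does it, and the sample document is made up: the `ferc` namespace URI, the `OperatingRevenues` concept, and the entity identifier are placeholders, not values from the real FERC taxonomy.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional

XBRLI = "http://www.xbrl.org/2003/instance"
NS = {"xbrli": XBRLI}

# Hypothetical instance document: namespace URI, concept name, and
# entity identifier are invented for illustration only.
SAMPLE = """<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:ferc="http://example.com/hypothetical-ferc-taxonomy">
  <xbrli:context id="c1">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/ids">C000001</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2021-01-01</xbrli:startDate>
      <xbrli:endDate>2021-12-31</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:unit id="usd"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <ferc:OperatingRevenues contextRef="c1" unitRef="usd" decimals="0">1500000</ferc:OperatingRevenues>
</xbrli:xbrl>
"""


@dataclass(frozen=True)
class Fact:
    concept: str         # what is being reported
    value: str           # the reported value, kept as text
    entity: str          # who reported it
    period: str          # an instant, or "start/end" for a duration
    unit: Optional[str]  # None for non-numeric facts


def parse_facts(xml_text: str) -> List[Fact]:
    root = ET.fromstring(xml_text)

    # Resolve each context id to its (entity, period) pair.
    contexts = {}
    for ctx in root.findall("xbrli:context", NS):
        entity = ctx.findtext("xbrli:entity/xbrli:identifier", namespaces=NS)
        period = ctx.findtext("xbrli:period/xbrli:instant", namespaces=NS)
        if period is None:
            start = ctx.findtext("xbrli:period/xbrli:startDate", namespaces=NS)
            end = ctx.findtext("xbrli:period/xbrli:endDate", namespaces=NS)
            period = f"{start}/{end}"
        contexts[ctx.get("id")] = (entity, period)

    # Resolve each unit id to its measure (simple units only).
    units = {
        unit.get("id"): unit.findtext("xbrli:measure", namespaces=NS)
        for unit in root.findall("xbrli:unit", NS)
    }

    facts = []
    for elem in root:
        ctx_id = elem.get("contextRef")
        if ctx_id is None:
            continue  # contexts and units are not facts
        concept = elem.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
        entity, period = contexts[ctx_id]
        value = (elem.text or "").strip()
        facts.append(Fact(concept, value, entity, period, units.get(elem.get("unitRef"))))
    return facts


facts = parse_facts(SAMPLE)
```

The point is only that each fact carries a reference to a context (entity and period) and, for numeric values, a unit, so a fact can be interpreted without knowing where it sat in the form.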

Tools

Arelle

Arelle seems to be the only particularly mature open-source solution for interacting with XBRL. It was recently acquired by Workiva, and they claim it won't become proprietary. If this is true, it could mean more support and consistent financial backing. Arelle provides a CLI, a GUI, and direct access through an API. Whichever method we use, it will most likely need to be scripted, as Arelle is not designed for working with more than one filing at a time. For this reason, I think the API will probably be the most direct way to do this.

Arelle API

The API doesn't have much documentation, but after digging around I think I understand enough to make use of it. I've worked out how to directly access both the taxonomy and the fact lists of individual filings. With access to both of these, I should be able to move forward with integrating the XBRL filings into the ETL.
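A rough sketch of what scripting the API over several filings might look like. The Arelle calls here (`Cntlr.Cntlr`, `modelManager.load`, `modelManager.close`, and the `facts`, `qname`, `value`, `contextID`, `unitID` attributes) reflect my reading of the Arelle source and should be checked against the installed version. The function names and the injectable `loader` argument are my own, added so the batch loop can be exercised without Arelle installed.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# (concept, value, context id, unit id); the unit id is "" for non-numeric facts.
FactRow = Tuple[str, str, str, str]


def load_facts_with_arelle(path: str) -> List[FactRow]:
    """Load one filing with Arelle and return its facts as plain tuples."""
    # Imported lazily so the rest of this module works without Arelle.
    from arelle import Cntlr

    # A fresh controller per filing keeps the sketch simple; reusing one
    # controller across filings would likely be faster.
    cntlr = Cntlr.Cntlr(logFileName="logToPrint")
    model_xbrl = cntlr.modelManager.load(path)
    try:
        return [
            (str(fact.qname), fact.value, fact.contextID, fact.unitID or "")
            for fact in model_xbrl.facts
        ]
    finally:
        cntlr.modelManager.close()


def extract_filings(
    paths: Iterable[str],
    loader: Callable[[str], List[FactRow]] = load_facts_with_arelle,
) -> Dict[str, List[FactRow]]:
    """Apply the loader to each filing, one at a time."""
    return {path: loader(path) for path in paths}
```

Calling `extract_filings(["filing_a.xbrl", "filing_b.xbrl"])` (hypothetical file names) would return one fact list per filing, which is the shape the ETL would need as input.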

Other options

There are many other tools for working with XBRL, but most of them are some combination of proprietary, targeted at helping companies prepare filings, and focused on SEC data. XBRL US seems to be the biggest player in the XBRL ecosystem, and they do provide several options for accessing FERC data, but only with a paid membership.

Integration

It seems that the easiest way to integrate the new XBRL-based data with the old FoxPro-based data would be to develop a method for extracting the data and mapping it to the SQLite database created by ferc1_to_sqlite. The taxonomy released for FERC Form 1 maps tables very closely to the pages of the raw form, so it should be possible to map this data to that SQLite database. As a starting place, I plan to implement this mapping for the tables currently being used by PUDL. From there we can attempt some data verification before working on the mapping for the rest of the tables.
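As a sketch of the mapping step, the idea would be to pivot facts into one row per respondent and reporting period, with one column per concept. Everything named here is hypothetical: the `example_income` table, its columns, and the concept-to-column mapping are placeholders rather than the actual schema produced by ferc1_to_sqlite.

```python
import sqlite3

# Hypothetical mapping from XBRL concept names to columns of a
# hypothetical table; the real tables and columns will differ.
CONCEPT_TO_COLUMN = {
    "OperatingRevenues": "operating_revenues",
    "OperatingExpenses": "operating_expenses",
}


def facts_to_rows(facts):
    """Pivot (entity, period, concept, value) facts into one dict per (entity, period)."""
    rows = {}
    for entity, period, concept, value in facts:
        column = CONCEPT_TO_COLUMN.get(concept)
        if column is None:
            continue  # concept has no mapping yet, so skip it
        row = rows.setdefault(
            (entity, period),
            {"respondent_id": entity, "report_period": period},
        )
        row[column] = value
    return list(rows.values())


def write_rows(conn, rows):
    """Create the placeholder table if needed and insert the pivoted rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS example_income ("
        "respondent_id TEXT, report_period TEXT, "
        "operating_revenues REAL, operating_expenses REAL)"
    )
    # Columns with no reported fact are stored as NULL.
    defaults = {column: None for column in CONCEPT_TO_COLUMN.values()}
    conn.executemany(
        "INSERT INTO example_income VALUES ("
        ":respondent_id, :report_period, "
        ":operating_revenues, :operating_expenses)",
        [{**defaults, **row} for row in rows],
    )


conn = sqlite3.connect(":memory:")
write_rows(
    conn,
    facts_to_rows(
        [
            ("C000001", "2021", "OperatingRevenues", 100.0),
            ("C000001", "2021", "OperatingExpenses", 60.0),
        ]
    ),
)
```

Skipping unmapped concepts fits the plan of starting with only the tables PUDL currently uses and extending the mapping table by table afterwards.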

zschira commented 2 years ago

Moving from scoping to integration