catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Explore FERC's new XBRL data format #1321

Closed aesharpe closed 2 years ago

aesharpe commented 2 years ago

FERC has updated its filing practices! Now, instead of using FoxProDB, they are using XBRL files. They've dumped a bunch of the old files into this format, so I think it makes sense to explore those and figure out how to read them so we're ready for next year when there is no more FoxPro.

From an email with Robb Hudson from FERC: Contact: Robert.Hudson@ferc.gov

FERC has adopted the XBRL standard, based on XML, for Form 1, 2, 6, 60 & 714 (and their “sub-forms” – quarterlies and some non-major forms). We developed taxonomies for each of these forms and have stopped collecting data in Visual FoxPro as of 10/1/2021. You can view the current taxonomy at our Taxonomy Review & Comment Tool (and comment on the code there if you are so inclined), check out the eCollection filing portal to see all the accepted filings (more taxonomy information is here too), and read a lot about the entire project at the eForms Webpage.

The simple answer to all your questions is that everything has changed. We are no longer providing software to filers to prepare their submissions, just the taxonomy and related codesets and a vendor community has created a marketplace of software applications for filers to use (FERC has no influence over this market). I don’t understand what you mean by “radio buttons or dropdown menus”, but the entire ruleset of standardization is in the taxonomy – I suggest you look at that in the Taxonomy Review & Comment Tool first.

And you will be pleased to know that the actual XBRL instance files that filers submit will be available to download and you can develop your own XBRL database (using the taxonomy) from there. Everything is machine readable, standardized, and modern (for lack of a better word). If you are unfamiliar with XBRL data, I recommend you reach out to XBRL.US – they helped us in this effort.

We do not have any crosswalks to EIA records, though, I am certain that this new XBRL data will make that instantaneous. We have also published 10 years of data that you can find on that eForms webpage above under “Migrated data downloads” on the left.

From an email with David Tauriello from XBRL.us: Contact: david.tauriello@xbrl.us

In addition to the resources we’ve posted on our site for FERC filers, we’re working to include this data set in our Database of Public Filings.

Another good resource for learning about the data would be FERC’s eForms Refresh page

zaneselvans commented 2 years ago

"I am certain that this new XBRL data will make that instantaneous" 🤣

I find it hard to imagine that they are going to retroactively apply any kind of structure onto 10 years of incredibly messy data that is impossible to parse programmatically, so I imagine that at best the new data going forward will be clean, the last 10 years of messy data will be available in XBRL, and the 17 years of data before that will only be available through Visual FoxPro. So I suspect that whatever cleaning we're doing for the years up to 2020 will remain relevant.

aesharpe commented 2 years ago

Ya, I actually responded asking what he meant by "instantaneous" and he said:

"Because our data is now in XBRL – a standard for data – with the right tools and knowledge, it can easily be linked to just about any other dataset."

And then I asked him whether they had an EIA crosswalk and he said no....

zaneselvans commented 2 years ago

Some other useful XML / XBRL links I've come across so we don't lose them (XBRL is a particular flavor of XML):

MichaelTiemannOSC commented 2 years ago

Having spent a few weeks looking at the SEC DERA data (which originates in XBRL), the big caveat I would offer is that there seems to be very little foreign key enforcement. I've had some exchanges with the Structured Data Office (of the SEC) and they informed me about the public channel they use to comment on data quality issues they observe: https://www.sec.gov/structureddata/osdstaffobsandguide

It's all well and good to have a syntactic validator that ensures that files are parseable. But what will be very important is keeping an appropriately tight leash on how XBRL submissions remain within the guidelines of the data model and taxonomies, and that we don't see a flowering of 1000 different descriptions of the same fundamental data type (which, though discouraged, is permissible in the SEC's world, and readily observed). For example, in the first quarter of each calendar year, over 4000 companies report their market cap (public float), and another 1000+ disclose in the other three quarters:

bash-3.2$ grep -c EntityPublic 2020q?/num.txt
2020q1/num.txt:4171
2020q2/num.txt:875
2020q3/num.txt:475
2020q4/num.txt:457

But one company reports EntitysPublicFloat:

bash-3.2$ grep EntitysPublic 20??q?/num.txt
2020q4/num.txt:0001213900-20-034148     EntitysPublicFloat      0001213900-20-034148            20200630        0       BRL     59342000.0000   
2020q4/num.txt:0001213900-20-034148     EntitysPublicFloat      0001213900-20-034148            20190630        0       BRL     53802000.0000   

Which, though permitted, is actually erroneous, and would be caught with proper validation.
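
The kind of check described above can be sketched in a few lines of Python. This is only an illustration, assuming the `num.txt` files are tab-delimited with a header row that includes a `tag` column; the function names `read_num_rows` and `find_suspect_tags`, the `known_tags` set, and the similarity cutoff are hypothetical choices and not part of any SEC or FERC tooling. It flags tags that are absent from a known taxonomy but look like near-misses of a tag that is in it.

```python
import csv
import difflib
from collections import Counter


def read_num_rows(lines):
    """Parse tab-delimited lines (header row first) into dicts.

    Assumes a 'tag' column exists in the header.
    """
    return csv.DictReader(lines, delimiter="\t")


def find_suspect_tags(rows, known_tags, cutoff=0.9):
    """Return {unknown_tag: (closest_known_tag, row_count)}.

    A tag is "suspect" when it is not in known_tags but is very similar
    to one that is, which suggests a misspelled custom extension.
    """
    counts = Counter(row["tag"] for row in rows)
    candidates = sorted(known_tags)  # sorted for deterministic matching
    suspects = {}
    for tag, n in counts.items():
        if tag in known_tags:
            continue
        match = difflib.get_close_matches(tag, candidates, n=1, cutoff=cutoff)
        if match:
            suspects[tag] = (match[0], n)
    return suspects
```

With a real file this might be called as `find_suspect_tags(read_num_rows(open("2020q4/num.txt")), known_tags)`, where `known_tags` would have to be loaded from the published taxonomy. String similarity is a heuristic: it would catch `EntitysPublicFloat`, but not a differently worded synonym for the same concept.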

zschira commented 2 years ago

Arelle seems to be the de facto (as much as there is one) open source standard for working with XBRL. FERC also provides plugins for rendering and validation using Arelle.

Arelle also provides a plugin for conversion to a SQL database. It does the conversion in a highly generalized way that seems difficult to work with, but this may be a route we could take. Arelle does also have a Python API that could be useful, but the documentation is sparse.
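
For a sense of what the raw instance files look like without going through Arelle, here is a minimal sketch that uses only the Python standard library. It is not Arelle's API and does no taxonomy validation. The sample document, the `ex` namespace, and the `UtilityPlant` concept are made up for illustration; only the `xbrli` instance namespace and the `contextRef` / `unitRef` attributes come from the XBRL 2.1 standard.

```python
import xml.etree.ElementTree as ET

XBRLI = "{http://www.xbrl.org/2003/instance}"

# Hypothetical, hand-written instance document (not a real FERC filing).
SAMPLE = """\
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:ex="http://example.com/hypothetical-taxonomy">
  <xbrli:context id="c1">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/ids">C000001</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2021-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:unit id="USD"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <ex:UtilityPlant contextRef="c1" unitRef="USD" decimals="0">1000</ex:UtilityPlant>
</xbrli:xbrl>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.split("}", 1)[-1]


def extract_facts(xml_text):
    """Return top-level facts as dicts, joined to their context's period."""
    root = ET.fromstring(xml_text)

    # Map each context id to its period (instant, or startDate/endDate).
    periods = {}
    for ctx in root.iter(f"{XBRLI}context"):
        period = ctx.find(f"{XBRLI}period")
        children = period if period is not None else []
        periods[ctx.get("id")] = {
            local_name(child.tag): (child.text or "").strip() for child in children
        }

    # Any direct child of the root carrying a contextRef is treated as a fact.
    facts = []
    for elem in root:
        ctx_id = elem.get("contextRef")
        if ctx_id is None:
            continue
        facts.append({
            "concept": local_name(elem.tag),
            "period": periods.get(ctx_id),
            "unit": elem.get("unitRef"),
            "value": (elem.text or "").strip(),
        })
    return facts
```

Calling `extract_facts(SAMPLE)` yields one record per fact, which could then be loaded into a DataFrame or a database table. The sketch skips facts nested inside tuples, dimensions, and footnotes, all of which a real extractor would need to handle.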

zschira commented 2 years ago

Closing. See #1530 for status.