catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
468 stars, 107 forks

Investigate slow speed of XBRL parsing #1557

Closed (zschira closed this issue 2 years ago)

zschira commented 2 years ago

Parsing XBRL currently takes a very long time: more than 30 minutes to process a single year's worth of filings. Preliminary profiling suggests that Arelle accounts for most of that time, but a more detailed analysis is needed.
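For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of profiling a parsing step with the standard library's `cProfile`. The functions `slow_parser` and `process_filings` are hypothetical stand-ins, not PUDL or Arelle code; the point is only to show how cumulative time per function can be collected and inspected.

```python
import cProfile
import io
import pstats
import time


def slow_parser(doc):
    """Hypothetical stand-in for an expensive third-party parsing call."""
    time.sleep(0.01)  # simulate work done inside the library
    return len(doc)


def process_filings(docs):
    """Hypothetical driver that parses every filing in a batch."""
    return [slow_parser(doc) for doc in docs]


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the costliest call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


result, report = profile_call(process_filings, ["<xbrl/>"] * 5)
print(report)
```

Sorting by cumulative time makes it easy to see whether a single dependency dominates the run, which is the kind of signal described above.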

zschira commented 2 years ago

I've switched to using lxml to parse individual XBRL instances, and now use Arelle only to parse the FERC taxonomy, which is much more complex. This has drastically improved performance: processing a full year of filings went from over 30 minutes to just a few. Closing.
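As a rough illustration of why instance documents are cheap to handle directly, here is a minimal sketch of pulling facts out of an XBRL instance with lxml. This is not the actual PUDL implementation: the sample document, the `ex` namespace, and the `extract_facts` helper are all made up for the example, and the import falls back to the standard library's ElementTree only so the sketch still runs where lxml is not installed.

```python
try:
    from lxml import etree  # the parser mentioned above
except ImportError:  # fallback so the sketch runs without lxml installed
    import xml.etree.ElementTree as etree

# Hypothetical, heavily trimmed instance document; real filings are far larger.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<xbrli:xbrl xmlns:xbrli="http://www.xbrl.org/2003/instance"
            xmlns:iso4217="http://www.xbrl.org/2003/iso4217"
            xmlns:ex="http://example.com/taxonomy">
  <xbrli:context id="c1">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/ids">C000001</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period><xbrli:instant>2021-12-31</xbrli:instant></xbrli:period>
  </xbrli:context>
  <xbrli:unit id="u1"><xbrli:measure>iso4217:USD</xbrli:measure></xbrli:unit>
  <ex:UtilityPlant contextRef="c1" unitRef="u1" decimals="0">1200</ex:UtilityPlant>
  <ex:RespondentName contextRef="c1">Example Utility Co</ex:RespondentName>
</xbrli:xbrl>
"""


def extract_facts(xml_bytes):
    """Return one dict per fact found at the top level of an instance."""
    root = etree.fromstring(xml_bytes)
    facts = []
    for el in root:
        # Facts reference a context; context and unit elements do not.
        context_ref = el.get("contextRef")
        if context_ref is None:
            continue
        # Tags look like "{namespace}LocalName"; split the two parts.
        namespace, _, name = el.tag[1:].partition("}")
        facts.append(
            {
                "namespace": namespace,
                "name": name,
                "context": context_ref,
                "unit": el.get("unitRef"),  # None for non-numeric facts
                "value": (el.text or "").strip(),
            }
        )
    return facts


facts = extract_facts(SAMPLE)
for fact in facts:
    print(fact["name"], fact["context"], fact["unit"], fact["value"])
```

An instance is essentially a flat list of contexts, units, and facts, so a single pass over the root's children is enough here. Resolving what each fact means still requires the taxonomy, which is the part left to Arelle.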