catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License

Use new CsvExtractor to refactor existing EPA CEMS and FERC 714 extraction #3451

Open e-belfer opened 3 months ago

e-belfer commented 3 months ago

Is your feature request related to a problem? Please describe.

In #3402 we implemented a new CsvExtractor class in pudl.extract.csv, subclassing a generic Extractor. We should update both of our existing CSV-based extractors to use this new class.

Describe the solution you'd like

For pudl.extract.ferc714 and pudl.extract.epacems, we should transition to subclassing the new CsvExtractor class instead of using ad-hoc functions. The refactored extractors should produce exactly the same outputs as the current ones, but use the new CsvExtractor infrastructure.

Describe alternatives you've considered

Retain the existing bespoke extractors.
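
To make the proposed shape concrete, here is a rough sketch of the subclassing pattern. It does not use the real pudl.extract.csv API: the `Extractor` and `CsvExtractor` classes below are simplified, hypothetical stand-ins, and the method names (`load_source`, `source_filename`, `extract`), the `Ferc714Extractor` name, and the sample file name are all assumptions made for illustration only.

```python
import csv
import io
from abc import ABC, abstractmethod


class Extractor(ABC):
    """Hypothetical stand-in for a generic extractor base class."""

    @abstractmethod
    def load_source(self, page, **partition):
        """Return the raw records for one page of one partition."""

    def extract(self, pages, **partition):
        # Shared driver logic lives in the base class, not in each dataset.
        return {page: self.load_source(page, **partition) for page in pages}


class CsvExtractor(Extractor):
    """Hypothetical stand-in for a CSV extractor: one CSV file per page."""

    def __init__(self, files):
        # `files` maps file name -> CSV text; a real extractor would read
        # from an archive or datastore instead (assumption for this sketch).
        self.files = files

    def source_filename(self, page, **partition):
        raise NotImplementedError

    def load_source(self, page, **partition):
        text = self.files[self.source_filename(page, **partition)]
        return list(csv.DictReader(io.StringIO(text)))


class Ferc714Extractor(CsvExtractor):
    """Dataset-specific subclass: only the file naming rule is bespoke."""

    def source_filename(self, page, **partition):
        return f"{page}.csv"  # hypothetical naming scheme
```

With this layout, the dataset-specific code shrinks to whatever differs between datasets (here, only the file naming), which is also what would make it straightforward to check that the refactored output matches the existing output.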

zaneselvans commented 3 months ago

Bonus points on this issue if you can figure out a way to dramatically reduce the memory usage of the EPA CEMS CSV extraction. It's currently a huge bottleneck: we can only process 2 EPA CEMS assets at a time, which ends up determining how long the overall ETL takes.
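
One generic way to bound memory during CSV extraction is to stream the file in fixed-size chunks rather than loading it whole. The sketch below uses only the standard library and is not based on the current EPA CEMS code; whether chunking addresses the actual bottleneck is an assumption that would need profiling. The function names and the `gross_load_mw` column are hypothetical.

```python
import csv
import io
from itertools import islice


def iter_csv_chunks(handle, chunk_size):
    """Yield lists of at most `chunk_size` row dicts from an open CSV file."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def total_gross_load(handle, chunk_size=2):
    """Aggregate one column while holding only one chunk in memory at a time."""
    total = 0.0
    for chunk in iter_csv_chunks(handle, chunk_size):
        total += sum(float(row["gross_load_mw"]) for row in chunk)
    return total


# Tiny in-memory stand-in for a large CSV file.
sample = "plant_id,gross_load_mw\n1,10.5\n1,20.0\n2,4.5\n"
chunk_sizes = [len(c) for c in iter_csv_chunks(io.StringIO(sample), 2)]
```

Peak memory here scales with `chunk_size` rather than with file size. In a pandas-based pipeline the analogous tool is the `chunksize` argument to `pandas.read_csv`, with each chunk transformed and written out before the next is read.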