catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
481 stars 110 forks source link

Parallelize extraction of Excel spreadsheets #2385

Closed zaneselvans closed 1 year ago

zaneselvans commented 1 year ago

Reading data from Excel spreadsheets is slow and a lot of our inputs are published in this format, so it ends up being a bottleneck at the beginning of our ETL process. However, it should be readily parallelizable, since each data source is typically partitioned into many different spreadsheets which can all be read at the same time before the resulting dataframes are concatenated together. Nearly all of our Excel data is distributed in annual partitions, many of which also contain multiple files.

Data sources we currently extract from Excel are all partitioned annually and include:

- [x] `eia860` (2001-2021) + `eia860m`? (the `raw_eia860_assets` asset group)
- [x] `eia861` (2001-2021)
- [x] `eia923` (2001-2021) (the `raw_eia923_assets` asset group)

Future data sources that will also benefit from this work include:

If we want to rely on Dagster to manage the parallel extraction of these data sources, then it seems likely that each of these partitions would become its own Asset, with another Asset downstream that does the work of concatenating them all together and returning a single dataframe. With ~20 years of data, that would provide enough work units to keep all the cores busy on a single machine.

If it seemed worthwhile, we could further decompose the problem so that each individual file in each of the partitions was its own asset.

Implementation Ideas

bendnorman commented 1 year ago

Another option is to create a @graph_multi_assetfrom a dynamic graph that spawns a new op to process each year, then collects and concats the outputs. See @zschira's work in #2472 for an example of creating assets from dynamic graphs.

bendnorman commented 1 year ago

Hey @ggurjar333! This is a good first dagster related issue.

ggurjar333 commented 1 year ago

Hi @bendnorman I'm on it!

zaneselvans commented 1 year ago

There's a draft PR #2673 that implements the same parallelization method for the EIA-861 and EIA-923, but there were some differences in the data outputs that we needed to investigate before merging it in. See this comment