catalyst-cooperative / pudl

The Public Utility Data Liberation Project provides analysis-ready energy system data to climate advocates, researchers, policymakers, and journalists.
https://catalyst.coop/pudl
MIT License
456 stars 105 forks source link

Parallelize extraction of Excel spreadsheets #2385

Closed zaneselvans closed 8 months ago

zaneselvans commented 1 year ago

Reading data from Excel spreadsheets is slow and a lot of our inputs are published in this format, so it ends up being a bottleneck at the beginning of our ETL process. However, it should be readily parallelizable, since each data source is typically partitioned into many different spreadsheets which can all be read at the same time before the resulting dataframes are concatenated together. Nearly all of our Excel data is distributed in annual partitions, many of which also contain multiple files.

Data sources we currently extract from Excel are all partitioned annually and include:

- [x] `eia860` (2001-2021) + `eia860m`? (the `raw_eia860_assets` asset group)
- [x] `eia861` (2001-2021)
- [x] `eia923` (2001-2021) (the `raw_eia923_assets` asset group)

Future data sources that will also benefit from this work include:

If we want to rely on Dagster to manage the parallel extraction of these data sources, then it seems likely that each of these partitions would become its own Asset, with another Asset downstream that does the work of concatenating them all together and returning a single dataframe. With ~20 years of data, that would provide enough work units to keep all the cores busy on a single machine.

If it seemed worthwhile, we could further decompose the problem so that each individual file in each of the partitions was its own asset.

Implementation Ideas

bendnorman commented 1 year ago

Another option is to create a @graph_multi_assetfrom a dynamic graph that spawns a new op to process each year, then collects and concats the outputs. See @zschira's work in #2472 for an example of creating assets from dynamic graphs.

bendnorman commented 11 months ago

Hey @ggurjar333! This is a good first dagster related issue.

ggurjar333 commented 11 months ago

Hi @bendnorman I'm on it!

zaneselvans commented 11 months ago

There's a draft PR #2673 that implements the same parallelization method for the EIA-861 and EIA-923, but there were some differences in the data outputs that we needed to investigate before merging it in. See this comment