Closed zaneselvans closed 1 year ago
Another option is to create a @graph_multi_asset
from a dynamic graph that spawns a new op to process each year, then collects and concats the outputs. See @zschira's work in #2472 for an example of creating assets from dynamic graphs.
Hey @ggurjar333! This is a good first dagster related issue.
Hi @bendnorman I'm on it!
There's a draft PR #2673 that implements the same parallelization method for the EIA-861 and EIA-923, but there were some differences in the data outputs that we needed to investigate before merging it in. See this comment
Reading data from Excel spreadsheets is slow and a lot of our inputs are published in this format, so it ends up being a bottleneck at the beginning of our ETL process. However, it should be readily parallelizable, since each data source is typically partitioned into many different spreadsheets which can all be read at the same time before the resulting dataframes are concatenated together. Nearly all of our Excel data is distributed in annual partitions, many of which also contain multiple files.
Data sources we currently extract from Excel are all partitioned annually and include:
Future data sources that will also benefit from this work include:
phmsagas
(1970-2021)eiawater
(2014-2021)If we want to rely on Dagster to manage the parallel extraction of these data sources, then it seems likely that each of these partitions would become its own
Asset
, with anotherAsset
downstream that does the work of concatenating them all together and returning a single dataframe. With ~20 years of data, that would provide enough work units to keep all the cores busy on a single machine.If it seemed worthwhile, we could further decompose the problem so that each individual file in each of the partitions was its own asset.
Implementation Ideas
pudl.extract.excel.GenericExtractor.extract()
method to produce a DagsterAssetGroup
with one asset per requested partition, and another asset that depends on each of those partitions and combines them into a single dataframe.