To reiterate some of our goals with dagster:
My initial iterations of using dagster for PUDL mostly used ops and graphs. Wrapping pudl functions in these abstractions felt like adding an additional layer of complexity (see discussion in #1835). Wrapping functions in ops also didn't provide a clear way to persist interim, output, and analysis tables to the database. These issues led me to dagster's other paradigm: software-defined assets.
Pros of software-defined assets:
My plan is to convert each pudl function that extracts or cleans a dataframe into an asset or multi_asset. For example, extraction functions will be wrapped in multi_assets, producing one asset per raw dataframe. Most individual table transform functions will become assets because they can depend on multiple tables and produce a single asset. There will be no need for loading functions because IO Managers will handle loading to storage (sqlite and parquet). A rough sketch follows.
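For concreteness, here's a minimal sketch of the shape I have in mind (the table names are illustrative and the extraction/cleaning logic is elided):

```python
import pandas as pd
from dagster import AssetOut, Output, asset, multi_asset


@multi_asset(
    outs={
        "raw_plants_eia860": AssetOut(),
        "raw_generators_eia860": AssetOut(),
    }
)
def extract_eia860():
    """Extract raw EIA-860 dataframes; each one becomes its own asset."""
    raw_dfs = {
        "raw_plants_eia860": pd.DataFrame(),      # placeholder for real extraction
        "raw_generators_eia860": pd.DataFrame(),  # placeholder for real extraction
    }
    for name, df in raw_dfs.items():
        yield Output(df, output_name=name)


@asset
def plants_eia860(raw_plants_eia860, raw_generators_eia860):
    """Transform raw inputs into one cleaned table; the IO manager persists it."""
    return raw_plants_eia860  # real cleaning logic would go here
```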
This refactor will touch everything, and it will take a while! To keep the new branch and dev from getting super out of sync, I propose we apply these changes in two phases. The first phase will convert the ETL processes to assets, and the second phase will convert the output and analysis tables. This will make an already-monster PR a little less monstrous. I believe converting just the ETL initially won't be an issue because the PudlTabl class is pretty distinct from the ETL.
Here are some rough steps for the first phase:
At each step, I'd love to get feedback to make sure I'm building the right stuff! Once people approve the EIA ETL changes, I'll start to update the test suite to accommodate the dagster changes.
We can use the Resource.from_id() method to generate the dtypes to pass to the dtype arg of the pd.DataFrame().to_sql() method. There's also the apply_pudl_dtypes() helper function, which looks up a dataframe's dtypes in our metadata and converts the columns to the correct dtypes. This is great for reading the fundamental tables, but the fields in our interim and output tables aren't guaranteed to be in our metadata right now. We either have to add metadata for our output and interim tables or rely on pandas' convert_dtypes() function.
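As a sketch of that choice (enforce_dtypes() is a hypothetical wrapper, not an existing pudl function, and I'm assuming apply_pudl_dtypes() can be called with just a dataframe):

```python
import pandas as pd
from pudl.helpers import apply_pudl_dtypes


def enforce_dtypes(df: pd.DataFrame, in_metadata: bool) -> pd.DataFrame:
    """Apply PUDL metadata dtypes when the table is covered, else fall back."""
    if in_metadata:
        # Columns are looked up in the PUDL metadata and cast accordingly.
        return apply_pudl_dtypes(df)
    # Interim/output tables without metadata: pandas' best-effort nullable dtypes.
    return df.convert_dtypes()
```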
Our current dagster refactor plan doesn't describe how to persist interim tables or how to apply dagster concepts to the output and analysis layers.
Data Persistence questions
Is it possible for multiple processes to write to a sqlite file one at a time? If not, we'll need to create a postgres db, most likely using docker. (See the sqlite concurrency sketch after this list.)
How do we persist interim data with schemas? The current metadata system needs all table schemas to be created at the same time.
Will performance be an issue when we have smaller ops? There is overhead in launching a process for every op.
What is the proper way to use a database as an IO Manager? Based on my understanding, the default IO Manager creates a storage directory for each run. How can we create a database for each run? Is this what we want? (A rough database IO Manager sketch also follows this list.)
If IO Managers aren't the best way to persist data to a database, what is a sensible way to write interim tables? A MaterializeAssets and a function that writes a table to the db? Is it sketchy for the interim table in the db and the in-memory dataframe to diverge? How should we manage the schemas for interim tables?
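On the first question: as a hedged aside, sqlite does serialize writers through file locking, so multiple processes can write one at a time if they're willing to wait for the lock. A minimal sketch using only the stdlib (WAL mode allows one writer alongside concurrent readers):

```python
import sqlite3

# Each process opens its own connection; a generous timeout makes writers
# queue for the file lock instead of failing with "database is locked".
con = sqlite3.connect("pudl.sqlite", timeout=30.0)
con.execute("PRAGMA journal_mode=WAL;")  # readers no longer block the writer

with con:  # transaction scope; the write lock is held only while writing
    con.execute("CREATE TABLE IF NOT EXISTS interim_demo (id INTEGER, val TEXT)")
    con.execute("INSERT INTO interim_demo VALUES (?, ?)", (1, "a"))

con.close()
```

This gives orderly turn-taking rather than truly concurrent writes, so heavy write contention could still push us toward postgres.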
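And on the IO Manager question, here's my rough guess at the shape of a database-backed IO Manager (a sketch, not a vetted implementation; the one-table-per-asset naming and the hard-coded path are assumptions):

```python
import sqlite3

import pandas as pd
from dagster import IOManager, io_manager


class SQLiteIOManager(IOManager):
    """Persist each asset as a table in a single sqlite file."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def handle_output(self, context, df: pd.DataFrame) -> None:
        table = context.asset_key.path[-1]  # asset name -> table name
        with sqlite3.connect(self.db_path) as con:
            df.to_sql(table, con, if_exists="replace", index=False)

    def load_input(self, context) -> pd.DataFrame:
        table = context.asset_key.path[-1]
        with sqlite3.connect(self.db_path) as con:
            return pd.read_sql_query(f"SELECT * FROM {table}", con)


@io_manager
def sqlite_io_manager(init_context):
    # A per-run database could parameterize this path instead of hard-coding it.
    return SQLiteIOManager("pudl.sqlite")
```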
Output layer questions
Which output tables need CEMS data?
How can we give the output and analysis tables access to the ETL outputs? Sensors? Passing ETL op outputs to output ops (this might require merging everything into one job)? (One asset-based option is sketched after this list.)
How would SDA work with SQL views?
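On giving output tables access to ETL outputs: one option (hedged; mcoe is just an example name, and this assumes the output tables live in the same asset graph as the ETL) is to define them as ordinary downstream assets, so dagster wires the dependency for us instead of needing sensors:

```python
import pandas as pd
from dagster import asset


@asset
def mcoe(plants_eia860: pd.DataFrame, generation_fuel_eia923: pd.DataFrame) -> pd.DataFrame:
    """An output-layer table as a downstream asset: dagster loads the upstream
    ETL assets via the IO manager and persists this result the same way."""
    return plants_eia860.merge(generation_fuel_eia923, on="plant_id_eia", how="left")
```

For SQL views, the asset body could presumably execute a CREATE VIEW statement instead of returning a dataframe, though how that interacts with IO Managers is exactly the open question above.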