Prefect merge discussion

bendnorman commented 2 years ago

This issue is to discuss how we want to the work in the Prefect branch #901 in.

Questions

1. What are our top infrastructure issues? Does the Prefect branch get us closer or further to addressing those issues? Some of these issues are being enumerated in #1401.

Here are a couple of our issues and how the prefect branch addresses them:

Cache steps of the pipeline:
- Currently, extracting excel files seems like an unnecessary bottleneck. A majority of the EIA processing is spent in pd.read_excel(). Given the raw excel files do not change, caching this extraction step and any other long-running processing tasks would dramatically speed up development and testing.
- Prefect: The prefect branch allows you to set checkpoints to determine which tasks should be cached. For example, if you wanted to cache the extraction step you would update the @task(checkpoint=False) parameter for transformation tasks (all downstream tasks will also not be cached). This is a bit awkward right now but it's a step in the right direction.
Data processing uniformity
- We don't have a common recipe for adding and processing datasets. EIA and Ferc 1 are extracted in different ways, and it is unclear where certain data cleaning should happen, in the transform methods, output, or analysis tables? How do we add new datasets?
- Prefect: The branch introduces the abstract class DatasetPipeline. The class requires you to implement the dataset settings and a build() method that adds processing tasks to the main flow. I think this is a decent start and could be expanded. Maybe instead of having a catch-all build method, a pipeline should implement extract, transform and load methods? The questions regarding how to extract data and where to apply transformations are not addressed by the branch.
Ability to handle large datasets (EQR)
- EPACEMS takes about 1:20 mins to run on most of our machines. This is decent for our purposes but won't work for larger datasets like FERC EQR.
- Prefect: Prefect allows us to easily parallelize data processing. This branch provides a small boost in CEMS performance (~20 min).
Access to interim tables
- There is some desire to access partially processed data tables for analyses. (Which tables?)
- Prefect: The branch wasn't designed to handle writing interim tables to an accessible/reusable location.

2. If we are to merge the prefect branch in, what would be the next steps?

Create tasks for each table instead of each dataset.
Incorporate validation into the etl run (not prefect related?).
Orchestrate the output and analysis tables with prefect. We should probably clean up the post ETL pipeline before adding prefect.
Continue to make our processing more uniform.
Try running it in the cloud on a single machine and cluster.
Setup prefect server.

zaneselvans commented 2 years ago

Another potential benefit of Prefect is generating an explicit set of dependencies between different products. This could help with debugging and understanding what parts of the overall system are affected by changes. It could also allow us to re-run only those parts of the data processing that need to be re-run to generate new outputs, based on what's been changed.

The pre-database set of dependencies isn't super complex at the moment, but as the data get recombined in more complex ways further downstream it seems hard to keep track of what changes impact which other parts.

zaneselvans commented 2 years ago

Also, as we get more datasets that are unrelated to each other -- at least early on in the pipeline -- being able to process them in parallel should speed things up considerably. There's no reason for the early stages of the EIA 860/861/923/176 or FERC 1/2/EQR or PHMSA data to have to wait on each other. They should all be able to run in parallel until you get to the steps where they're interacting (e.g. entity resolution for EIA data).

bendnorman commented 2 years ago

Good point, Prefect produces very useful visual documentation of dependencies.

The pre-database set of dependencies isn't super complex at the moment, but as the data get recombined in more complex ways further downstream it seems hard to keep track of what changes impact which other parts.

Are you referring to output and analysis tables? If so, using prefect to orchestrate the creation of those tables makes sense.

bendnorman commented 2 years ago

Not to throw a wrench in this decision but I spent some time reading about another python orchestration tool called Dagster. It seems much more data engineering focused where as Prefect tries to be as flexible as possible.

Was Dagster ever considered when we set out to parallelize our ETL? If so, why did we choose Prefect instead?

I’ve highlighted some interesting features that I don’t think Prefect addresses.

Interesting features

Asset lineage: Dagster can automatically generate linage information that describe how assets depend on one another.
Run specific steps: You can easily rerun the DAG from a specific step.
Rerun steps based on code changes: Dagster can use versions to determine whether or not it is necessary to re-execute a particular step. Given versions of the code from each op in a job, the system can infer whether an upcoming execution of a step will differ from previous executions by tagging op outputs with a version. This allows for the outputs of previous runs to be re-used.
Type Checking: Dagster provides an optional typing system for inputs and outputs of @ops (tasks). Can also yield metadata / summary statistics about the inputs and outputs. Support data validation for pandas data frames.
Process Configs: “In many situations, we’d like to be able to configure ops at run time. For example, we may want whomever is running the job to decide what dataset it operates on. Configuration is useful when you want the person or system that is launching a job to be able to make choices about what the job does, without needing to modify the job definition.” You can specify a schema for a configuration. This is essentially our settings management with pydantic.

Other

Unlike Prefect, the UI, orchestration and core are one tool. Setting up the Dragster UI and core was trivial. However, Prefect Orion will combine Prefect Core, Server and UI into one tool.
Documentation on how to deploy Dagster in various environments.
Dagster provides a lot of examples including a full example project.

zaneselvans commented 2 years ago

Are you referring to output and analysis tables? If so, using prefect to orchestrate the creation of those tables makes sense.

Yes, the MCOE, the Plant Plarts Plist, the fuzzily-merged FERC+EIA data, the utility + balancing authority service territories, the hourly state-level electricity demand, etc -- basically any "stock" analysis that we create that generates a generally useful tabular output should eventually be getting written to a DB, so that anyone can use it without need to run code -- they can just install it using one of the data catalogs.

zaneselvans commented 2 years ago

A good video overview of Dagster.

After spending an hour with the tutorial and video walkthrough it seems really similar to Prefect, but with stronger opinions and more defined structures. But those opinions and structures seem to be pretty directly aimed at our patterns of use.

Another think I liked was how easy they made it seem to swap out local disk, database, or object store persistence, so testing, CI, cloud versions of the same run could all be easily done only swapping out the persistence layer / objects.

I like the focus on typing, enumerating, validating, tracking of the assets that are produced as much as or even more than the DAG itself. They seem more like a necessary glue in Prefect, but not a focus.

The tracking of how assets evolve over time is also attractive -- being able to see how the number of rows in a table has changed, and the schema of the table would be really useful.

What I really want is an opinionated tool for our use case with good opinions since left to our own devices we'll probably come up with bad opinions, since we aren't experts.

Another thing I liked was the ability to swap out different persistence layers so that local testing / CI / deployment can use the same data processing code, with the saving of assets decoupled from the rest of the system.

bendnorman commented 2 years ago

We have decided to move forward with Dagster. See epic #1487 for the full reasoning.

catalyst-cooperative / pudl