cagov / data-infrastructure

CalData infrastructure
https://cagov.github.io/data-infrastructure
MIT License

Investigate orchestration options #4

Open ian-r-rose opened 1 year ago

ian-r-rose commented 1 year ago

We intend to do most transformation in our data warehouse(s) via dbt, but there is still a need for scheduled loads of custom data. So while a full DAG framework might be overkill at this point, some sort of workflow orchestration tool is worthwhile.

Some requirements:

  1. Execute arbitrary Python, bash scripts
  2. User-interface for viewing scheduled runs, failures, metadata
  3. Notifications upon failures
  4. Retries
  5. Encrypted secrets
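As a rough illustration of what requirements 1, 3, 4, and 5 amount to, here is a minimal standard-library sketch. It is not how any of the tools discussed in this thread implement these features; `run_with_retries`, `get_secret`, and the `notify` callback are hypothetical names made up for this example, and reading secrets from environment variables is an assumption.

```python
import os
import subprocess
import sys
import time
from typing import Callable, Sequence


def run_with_retries(
    command: Sequence[str],
    retries: int = 3,
    delay_seconds: float = 5.0,
    notify: Callable[[str], None] = print,
) -> bool:
    """Run a command, retrying on a non-zero exit code.

    Calls `notify` once if every attempt fails. Returns True on success.
    """
    attempts = max(1, retries + 1)
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay_seconds)  # wait before the next attempt
    notify(
        f"{' '.join(command)} failed after {attempts} attempts: "
        f"{result.stderr.strip()}"
    )
    return False


def get_secret(name: str) -> str:
    """Read a secret from the environment rather than hard-coding it (assumed convention)."""
    value = os.environ.get(name)
    if value is None:
        raise KeyError(f"secret {name!r} is not set")
    return value


if __name__ == "__main__":
    # Any executable works here: a Python script, a bash script, or Rscript.
    run_with_retries([sys.executable, "-c", "print('loaded')"], retries=2)
```

An orchestration tool adds scheduling, a UI, and run history on top of this kind of loop, which is the part that is tedious to hand-roll.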

Nice-to-have:

  1. Execute R scripts
  2. Python virtualenv isolation
  3. Managed option (I don't want to personally keep some servers up)

ian-r-rose commented 1 year ago

Orchestration options

This is a fairly basic comparison of a bunch of the available options for workflow orchestration. There are also some more closed-off proprietary solutions from AWS, GCP, etc. (e.g., Glue) that I haven't really evaluated. They often try to be low-code and serverless. I tend to be a bit more skittish around single-cloud options.

A high-level note: Our general approach right now is to prefer an ELT-style workflow over an ETL-style one. So while these orchestration tools allow for quite complex DAGs, I would tend to treat them as more simple hosted services for running small loading scripts on a schedule. This means that some of the neat, more advanced features of these things would be less relevant to our (initial) deployments.
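To illustrate the ELT split described above, here is a minimal sketch in which the scheduled script only loads raw rows, and the transformation happens afterwards as SQL inside the warehouse (the role dbt would play). SQLite stands in for the warehouse, and the table names, column names, and sample values are all hypothetical.

```python
import sqlite3
from typing import Iterable, Tuple


def load_raw(conn: sqlite3.Connection, rows: Iterable[Tuple[str, str]]) -> None:
    """Extract-and-load step: store rows as-is (all text), with no cleanup."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_footprints (county TEXT, building_count TEXT)"
    )
    conn.executemany("INSERT INTO raw_footprints VALUES (?, ?)", rows)
    conn.commit()


def transform_in_warehouse(conn: sqlite3.Connection) -> None:
    """Transform step: plain SQL run inside the warehouse after loading."""
    conn.execute("DROP TABLE IF EXISTS county_totals")
    conn.execute(
        """
        CREATE TABLE county_totals AS
        SELECT county, SUM(CAST(building_count AS INTEGER)) AS total
        FROM raw_footprints
        GROUP BY county
        """
    )
    conn.commit()
```

Under this split the orchestrator only needs to run `load_raw`-style scripts on a schedule, which is why the more advanced DAG features matter less here.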

Airflow

Airflow is the oldest and most popular orchestration tool that is still widely used today.

Pros

Cons

There are at least three managed offerings of Airflow available from major vendors:

GCP Cloud Composer

AWS MWAA

Astronomer

Prefect Cloud

Dagster Cloud

Feature Comparisons

Option Managed Environment isolation Python bash R Secrets Manager Notifications Web UI Local Dev Tooling dbt integration
GCP Composer (Airflow) Yes GKE pods or python virtualenvs Yes Yes Sort of Yes Hand-rolled Yes Yes Yes Yes
AWS MWAA (Airflow) Yes ECS, EKS, or virtualenv Yes Yes Sort of Yes Hand-rolled Yes Yes Yes
Astronomer (Airflow) Yes KubernetesPodOperator, virtualenvs Yes Yes Sort of Yes Hand-rolled Yes Yes Yes
Self-managed Airflow No KubernetesPodOperator, virtualenv Yes Yes Sort of Yes Hand-rolled Yes Hand-rolled Yes Yes
Prefect Cloud Yes Kubernetes pods and Docker containers Yes Yes Coming soon? Yes Yes Yes Yes Yes
Dagster Cloud Yes Kubernetes pods, ECS, Docker containers Yes Yes No Yes Yes Yes Yes Yes

Would be particularly interested in hearing @jasonlally's thoughts about the above.

jasonlally commented 1 year ago

@ian-r-rose - this is great! Thanks for putting this together.

Some thoughts:

What do you think of doing an intentional test drive on a simple workflow? If we do that, let's hash out a test plan before starting any assessment.

jasonlally commented 1 year ago

As a side note, I found Astronomer to be really great last I used it. They really focus on developer experience and have some nice useful cli tools that wrap around airflow and make managing deployments easier.

ian-r-rose commented 1 year ago

> I like that dagster and prefect don't need hand rolled notifications

I don't want to stress it too much, since I think setting up AWS SES or sendgrid for an airflow deployment isn't too much work. That said, I was a bit surprised to read that Astronomer doesn't do this for you, as it seems like a fairly simple value-add. But perhaps I'm not understanding their docs correctly.

> As a side note, I found Astronomer to be really great last I used it. They really focus on developer experience and have some nice useful cli tools that wrap around airflow and make managing deployments easier.

That's good to hear. I previously self-managed an airflow deployment, and it was a significant amount of work. The developer experience around these managed offerings has improved a lot.

> What do you think of doing an intentional test drive on a simple workflow? If we do that, let's hash out a test plan before starting any assessment.

Sure, I think that would be instructive. One idea for a test plan could be to load the Microsoft building footprints dataset, as there are a few things that make it a moderately challenging job which might flush out issues:

  1. It likely involves a custom software environment (i.e., something with the GDAL stack)
  2. It's on the larger side (i.e., may require provisioning larger instances, possibly even horizontal scalability)
  3. It changes somewhat regularly
  4. We may want several destinations (BQ, snowflake, parquet in S3)
  5. We know that (at least) DOF is interested in this dataset
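For item 4, one way the test job could be shaped is a single load step that fans the same batch out to several destinations and reports each outcome separately. This is only a sketch under assumptions: `load_to_destinations` and the writer callables are hypothetical, and real writers for BQ, Snowflake, or parquet in S3 would each wrap that service's own client library.

```python
from typing import Callable, Dict, Iterable, List

# A writer takes the full batch and raises an exception on failure.
Writer = Callable[[List[dict]], None]


def load_to_destinations(
    records: Iterable[dict], writers: Dict[str, Writer]
) -> Dict[str, str]:
    """Write the same batch to every destination.

    Returns a mapping of destination name to "ok" or an error message,
    so that one failing destination does not block the others.
    """
    batch = list(records)
    results: Dict[str, str] = {}
    for name, write in writers.items():
        try:
            write(batch)
            results[name] = "ok"
        except Exception as exc:  # report the failure and keep going
            results[name] = f"failed: {exc}"
    return results
```

Per-destination results like these are also what a failure notification or a retry of only the failed destination would be built on.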

jasonlally commented 9 months ago

This is relevant to @melanie-logan working on data loading options.

@ian-r-rose should we close this since Melanie is working on evaluation now? Or I guess not since we still want to do more eval on orchestration. We can keep open, but makes sense to me to reassign.

ian-r-rose commented 9 months ago

Makes sense to me!

melanie-logan commented 9 months ago

Yes, I can do a separate orchestration Eval. Thanks!