Open ian-r-rose opened 1 year ago
This is a fairly basic comparison of a bunch of the available options for workflow orchestration. There are also some more closed-off proprietary solutions from AWS, GCP, etc. (e.g., Glue) that I haven't really evaluated. They often try to be low-code and serverless. I tend to be a bit more skittish around single-cloud options.
A high-level note: Our general approach right now is to prefer an ELT-style workflow over an ETL-style one. So while these orchestration tools allow for quite complex DAGs, I would tend to treat them more as simple hosted services for running small loading scripts on a schedule. This means that some of the neat, more advanced features of these tools would be less relevant to our (initial) deployments.
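To make "small loading script" concrete, here is a rough sketch of the kind of job I have in mind: it copies rows into a raw table without transforming them, leaving typing and cleaning to the warehouse. Everything in it is an assumption for illustration: `load_raw_csv` is a hypothetical helper, and `sqlite3` just stands in for whatever warehouse client we end up using.

```python
import csv
import io
import sqlite3


def load_raw_csv(conn, table, csv_text):
    """Load CSV rows into `table` unchanged (the "EL" of ELT); return the row count.

    Hypothetical helper: `conn` is a sqlite3 connection standing in for a
    real warehouse client, and `table` is assumed to be a trusted name.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Every column is declared as TEXT: no parsing or cleaning happens here.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    rows = list(reader)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)
```

An orchestrator would only need to run something like this on a schedule and surface failures; the transformation step would live elsewhere.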
Airflow is the oldest and most popular orchestration tool that is still widely used today.
Pros
Cons
There are at least three managed offerings of Airflow available from major vendors:
Option | Managed | Environment isolation | Python | bash | R | Secrets Manager | Notifications | Web UI | Local Dev Tooling | dbt integration | |
---|---|---|---|---|---|---|---|---|---|---|---|
GCP Composer (Airflow) | Yes | GKE pods or python virtualenvs | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Yes | Yes | Yes |
AWS MWAA (Airflow) | Yes | ECS, EKS, or virtualenv | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Yes | Yes | |
Astronomer (Airflow) | Yes | KubernetesPodOperator, virtualenvs | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Yes | Yes | |
Self-managed Airflow | No | KubernetesPodOperator, virtualenv | Yes | Yes | Sort of | Yes | Hand-rolled | Yes | Hand-rolled | Yes | Yes |
Prefect Cloud | Yes | Kubernetes pods and Docker containers | Yes | Yes | Coming soon? | Yes | Yes | Yes | Yes | Yes | |
Dagster Cloud | Yes | Kubernetes pods, ECS, Docker containers | Yes | Yes | No | Yes | Yes | Yes | Yes | Yes |
Would be particularly interested in hearing @jasonlally's thoughts about the above.
@ian-r-rose - this is great! Thanks for putting this together.
Some thoughts:
What do you think of doing an intentional test drive on a simple workflow? If we do that, let's hash out a test plan before starting any assessment.
As a side note, I found Astronomer to be really great last I used it. They really focus on developer experience and have some nice useful cli tools that wrap around airflow and make managing deployments easier.
> I like that dagster and prefect don't need hand rolled notifications
I don't want to stress it too much, since I think setting up AWS SES or sendgrid for an airflow deployment isn't too much work. That said, I was a bit surprised to read that Astronomer doesn't do this for you, as it seems like a fairly simple value-add. But perhaps I'm not understanding their docs correctly.
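For a sense of what "hand-rolled" amounts to, here is a minimal sketch using only the Python standard library. It assumes the mail provider exposes an SMTP endpoint with STARTTLS; the function names, subject format, and parameters are all made up for illustration, and wiring it into Airflow (for example through a task's `on_failure_callback`) is left out.

```python
import smtplib
from email.message import EmailMessage


def build_failure_email(dag_id, task_id, error, sender, recipient):
    """Build a plain-text failure notice (hypothetical helper and format)."""
    msg = EmailMessage()
    msg["Subject"] = f"[orchestration] {dag_id}.{task_id} failed"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Task {task_id} in DAG {dag_id} failed:\n\n{error}")
    return msg


def send_email(msg, host, port, username, password):
    """Send `msg` through an SMTP relay; host and credentials are assumptions."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()  # upgrade the connection before authenticating
        smtp.login(username, password)
        smtp.send_message(msg)
```

The credentials would come from whichever secrets manager the deployment uses rather than being hard-coded.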
> As a side note, I found Astronomer to be really great last I used it. They really focus on developer experience and have some nice useful cli tools that wrap around airflow and make managing deployments easier.
That's good to hear. I previously self-managed an airflow deployment, and it was a significant amount of work. The developer experience around these managed offerings has improved a lot.
> What do you think of doing an intentional test drive on a simple workflow? If we do that, let's hash out a test plan before starting any assessment.
Sure, I think that would be instructive. One idea for a test plan could be to load the Microsoft building footprints dataset, as there are a few things that make it a moderately challenging job which might flush out issues:
This is relevant to @melanie-logan working on data loading options.
@ian-r-rose should we close this since Melanie is working on evaluation now? Or I guess not since we still want to do more eval on orchestration. We can keep open, but makes sense to me to reassign.
Makes sense to me!
Yes, I can do a separate orchestration Eval. Thanks!
We intend to do most transformation in our data warehouse(s) via dbt, but there is still need for scheduled loads of custom data. So while a full DAG framework might be overkill at this point, some sort of workflow orchestration tool is worthwhile.
Some requirements:
Nice-to-have: