JustFixNYC / nycdb-k8s-loader

Loading and updating of NYC-DB data via containerized batch processing.

Consider using Airflow/Dagster instead of (or in conjunction with) this #30

Open toolness opened 5 years ago

toolness commented 5 years ago

I recently found out about Airflow, which does a lot of what this tool does--it has many more features, but is also more complex. The main feature that would be nice to have is modeling tasks as Directed Acyclic Graphs (DAGs), which lets you declare dependencies between them--this would be especially good for the WoW dataset, which requires a lot of prerequisite datasets. Right now we manage that simply by scheduling the WoW dataset to be generated a few hours after its prerequisites, but it'd be nice to have something more robust.
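
For illustration, here's a minimal sketch of what that dependency chain could look like as an Airflow DAG. The dataset names and the `load_dataset.py` invocation are placeholders, not the actual nycdb identifiers or this repo's loader entry point:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="nycdb_load",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Each task shells out to the existing loader; dataset names are
    # illustrative, not the real nycdb dataset identifiers.
    load_pluto = BashOperator(
        task_id="load_pluto",
        bash_command="python load_dataset.py pluto_latest",
    )
    load_hpd_registrations = BashOperator(
        task_id="load_hpd_registrations",
        bash_command="python load_dataset.py hpd_registrations",
    )
    build_wow = BashOperator(
        task_id="build_wow",
        bash_command="python load_dataset.py wow",
    )

    # WoW runs only after its prerequisites finish, instead of being
    # scheduled "a few hours later" and hoping they're done.
    [load_pluto, load_hpd_registrations] >> build_wow
```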

toolness commented 5 years ago

The Data Engineering Podcast #43 mentions that the ideal use of Airflow is in conjunction with Kubernetes, so it's possible we could reuse a lot of this repository's code with it!

toolness commented 4 years ago

Another option is Dagster, which addresses a lot of flaws in Airflow. More context is in this 45-minute presentation from October 2019.
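
For comparison, here's a minimal Dagster sketch of the same idea (the op names are made up): ops declare their inputs, and Dagster derives the execution order from them rather than from a schedule offset.

```python
from dagster import job, op


@op
def load_pluto():
    """Load a prerequisite dataset (placeholder body)."""


@op
def load_hpd_registrations():
    """Load another prerequisite dataset (placeholder body)."""


@op
def build_wow(pluto_done, hpd_done):
    """Runs only once both upstream ops have completed."""


@job
def nycdb_load():
    build_wow(load_pluto(), load_hpd_registrations())
```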

jameslmartin commented 3 years ago

Hey @toolness - I've been following y'alls work at JustFix.nyc for about a year now and thought I might chime in. I think Apache Airflow would be a great tool for orchestrating these workflows. Writing DAGs might be a lift to start, but I think the flexibility they give you may provide more value in the long run. I would also suggest checking out Google Cloud's managed Airflow product called Cloud Composer. I'm not sure about budget/pricing, and I'm not a GCP expert, but I have had a great experience migrating jobs from a home-grown Airflow instance to Composer. The real headache with Airflow is operating/monitoring it in production, which a managed instance takes care of. It seems like you've run into issues with things silently failing, not retrying, etc. - a managed Airflow instance will likely help with that.
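
On the retry point, this is roughly what per-DAG defaults look like in Airflow (the specific values here are just illustrative, not a recommendation):

```python
from datetime import datetime, timedelta

from airflow import DAG

# Applied to every task in the DAG; the numbers are just examples.
default_args = {
    "retries": 3,                          # retry failed tasks automatically
    "retry_delay": timedelta(minutes=10),  # wait between attempts
    "email_on_failure": True,              # or wire up Slack/Rollbar alerts
}

with DAG(
    dag_id="nycdb_load_with_retries",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    ...  # task definitions go here
```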

Another major benefit of managed Airflow is that you can upgrade the Airflow version fairly easily, giving you access to new Airflow operators as they come out. Newer versions of Airflow also ship a KubernetesPodOperator, which gives you the flexibility to run individual DAG tasks as k8s pods. It looks like you all already use k8s to run jobs, so perhaps there is some good overlap there.
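
As a rough sketch of what that could look like for this repo (the image name, namespace, and dataset argument are all assumptions, not the project's actual configuration):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
    KubernetesPodOperator,
)

with DAG(
    dag_id="nycdb_load_k8s",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Runs the loader container as its own pod; image, namespace, and
    # dataset name are illustrative placeholders.
    load_acris = KubernetesPodOperator(
        task_id="load_acris",
        name="load-acris",
        namespace="default",
        image="justfixnyc/nycdb-k8s-loader:latest",
        cmds=["python", "load_dataset.py"],
        arguments=["acris"],
        get_logs=True,
    )
```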

A workflow I found that worked well was to set up CI between a GitHub repository and the Composer instance using GitHub Actions. Composer will automagically pick up new jobs by scanning an object store and searching for DAG files. You can sync the contents of the repo with a Google object bucket and voilà - your DAGs show up in Composer and run on whatever schedule you specify!
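
Here's a rough sketch of that sync step using the google-cloud-storage client (the bucket name and paths are made up); in practice a GitHub Actions step running `gsutil rsync` against the Composer bucket does the same thing:

```python
from pathlib import Path

from google.cloud import storage

BUCKET = "my-composer-environment-bucket"  # hypothetical Composer bucket

client = storage.Client()
bucket = client.bucket(BUCKET)

# Composer watches the bucket's dags/ prefix and picks up new files.
for dag_file in Path("dags").glob("*.py"):
    blob = bucket.blob(f"dags/{dag_file.name}")
    blob.upload_from_filename(str(dag_file))
    print(f"uploaded {dag_file} -> gs://{BUCKET}/dags/{dag_file.name}")
```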

A downside I would consider - though I don't know the scale of y'alls data - would be that a managed Airflow instance may require more CPU than you initially expect. For example, if you're parallelizing an ETL job and Composer spins up 5 k8s pods to perform the work, that could get costly fast. On the other hand, a managed instance will provide monitoring and alerting so you'll have insight into that behavior.

Happy to chat more if you want to brainstorm.

toolness commented 3 years ago

Oh cool, thanks @jameslmartin, this is very helpful information! Currently the job is parallelized via AWS Fargate, and I think it's not super cheap, but it gets done quickly. We also have a bunch of free AWS Credits, apparently, which makes AWS effectively cheaper than Google Cloud for us (unless of course we could snag some free GCP credits I guess).

Right now "silent failure" is no longer an issue since we added Rollbar integration in #54, so at least that's taken care of. But we still don't have automatic retries in place, and there may be other things Airflow/Dagster can help us with too.

At the moment we don't have bandwidth to iterate on this solution, but hopefully we will in the new year, and I'll loop back with you then. Thanks again for your thoughts!