cagov / data-infrastructure

CalData infrastructure
https://cagov.github.io/data-infrastructure

Orchestration bake-off design #138

Closed: ian-r-rose closed this issue 11 months ago

ian-r-rose commented 1 year ago

Here is a draft design of an orchestration test:

[diagram: orchestration_bakeoff]

The basic flow is as follows (a rough code sketch of the whole flow appears after the list):

  1. Load the MS building footprints dataset into our data warehouse. This starts out partitioned by state, and we'll probably load each state into a separate file. This will be done by the orchestration tool.
  2. Load TIGER census shapes into the data warehouse. This will be done by the orchestration tool.
  3. Load California city and county boundaries from the state of California. This will be done by the orchestration tool.
  4. Union the per-state MS building footprints tables into a single dataset. This will be done by dbt.
  5. Join the unioned dataset with the census shapes to attach FIPS codes as well as city and county boundaries. This will be done by dbt.
  6. Generate an ad-hoc report from the final dataset. Something like "Find all the building footprints in these three counties and determine whether they intersect with this fire district".
  7. Email that report as a PDF attachment.
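
To make the handoff between steps concrete, here is a minimal sketch of what this might look like in one of the candidate tools (Prefect, chosen purely for illustration). All task bodies, table names, and the state list are placeholders, not part of the design:

```python
# Hypothetical sketch of the bake-off pipeline as a Prefect flow.
import subprocess

from prefect import flow, task

STATES = ["CA", "OR", "NV"]  # subset, for illustration only

@task
def load_footprints(state: str) -> str:
    # Step 1: copy one state's MS building footprints file into the warehouse.
    return f"raw_footprints_{state.lower()}"

@task
def load_tiger_shapes() -> str:
    # Step 2: load TIGER census shapes.
    return "raw_tiger_shapes"

@task
def load_ca_boundaries() -> str:
    # Step 3: load California city and county boundaries.
    return "raw_ca_cities_counties"

@task
def run_dbt() -> None:
    # Steps 4-5: dbt unions the per-state tables and joins the shapes.
    subprocess.run(["dbt", "build"], check=True)

@task
def email_report() -> None:
    # Steps 6-7: render the ad-hoc report to PDF and email it.
    ...

@flow
def orchestration_bakeoff():
    footprints = [load_footprints.submit(s) for s in STATES]
    tiger = load_tiger_shapes.submit()
    boundaries = load_ca_boundaries.submit()
    dbt_run = run_dbt.submit(wait_for=[*footprints, tiger, boundaries])
    email_report.submit(wait_for=[dbt_run])
```

Each tool has its own idiom for expressing these dependencies; part of the point of the bake-off is comparing how natural each one feels.
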
ian-r-rose commented 1 year ago

There are a few things I would look for in an orchestration tool when conducting a workflow like this. Along each of these dimensions, I would score them from "makes me happy" to "makes me sad".

Custom software environments

The GIS stack often requires custom software environments. That is to say, whatever default image the tool uses will not do the job, so we'll need to provide our own image. We'll want to evaluate how easy it is to build our own image and provide it to the orchestration tool.
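
As one hedged illustration of what "providing our own image" involves (not a recommendation), in Airflow a task can run inside an arbitrary container via KubernetesPodOperator; the image name, command, and DAG name below are all placeholders:

```python
# Hypothetical example: running a GIS task in a custom image on Kubernetes.
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import (
    KubernetesPodOperator,
)

with DAG("gis_bakeoff", start_date=datetime(2023, 1, 1), schedule=None):
    load_footprints = KubernetesPodOperator(
        task_id="load_footprints",
        name="load-footprints",
        image="ghcr.io/cagov/gis-runner:latest",  # placeholder custom GIS image
        cmds=["python", "-m", "jobs.load_footprints"],  # placeholder entrypoint
        arguments=["--state", "CA"],
    )
```
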

Compute resources

Do they provide a compute cluster or other batch-like service for running custom jobs? Or will we need to bring our own K8s/ECS/Batch cluster? If the latter, how easy is it to set up?

Integration with AWS/GCP services

If we do have services running in our own cloud account, how easy is it to interact with them? Are there nice user interfaces for securely providing service account credentials?
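
For a sense of what "securely providing credentials" can look like, Prefect (again just as an example) stores secrets as blocks that tasks load at runtime; the block name here is a placeholder:

```python
# Hypothetical example: loading a stored credential at runtime in Prefect.
# The block name "aws-service-account" is a placeholder.
from prefect.blocks.system import Secret

aws_secret_key = Secret.load("aws-service-account").get()
```
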

API access and CI integration

How painful is it to deploy new versions of a pipeline? Are there CI tools or custom GitHub Actions for doing this? Ideally, deploying on merge is simple.

dbt Integration

This is one of the most important dimensions: all of the major orchestrators have been implementing some level of integration with dbt (which provides its own DAG abstraction). How pleasant are these integrations to use?
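
As a concrete example of such an integration, the prefect-dbt collection wraps dbt CLI invocations as tasks, so dbt's internal DAG runs as a single step of the larger flow; the project and profiles paths below are placeholders:

```python
# Hypothetical example using the prefect-dbt collection.
from prefect import flow
from prefect_dbt.cli.commands import DbtCoreOperation

@flow
def transform():
    DbtCoreOperation(
        commands=["dbt build"],
        project_dir="transform/",   # placeholder path to the dbt project
        profiles_dir="~/.dbt",      # placeholder path to dbt profiles
    ).run()
```
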

ian-r-rose commented 1 year ago

Also, it should go without saying that one of the overriding concerns for the test is "how easy is it to set up and maintain the infrastructure?".

If we want to hand off orchestration pipelines to clients, let's try to choose a tool with the highest probability of success.

ian-r-rose commented 1 year ago

I'll also want to compare these with an orchestrator-free approach that uses something like AWS Glue or AWS Batch jobs, kicked off by GitHub Actions crons. Approaches like these would have less infrastructure to manage and would be cheaper to run. The downside is that you couldn't express dependencies between jobs as a DAG, and you'd lose out on the ecosystem of tools that these orchestrators bring with them.
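
A rough sketch of that path, assuming an AWS Batch queue and job definition already exist (all names are placeholders); a GitHub Actions cron would run something like:

```python
# Hypothetical script run on a GitHub Actions cron: submit an AWS Batch job.
# The job, queue, and job definition names are placeholders.
import boto3

batch = boto3.client("batch")
batch.submit_job(
    jobName="load-footprints",
    jobQueue="caldata-bakeoff",
    jobDefinition="gis-runner:1",
)
```
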

britt-allen commented 1 year ago

The design looks straightforward to me and the considerations you have seem sound.

britt-allen commented 1 year ago

> Also, it should go without saying that one of the overriding concerns for the test is "how easy is it to set up and maintain the infrastructure?".
>
> If we want to hand off orchestration pipelines to clients, let's try to choose a tool with the highest probability of success.

Given this, what alternatives to dbt might we consider if handoff is to a team without SQL chops, but strong R or Python chops? Do we double down and suggest SQL/dbt training?

britt-allen commented 1 year ago

Why is the output an email and who are the recipients of this email?

ian-r-rose commented 1 year ago

> Given this, what alternatives to dbt might we consider if handoff is to a team without SQL chops, but strong R or Python chops? Do we double down and suggest SQL/dbt training?

Good question. I'm not sure what the best way forwards would be in that case, though there are a few options:

  1. Just drop dbt+Snowflake and use an ETL tool like Airflow, Prefect, or Dagster.
  2. Train them in SQL. We don't necessarily need advanced SQL usage; we can go a long way with relatively simple selects, filters, and joins.
  3. Point them at dbt Python models and Snowflake's Snowpark Python API (a dbt Python model is sketched after this list).
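
For reference, a minimal dbt Python model looks like the following; the referenced model name and the transformation are hypothetical:

```python
# Hypothetical dbt Python model. On Snowflake, dbt.ref() returns a Snowpark
# DataFrame; the model name and filter below are placeholders.
def model(dbt, session):
    footprints = dbt.ref("stg_building_footprints")
    # Arbitrary example transformation: keep only California rows.
    return footprints.filter(footprints["STATE"] == "CA")
```
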

I'd probably lean towards option (2). dbt Python models are probably not mature enough to point clients at just yet (they are a very new feature), although using Snowpark could be an interesting option for some use cases. It's also probably harder to train even a proficient Python user in orchestration tools than it is to teach them SQL, because orchestration requires reasoning about containers, dependencies, and cloud operations.

I have less of an idea about R users, but I suspect that people who are familiar with dplyr would find SQL fairly straightforward.

ian-r-rose commented 1 year ago

> Why is the output an email and who are the recipients of this email?

The recipients for the bake-off would just be us, though in a real deployment it could be anyone.

As for why an email, I have a few reasons:

  1. Integrating with an email service like AWS SES does require a bit of work, so I'd be curious how easy it is for each solution (a sketch of what's involved follows this list).
  2. I have a pet theory that a lot of dashboards would be better as an automated email sent on a regular schedule to stakeholders.
  3. It's also just a stand-in for whatever dashboard, report, ML model, etc. we would have in a real DIF project.
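
To illustrate item (1): sending a PDF attachment through SES means assembling a raw MIME message. Addresses and the report path below are placeholders:

```python
# Hypothetical example: emailing a PDF report through AWS SES.
import boto3
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Building footprints report"
msg["From"] = "reports@example.org"  # placeholder sender
msg["To"] = "team@example.org"       # placeholder recipient
msg.set_content("The latest report is attached.")
with open("report.pdf", "rb") as f:
    msg.add_attachment(
        f.read(), maintype="application", subtype="pdf", filename="report.pdf"
    )

boto3.client("ses").send_raw_email(RawMessage={"Data": msg.as_bytes()})
```
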
ian-r-rose commented 11 months ago

Closing as not planned for now, though we may revisit with the platform engineer coming on board.