Closed: ian-r-rose closed this issue 11 months ago
There are a few things I would look for in an orchestration tool when conducting a workflow like this. I would score each tool along these dimensions, from "makes me happy" to "makes me sad".
The GIS stack often requires custom software environments. That is to say, whatever default image the tool uses will not do the job, so we'll need to provide our own image. We'll want to evaluate how easy it is to build our own image and provide it to the orchestration tool.
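For concreteness, the "bring your own image" step usually amounts to building and pushing an image to a registry on each change. A minimal sketch of the invocations involved, with a hypothetical ECR registry and image name (the helper just composes the docker CLI commands so the shape is easy to inspect):

```python
# Sketch: composing the docker commands needed to build and publish a custom
# GIS image to a registry. The registry, image name, and tag below are
# hypothetical placeholders, not anything a specific tool requires.
from typing import List


def image_publish_commands(registry: str, name: str, tag: str) -> List[List[str]]:
    """Return the docker CLI invocations to build and push a custom image."""
    ref = f"{registry}/{name}:{tag}"
    return [
        ["docker", "build", "--tag", ref, "."],
        ["docker", "push", ref],
    ]


# The two commands a CI job would run for each new image version.
for cmd in image_publish_commands("123456789.dkr.ecr.us-east-1.amazonaws.com", "gis-pipeline", "v1"):
    print(" ".join(cmd))
```

The evaluation question is then how much friction each tool adds on top of this: some let you point a job at the pushed image with one config field, others make you rebuild agents or executors around it.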
Do they provide a compute cluster or other batch-like service for running custom jobs? Or will we need to bring our own K8s/ECS/Batch cluster? If the latter, how easy is it to set up?
If we do have services running in our own cloud account, how easy is it to interact with them? Are there nice user interfaces for securely providing service account credentials?
How painful is it to deploy new versions of a pipeline? Are there CI tools or custom GitHub Actions for doing this? Ideally it is simple to deploy on merge.
This is one of the most important dimensions: all of the major orchestrators have been implementing some level of integration with dbt (which provides its own DAG abstraction). How pleasant are these integrations to use?
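The baseline that any of these integrations improves on is an orchestrator task that simply shells out to the dbt CLI. A minimal sketch (the project path and selector are hypothetical; richer integrations such as dagster-dbt or Astronomer's Cosmos instead parse dbt's manifest so each model becomes its own node in the orchestrator's DAG):

```python
# Simplest possible dbt "integration": a task that shells out to the dbt CLI.
# The command builder is split out so the invocation is easy to inspect.
import subprocess
from typing import List


def dbt_run_command(project_dir: str, select: str) -> List[str]:
    """Build a `dbt run` invocation limited to a node selector."""
    return ["dbt", "run", "--project-dir", project_dir, "--select", select]


def run_dbt(project_dir: str, select: str) -> None:
    """Run dbt; raises CalledProcessError if any model fails."""
    subprocess.run(dbt_run_command(project_dir, select), check=True)


print(" ".join(dbt_run_command("./warehouse", "tag:nightly")))
```

The scoring question is how far beyond this baseline each tool goes: per-model retries, lineage in the UI, and selective reruns all require the manifest-aware integrations rather than a single opaque `dbt run` task.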
Also, it should go without saying that one of the overriding concerns for the test is "how easy is it to set up and maintain the infrastructure?".
If we want to hand off orchestration pipelines to clients, let's try to choose a tool with the highest probability of success.
I'll also want to compare these with an orchestrator-free approach that uses something like AWS Glue or AWS Batch jobs, and then kicks them off with GitHub Actions cron schedules. Approaches like these would have less infrastructure to manage, and would be cheaper to run. The downside is that you couldn't do any DAG creation, and you would lose out on the ecosystem of tools that these orchestrators bring with them.
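The orchestrator-free path could be as small as a script that a GitHub Actions cron job runs to submit an AWS Batch job. A sketch, assuming boto3 is available and with a made-up job name, queue, and job definition (the boto3 call is kept behind a separate function so the request can be inspected without AWS credentials):

```python
# Sketch of the orchestrator-free approach: a scheduled CI job submits an
# AWS Batch job and exits. All identifiers below are hypothetical.


def batch_job_request(name: str, queue: str, definition: str) -> dict:
    """Build the kwargs for boto3's batch.submit_job call."""
    return {"jobName": name, "jobQueue": queue, "jobDefinition": definition}


def submit(request: dict) -> str:
    """Submit the job; requires AWS credentials in the environment."""
    import boto3  # assumed installed; imported lazily so the builder is testable offline

    client = boto3.client("batch")
    return client.submit_job(**request)["jobId"]


print(batch_job_request("gis-nightly", "default-queue", "gis-pipeline:1"))
```

This makes the trade-off concrete: scheduling and retries come for free from Batch and the cron trigger, but any dependency between two such jobs has to be hand-rolled, since nothing here knows about a DAG.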
The design looks straightforward to me, and the considerations you've laid out seem sound.
> Also, it should go without saying that one of the overriding concerns for the test is "how easy is it to set up and maintain the infrastructure?".
> If we want to hand off orchestration pipelines to clients, let's try to choose a tool with the highest probability of success.
Given this, what alternatives to dbt might we consider if handoff is to a team without SQL chops, but strong R or Python chops? Do we double down and suggest SQL/dbt training?
Why is the output an email and who are the recipients of this email?
> Given this, what alternatives to dbt might we consider if handoff is to a team without SQL chops, but strong R or Python chops? Do we double down and suggest SQL/dbt training?
Good question. I'm not sure what the best way forward would be in that case, though there are a few options:
I'd probably lean towards option (2). dbt Python models are probably not mature enough to point clients at just yet (they are a very new feature), although using Snowpark could be an interesting option for some use cases. It's also probably harder to train even a proficient Python user in orchestration tools than it is for them to pick up SQL, because you have to reason more about containers, dependencies, and cloud ops.
I have less of an idea about R users, but I suspect that people who are familiar with dplyr would find SQL fairly straightforward.
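To illustrate why the transfer tends to be easy: a dplyr pipeline and its SQL equivalent line up almost verb-for-verb. A runnable sketch using Python's built-in sqlite3 with made-up sample data, with the dplyr version shown in a comment:

```python
# The same grouped summary written two ways: as a dplyr pipeline (comment)
# and as SQL against an in-memory SQLite table. Table and data are made up.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (county TEXT, area REAL)")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?)",
    [("alameda", 2.0), ("alameda", 4.0), ("marin", 10.0)],
)

# dplyr: parcels %>% group_by(county) %>% summarise(total = sum(area))
rows = conn.execute(
    "SELECT county, SUM(area) AS total FROM parcels GROUP BY county ORDER BY county"
).fetchall()
print(rows)  # → [('alameda', 6.0), ('marin', 10.0)]
```

`group_by`/`summarise` map directly onto `GROUP BY`/aggregate functions, `filter` onto `WHERE`, and `mutate` onto computed columns in `SELECT`, which is most of what a dbt model needs.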
> Why is the output an email and who are the recipients of this email?
The recipients for the bake-off would just be us, though in a real deployment it could be anyone.
As for why an email, I have a few reasons:
Closing as not planned for now, though we may revisit once the platform engineer comes on board.
Here is a draft design of an orchestration test:
The basic flow is as follows: