Open gibsondan opened 1 year ago
Hi, I'd like to add my input. I am trying to build a bioinformatics pipeline in Dagster. A bit of background: most bioinformatics tools are command-line tools, e.g. BWA, samtools, bedtools, GATK, FastQC, etc. In Dagster this means I need to have these tools available in my environment and run some kind of shell-command op. For example, I have an op that downloads a BAM file and computes statistics on it with samtools. I then read the output of the shell command and return it.
To do this kind of thing with ECS it makes sense to have each op as a different ECS task. That way I can use a different container for each op, where the container has the bioinformatics command-line utility required by the op. Also, some of these ops use a large amount of resources (CPU/memory) and others don't. I would like to scope the resources to the op based on the expected computational workload.
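As a minimal sketch of the samtools op described above (all names here are illustrative, and it assumes `samtools` is on the `PATH` of whatever container the op runs in): the op body shells out to `samtools flagstat`, parses the text output, and returns it as a dict.

```python
import re
import subprocess


def parse_flagstat(text: str) -> dict:
    """Parse `samtools flagstat` text output into {metric: QC-passed count}."""
    stats = {}
    for line in text.splitlines():
        # flagstat lines look like: "100 + 0 mapped (95.00% : N/A)"
        m = re.match(r"(\d+) \+ (\d+) (.+)", line)
        if m:
            passed, _failed, name = m.groups()
            stats[name] = int(passed)
    return stats


def bam_stats(bam_path: str) -> dict:
    """Body of a shell-command op: run samtools on a BAM file, return parsed stats."""
    result = subprocess.run(
        ["samtools", "flagstat", bam_path],
        capture_output=True,
        text=True,
        check=True,  # raise (and fail the op) if samtools exits nonzero
    )
    return parse_flagstat(result.stdout)
```

Wrapping `bam_stats` in a Dagster `@op` is then straightforward; the point of this issue is that each such op may need a different image (samtools, GATK, ...) and different CPU/memory, which is what a per-op ECS executor would provide.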
@gibsondan will this feature be implemented any time soon? Alternatively, I could implement an @op where, using boto3, I trigger the ECS task, watch it, and then raise an error in case of failure.
Why am I interested in this? I generally like to have my orchestration separated from dbt, and I don't want to copy dbt models around; instead I bake a dbt image with all the models and then trigger the image as a container (either in ECS or K8s). I've been working with this solution in Airflow, but I would like to consider Dagster instead of Airflow for orchestration.
Here's some additional input: it's very typical for machine learning pipelines to include steps that need to execute on very different hardware -- data transformation may need lots of CPU/memory or very little (e.g. when executed on a remote Spark/Dask cluster), and training may require instances with GPUs. A single job would often need to do all these steps, and running everything on a GPU instance may be cost-prohibitive.
Any updates on whether this will ever be supported?
What's the use case?
Running each op of a Dagster job in its own ECS task (similar to the k8s_job_executor) to make it easier to horizontally scale jobs with many ops, and give better isolation between ops.
Ideas of implementation
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.