dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Multi-image jobs #4277

Open johannkm opened 3 years ago

johannkm commented 3 years ago

Currently Dagster's containerized deployments run pipelines using a single image (retrieved from the repository location). In some cases, different solids within a pipeline have vastly different dependencies that are difficult to combine into a single image.

Approach 1: Override the image per solid

At runtime, specify image tags for each solid. Dagster will assume that each image contains the same Python code for the solid, since the pipeline definition will only be read from the main image.
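As a purely hypothetical illustration of Approach 1 (the per-solid "image" tag key and the image names below are invented for this sketch, not an existing API), using the solid/pipeline APIs from the era of this issue:

```python
from dagster import pipeline, solid


# Hypothetical: an "image" tag the executor could read to pick a container per solid.
@solid(tags={"image": "acme/geo-processing:1.4"})
def reproject(context):
    context.log.info("would run inside acme/geo-processing:1.4")


@solid(tags={"image": "acme/ml-scoring:2.0"})
def score(context, upstream):
    context.log.info("would run inside acme/ml-scoring:2.0")


@pipeline
def multi_image_pipeline():
    score(reproject())
```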

Approach 2: Offer better tooling for launching containers from within solids

Alternatively, solids could continue to run with a single image, but a resource could be provided for launching a command inside another container with a different image. This removes the need to keep your pipeline code synced across all of your images, and the other images wouldn't even need to include Dagster or Python. The inherent downside is that the compute inside the launched container wouldn't be able to use Dagster, e.g. it wouldn't have access to logging (though there are workarounds: the main solid could watch the container and forward its logs).
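A minimal sketch of Approach 2 using the docker Python SDK from inside a solid, assuming the solid's environment can reach a Docker daemon; the image name and command are placeholders:

```python
import docker
from dagster import pipeline, solid


@solid
def run_in_other_image(context):
    """Launch a one-off container from a different image and relay its output."""
    client = docker.from_env()
    # The launched image does not need dagster or even python installed.
    container = client.containers.run(
        "acme/fortran-model:latest",  # placeholder image
        command=["./run_model", "--input", "/data/in"],  # placeholder command
        detach=True,
    )
    # Workaround for the logging gap: stream the container's stdout/stderr
    # into Dagster's structured logs.
    for line in container.logs(stream=True, follow=True):
        context.log.info(line.decode(errors="replace").rstrip())
    result = container.wait()
    if result.get("StatusCode", 1) != 0:
        raise Exception(f"container exited with status {result['StatusCode']}")


@pipeline
def launch_container_pipeline():
    run_in_other_image()
```

In a real deployment this would live behind a resource, so the Docker client configuration (or a Kubernetes equivalent) could be swapped out per environment.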


Message from the maintainers:

Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.

nicklofaso commented 3 years ago

Very excited for a feature like this!

As I mentioned in a previous comment, our pipeline involves steps with vastly different dependencies and programming languages (Python, R, Fortran, C++, etc.), so trying to combine all of them into a single image would be extremely difficult.

I'm new to Dagster, but I think Approach 1 provides the most value.

darrenhaken commented 3 years ago

I am also excited by this as a feature.

As I discussed in Slack, we currently use Airflow extensively with the Pod Operator. We have found that developing Docker images around custom operations offers several advantages.

I see a question came up around logging. It would be handy if Dagster could parse the stdout/stderr from the Docker container and roll that into the Dagster logs. Airflow does something similar.

johannkm commented 2 years ago

Current status is that this is possible using https://github.com/dagster-io/dagster/pull/4818, but it's experimental and likely not the end state that we'd like to reach.
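Roughly, the sketch below shows what a per-step image override can look like with the k8s_job_executor and the dagster-k8s/config op tag. It assumes the container_config portion of that tag accepts an image override in your Dagster version (check the dagster-k8s docs for your release), and the image names are placeholders:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {"image": "acme/geo-processing:1.4"},  # placeholder image
        }
    }
)
def reproject(context):
    context.log.info("this step runs in its own pod with the geo-processing image")


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {"image": "acme/ml-scoring:2.0"},  # placeholder image
        }
    }
)
def score(context, upstream):
    context.log.info("this step runs in its own pod with the ml-scoring image")


@job(executor_def=k8s_job_executor)
def multi_image_job():
    score(reproject())
```

Because the k8s_job_executor runs each step in its own Kubernetes pod, a per-step image is meaningful; the job definition is still loaded from the main code image, so each step image needs the same job code importable.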

henripal commented 2 years ago

Would it be possible to see a minimal working example using https://github.com/dagster-io/dagster/pull/4818 to run a multi-image pipeline?

evanvolgas commented 2 weeks ago

I'd love to see a working example of a multi-image pipeline too. I've read the changes in https://github.com/dagster-io/dagster/pull/4818 but it's not clear to me how to put them to use.