argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
15k stars 3.2k forks

Volumes Instead of Sidecars for the Artifact Repository #1024

Open vicaire opened 6 years ago

vicaire commented 6 years ago

FEATURE REQUEST: Volumes Instead of Sidecars to upload/download data to the default Artifact Repository

Hi, I was wondering why Argo decided to use a sidecar to download/upload data to GCS/S3/etc when using the Default Artifact Repository.

Did we consider using the Volume abstraction in Kubernetes? There are volume types for many kinds of storage, so adding a new kind of storage for the Default Artifact Repository could be as simple as implementing a new volume type.

https://ai.intel.com/kubernetes-volume-controller-kvc-data-management-tailored-for-machine-learning-workloads-in-kubernetes/

https://kubernetes.io/docs/concepts/storage/volumes/

vicaire commented 6 years ago

/cc @jlewi

wookasz commented 6 years ago

If I understand your question correctly: the sidecar ensures that the specified files are stored to a specific location in the artifact repository, and that specific files are fetched to a specific location in the container. Without a sidecar this could not be done purely through configuration; it would be up to the step logic to do it.

For example, if step 1 writes a.csv, b.csv, and tmp.csv to /output/ in the container, we may only want a.csv and b.csv stored as artifacts. Step 2 may only require b.csv. Furthermore, the step 2 container may expect the input file to be named input.csv, so a rename is required. The sidecar does all of this without requiring the step to contain that logic.

Without the sidecar, it would also be possible for steps to modify or delete the artifacts of other steps. That would remove what I believe to be a key feature of any workflow/pipeline manager: data provenance.
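For reference, this is roughly how the sidecar-based behavior described above is expressed in an Argo Workflow spec today: only declared output artifacts are uploaded, and the consuming step names its own input path. A minimal sketch (template and artifact names are illustrative, not from this thread):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: artifact-selection-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: step1
        template: produce
    - - name: step2
        template: consume
        arguments:
          artifacts:
          - name: in-csv
            from: "{{steps.step1.outputs.artifacts.b-csv}}"
  - name: produce
    container:
      image: alpine:3
      command: [sh, -c]
      args: ["mkdir -p /output && echo a > /output/a.csv && echo b > /output/b.csv && echo t > /output/tmp.csv"]
    outputs:
      artifacts:
      # Only a.csv and b.csv are uploaded by the sidecar; tmp.csv is ignored.
      - name: a-csv
        path: /output/a.csv
      - name: b-csv
        path: /output/b.csv
  - name: consume
    inputs:
      artifacts:
      - name: in-csv
        path: /input/input.csv   # fetched and renamed without any step logic
    container:
      image: alpine:3
      command: [cat, /input/input.csv]
```

The selection ("only a.csv and b.csv") and the rename (b.csv arriving as input.csv) both live in configuration, which is the point wookasz is making.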

Ark-kun commented 6 years ago

> It would also be possible for steps to modify/delete the artifact of another step.

The inputs volume can be mounted in read-only mode.

> the sidecar ensures that the specified files are stored to a specific location in the artifact repository, and that specific files are fetched to a specific location in the container. Without a sidecar it would not be possible to do this as configuration.

You can mount any inputs/outputs volume subpath to any container location.

E.g. for task3 that uses artifacts from task1 and task2:

- Mount <repository>/workflow1/task1/outputs/output1/ to /io/inputs/input1/ in read-only mode
- Mount <repository>/workflow1/task2/outputs/output1/ to /io/inputs/input2/ in read-only mode
- Mount <repository>/workflow1/task3/outputs/output1/ to /io/outputs/output1/ for writing
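The mounts above can be sketched with plain Kubernetes volume mounts using `subPath` and `readOnly`. This is a sketch of the proposal, not an existing Argo feature; the volume name, NFS server, and paths are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: task3
spec:
  containers:
  - name: main
    image: alpine:3
    command: [sh, -c, "cat /io/inputs/input1/* /io/inputs/input2/* > /io/outputs/output1/merged"]
    volumeMounts:
    # Inputs from upstream tasks, mounted read-only so task3 cannot
    # modify or delete their artifacts.
    - name: artifacts
      subPath: workflow1/task1/outputs/output1
      mountPath: /io/inputs/input1
      readOnly: true
    - name: artifacts
      subPath: workflow1/task2/outputs/output1
      mountPath: /io/inputs/input2
      readOnly: true
    # task3's own output directory, writable.
    - name: artifacts
      subPath: workflow1/task3/outputs/output1
      mountPath: /io/outputs/output1
  volumes:
  - name: artifacts
    nfs:                              # hypothetical repository backend;
      server: fileserver.example.com  # any Kubernetes volume type would do
      path: /exports/artifact-repo
```

Per-task `subPath` mounts plus `readOnly` inputs are what would preserve provenance in this scheme.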

> container may expect the input file to be named input.csv

Ideally, containers should only use paths received as command-line arguments.

vicaire commented 5 years ago

@wookasz

Let me reformulate a bit. Instead of the artifact repository being GCS, S3, or a Minio server, would it be possible to have an option to store the data in a volume?

Given the large number of volume implementations (NFS, GCP Cloud Filer, etc.), it seems that this would support a large number of use cases beyond object stores.
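Part of this is already possible in Argo via `volumeClaimTemplates`, which creates a PVC per workflow and lets steps share data over any volume implementation instead of an object store (though without the artifact-repository bookkeeping). A minimal sketch, with hypothetical claim and template names:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: volume-passing-
spec:
  entrypoint: main
  volumeClaimTemplates:        # one PVC created for the whole workflow
  - metadata:
      name: workdir
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
  templates:
  - name: main
    steps:
    - - name: generate
        template: generate
    - - name: consume
        template: consume
  - name: generate
    container:
      image: alpine:3
      command: [sh, -c, "echo hello > /mnt/vol/data.txt"]
      volumeMounts:
      - name: workdir
        mountPath: /mnt/vol
  - name: consume
    container:
      image: alpine:3
      command: [sh, -c, "cat /mnt/vol/data.txt"]
      volumeMounts:
      - name: workdir
        mountPath: /mnt/vol
```

The feature request here goes further: making a volume the backend of the Default Artifact Repository itself, so the input/output declarations shown earlier in the thread would keep working unchanged.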

vicaire commented 5 years ago

@IronPan @paveldournov

vicaire commented 5 years ago

@hongye-sun