argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Implement parallelization to speed up S3 artifacts upload and download #12442

Open nitays opened 8 months ago

nitays commented 8 months ago

Summary

Use an S3 client that supports parallel transfers, such as s5cmd, to make S3 artifact uploads and downloads much faster. This is crucial when working with many and/or large files.

Use Cases

When a Workflow handles many and/or large artifacts through an S3 artifact repository, downloading the input artifacts at the start of each step and uploading the outputs at the end can take longer than the step's actual processing.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

tooptoop4 commented 8 months ago

One caution is that a lot of these libs don't handle files larger than 5 GB, e.g. https://github.com/peak/s5cmd/issues/29

pranavarnav2 commented 8 months ago

> One caution is that a lot of these libs don't handle files larger than 5 GB, e.g. peak/s5cmd#29

That is only when copying files from bucket to bucket, which I don't think is a use case with Argo Workflows artifacts. I've just copied a 10GB file from my local computer to an S3 bucket with s5cmd with no problem.
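For context on why a 10 GB upload can succeed: S3 caps a single PUT or single-operation copy at 5 GB, while a multipart upload splits the object into parts of at least 5 MiB each, with at most 10,000 parts. The sketch below only does that arithmetic. It assumes the client uploads via multipart, and `partCount` is a hypothetical helper, not part of s5cmd or Argo.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	mib          = int64(1) << 20
	minPartSize  = 5 * mib      // S3 minimum part size (the last part may be smaller)
	maxPartCount = int64(10000) // S3 maximum number of parts per multipart upload
)

// partCount returns how many parts a multipart upload of `size` bytes needs
// when each part is `partSize` bytes.
func partCount(size, partSize int64) (int64, error) {
	if size <= 0 {
		return 0, errors.New("size must be positive")
	}
	if partSize < minPartSize {
		return 0, errors.New("part size is below the 5 MiB minimum")
	}
	n := (size + partSize - 1) / partSize // ceiling division
	if n > maxPartCount {
		return 0, fmt.Errorf("%d parts exceeds the 10,000-part limit; use a larger part size", n)
	}
	return n, nil
}

func main() {
	// A 10 GiB file with an assumed 64 MiB part size.
	n, err := partCount(10*1024*mib, 64*mib)
	fmt.Println(n, err)
}
```

Because the parts are independent, they can also be uploaded concurrently, which is where a parallel client gains speed on a single large file.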

nitays commented 2 weeks ago

Any update on this?