argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Archive to artifact repository #4162

Open alexec opened 3 years ago

alexec commented 3 years ago

Summary

Archived workflows could be written to an artifact repository, e.g. an S3 or GCS bucket

Use Cases

When would you use this?


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

dcherman commented 3 years ago

This would be fantastic for lowering the operational costs of running a workflow archive, since today you either have to DIY it or use managed Postgres/MySQL instances, which can be relatively expensive.

Some initial thoughts since I was researching/thinking about it this morning:

I don't know enough about GCS buckets; however, I don't think S3 alone is sufficient for what we would need to implement an alternate archive location. If it were a simple "give me the results of this workflow", we could use the file name as the key and it would be relatively straightforward. However, we have other access patterns like:

And any combination of the ones listed above. Implementing all of that with the LIST operation is likely to be slow/expensive for anything beyond experimental setups, and it gets worse if you have a long TTL on your archived workflows.
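To make the cost concrete, here is a rough Go sketch of what answering a filtered query with nothing but LIST would look like. The bucket name and key layout are assumptions for illustration, not anything Argo ships: every archived object has to be listed, and then fetched and parsed, before it can be filtered client-side.

package main

import (
    "fmt"
    "log"

    "github.com/aws/aws-sdk-go/aws"
    "github.com/aws/aws-sdk-go/aws/session"
    "github.com/aws/aws-sdk-go/service/s3"
)

func main() {
    sess := session.Must(session.NewSession())
    svc := s3.New(sess)

    // Hypothetical layout: one JSON object per archived workflow under
    // workflows/<namespace>/<uid>.json. S3 can only filter by key prefix,
    // so any label- or date-based query degenerates into a full scan.
    scanned := 0
    err := svc.ListObjectsV2Pages(&s3.ListObjectsV2Input{
        Bucket: aws.String("my-archive-bucket"), // assumed bucket name
        Prefix: aws.String("workflows/"),
    }, func(page *s3.ListObjectsV2Output, lastPage bool) bool {
        for _, obj := range page.Contents {
            // To evaluate a label selector we would have to GetObject
            // and parse every workflow; one round trip per object.
            _ = obj.Key
            scanned++
        }
        return true // keep paging; at most 1000 keys per LIST call
    })
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("scanned %d objects just to answer one query\n", scanned)
}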

Potential Solutions

Implement a separate index store

There's precedent for this in other projects (see Cortex, Loki) for storing the bulk of data in an object store while maintaining a separate index that contains metadata to direct you to the correct objects. In the case of the projects listed above, they support options that include DynamoDB, Bigtable, and Cassandra.
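As a sketch of the kind of record such an index might hold (the type and field names here are hypothetical, not taken from any of those projects): enough queryable metadata to serve the access patterns directly, with the object store key as the pointer back to the full workflow.

package archive

// Hypothetical index record; one row/item per archived workflow.
// Queryable fields live in the index; the full workflow JSON stays
// in the object store.
type ArchiveIndexEntry struct {
    UID        string            // primary key
    Namespace  string            // equality filter
    Name       string            // equality/prefix filter
    Phase      string            // e.g. Succeeded, Failed
    Labels     map[string]string // label-selector filters
    FinishedAt int64             // range filter and sort key for pagination
    ObjectKey  string            // pointer to the archived JSON in the bucket
}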

Loki has also recently shipped an index store called boltdb-shipper, which uses BoltDB as the index store and syncs the data to S3, eliminating the requirement for a separate service like DynamoDB. While this is pretty neat, it may also imply the need for persistence and/or multiple replicas in order to avoid data loss in the event that a workflow controller is lost or crashes.
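A minimal sketch of that idea using bbolt directly (the bucket and key scheme are assumptions for illustration): the index is just a local file, which is exactly why it would need to be shipped out or replicated to survive the controller pod being lost.

package main

import (
    "log"

    bolt "go.etcd.io/bbolt"
)

func main() {
    // The index lives in a single local file; this is what would need to
    // be periodically synced to S3 (and restored on startup) to survive
    // the controller being rescheduled.
    db, err := bolt.Open("archive-index.db", 0o600, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    err = db.Update(func(tx *bolt.Tx) error {
        b, err := tx.CreateBucketIfNotExists([]byte("by-uid"))
        if err != nil {
            return err
        }
        // Map workflow UID -> object store key of the archived JSON.
        return b.Put(
            []byte("0a1b2c3d"), // hypothetical workflow UID
            []byte("workflows/argo/0a1b2c3d.json"),
        )
    })
    if err != nil {
        log.Fatal(err)
    }
}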

Implement an HTTP archive to enable experimentation

As an interim solution, what would you think about implementing something like an HTTP persistence layer? i.e.

persistence:
  archive: false
  archiveTTL: 180d

  http:
    url: https://my-experimental-archive-service

  archiveLabelSelector:
    matchLabels:
      workflows.argoproj.io/archive-strategy: "always"

The idea behind that is that rather than building the archive implementation into Argo itself, we enable people to implement their own solutions against a well-known API. That would allow us to iterate on solutions outside of the project, and we could eventually merge one or more of them back in once they've been proven; the actual implementation of "official" solutions could then be moved to labs projects. One issue to consider here is how to secure those new implementations; it could be something as simple as a shared secret initially.
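To make the shape of that concrete, here is a rough sketch of what the implementer's side of such a well-known API might look like. The paths, payloads, and auth header are all hypothetical, not an existing Argo contract; it includes the shared-secret check mentioned above.

package main

import (
    "log"
    "net/http"
    "os"
)

func main() {
    secret := os.Getenv("ARCHIVE_SHARED_SECRET") // assumed auth scheme

    mux := http.NewServeMux()
    // Hypothetical contract: the controller POSTs the finished workflow
    // JSON here, and the server/UI GETs archived workflows back.
    mux.HandleFunc("/workflows", func(w http.ResponseWriter, r *http.Request) {
        if r.Header.Get("Authorization") != "Bearer "+secret {
            http.Error(w, "unauthorized", http.StatusUnauthorized)
            return
        }
        switch r.Method {
        case http.MethodPost:
            // Store r.Body (a workflow JSON document) however we like:
            // object store, database, flat files, etc.
            w.WriteHeader(http.StatusCreated)
        case http.MethodGet:
            // Serve list queries, e.g. ?namespace=...&labelSelector=...
            w.Header().Set("Content-Type", "application/json")
            w.Write([]byte("[]"))
        default:
            http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
        }
    })

    log.Fatal(http.ListenAndServe(":8080", mux))
}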

alexec commented 3 years ago

See the PR for testing instructions.