Support p2p volume streaming when workers are in different network

concourse / concourse

Concourse is a container-based continuous thing-doer written in Go.

https://concourse-ci.org

Apache License 2.0


Support p2p volume streaming when workers are in different network #7017

Open xtremerui opened 3 years ago

xtremerui commented 3 years ago

Summary

The performance gain from enabling p2p volume streaming is promising, as shown in https://github.com/concourse/concourse/pull/6186. However, when workers live in different networks, volume streaming will fail because the two workers might not be able to reach each other, e.g. a worker in the same cluster as the web node and an external worker.

If we allow a worker to carry an indicator of its network, then we could limit p2p volume streaming to workers in the same network. We could then remove the enable-p2p-volume-streaming feature flag, or even make p2p streaming the default behaviour.
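As a rough illustration of the idea, the decision could look something like the Go sketch below. Everything in it is hypothetical: the `Worker` struct, its `Network` field, the `StreamMode` values, and `ChooseStreamMode` are names made up for this example and are not part of Concourse's codebase. It assumes that a worker without a declared network should never be picked for p2p, and that the fallback is the existing path where the volume is relayed through the web node.

```go
package main

// Worker is a hypothetical, simplified view of a registered worker.
// Network is the assumed "network indicator" proposed in this issue;
// it is not an existing Concourse field.
type Worker struct {
	Name    string
	Network string // empty means the worker did not declare a network
}

// StreamMode describes how a volume travels between two workers.
type StreamMode string

const (
	// StreamP2P sends the volume directly from the source worker
	// to the destination worker.
	StreamP2P StreamMode = "p2p"
	// StreamViaWeb relays the volume through the web node.
	StreamViaWeb StreamMode = "via-web"
)

// ChooseStreamMode picks p2p only when both workers declare the same,
// non-empty network. Every other combination falls back to relaying
// through the web node, so workers that cannot see each other are
// never asked to connect directly.
func ChooseStreamMode(src, dst Worker) StreamMode {
	if src.Network != "" && src.Network == dst.Network {
		return StreamP2P
	}
	return StreamViaWeb
}
```

With a rule like this, two workers that both declare the network `cluster` would stream p2p, while a `cluster` worker and an external worker declaring a different network (or none at all) would keep using the relayed path.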

Context

RFC: concourse/rfcs#82
Prior discussion: #
Depends on https://github.com/concourse/concourse/pull/6186
Part of #

anEXPer commented 1 year ago

The multi-IaaS/multi-cloud use case is core to both the CF and TAS CI use cases of Concourse, as well as to "resource gateway" CD patterns in TAS customer installation environments, where some workers (able to reach the internet) fetch releases from vendors and put them in a blobstore, and other (non-internet-routable) workers grab them from the blobstore and upload them to environments. In all of these cases the performance gains offered here are highly relevant, since the artifacts that need to get streamed around tend toward the multi-gigabyte.

All of these use cases use tags, but tags can't be guaranteed to fit the network topology. Doing something to actually indicate or detect the network would be good, but maybe there's a world where tags offer an alternative, too.

I think this enhancement/fix would be worth an investment.

kallisti5 commented 4 months ago

This one we have been struggling with as well.

We have concourse builders located at various physical sites over slower links, and it ends up being a horrible mess of tags to try and wrangle the environment.

Shipping volumes between nodes seems nonsensical. (Why ship a 4GiB volume from node A to node B when neither node has any additional backlog work queued?) Jobs with multiple steps waste a lot of time shuffling volumes around for each step.