Schille opened 3 weeks ago
First off: great RFC, @Schille! Just one quick question: what happens to the shadow infrastructure if the original workload changes while a bridge is active? I'm sure there's already handling for this case with the global bridge, but what is the appropriate procedure here?
@liquidiert Gotcha! A rollout of the original workloads would render the bridge useless, since Gefyra's patch would be reset. The Operator should reconcile all bridges, detect that situation, and take appropriate action (patching again and re-establishing the user bridges). Alternatively, it could declare the existing user bridges stale and remove them.
@Schille that sounds like a good reconciliation tactic, thanks!
This looks terrific, @Schille! Thanks a lot!
Please allow me to add some questions.
Again, great feature! Thank you, @Schille
@crkurz Thank you.
To your questions:
Intro
Gefyra currently supports "global bridge" only. See https://gefyra.dev/docs/run_vs_bridge/#bridge-operation to learn more. In short, a container within a running Pod (or multiple replicas) is replaced by a Gefyra component called Carrier. This allows Gefyra, with some constraints, to route traffic to a local container originally targeted to the specified Pod within the cluster.
This capability helps debug local containers using real traffic from the cluster, rather than synthetic local traffic. However, the bridge currently applies globally to all traffic directed to the bridged Pod, which may sometimes be undesired. It also means that only one bridge can exist per Pod, so only one user can bridge a given Pod at a time. With this feature proposal, we aim to lift that limitation in a flexible yet robust way.
This feature addresses the following issues:
Remark: Remember that one of Gefyra's fundamental premises is not to interfere with Kubernetes objects from your workloads. The proposed feature draft does not involve modifying existing deployment artifacts. Why? If something goes wrong (as things often do), we want Gefyra users to be able to restore the original state simply by deleting Pods. There may be residual objects or other additions left behind, but these should never disrupt the operations of the development cluster; Gefyra treats any such disruption as a bug and aims to minimize this risk.
What is the new feature about?
Gefyra's bridge operation will support specific routing configurations to intercept only matching traffic, while all unmatched traffic is served from within the cluster. Multiple users will be able to intercept different traffic simultaneously, receiving it on their local containers (started with `gefyra run ...`) to serve it with local code.
Departure
The main components involved in establishing a Gefyra bridge are:
- the connection provider (currently "Stowaway")
- the bridge provider (currently "Carrier")
Remark: Gefyra's cluster component architecture consists of different interfaces. The connection provider and bridge provider are two abstract concepts with defined interfaces. "Stowaway" and "Carrier" are the current concrete implementations of these interfaces. However, depending on the results of this implementation, I expect at least the latter to be replaced by a new component (perhaps Carrier2?). For consistency, I will continue to use these component names.
Overview
Carrier
Currently, Carrier is installed into 1 to N Pods. Each instance forwards any incoming traffic ("port x") to a single target endpoint ("upstream-1"). This process does not involve traffic introspection: IP packets come in and are sent out as-is. This setup is simple and fast. Carrier is based on the Nginx server and is therefore configured using the `stream` directive: https://nginx.org/en/docs/stream/ngx_stream_core_module.html#stream
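For illustration, here is a minimal sketch of how such a pass-through `stream` configuration could be rendered; the template, ports and addresses are made-up examples, not Carrier's actual configuration:

```python
# Conceptual sketch of Carrier's stream-based pass-through configuration.
# The template and all values are illustrative, not Carrier's real config.
STREAM_TEMPLATE = """
stream {{
    upstream upstream-1 {{
        server {upstream_host}:{upstream_port};
    }}
    server {{
        listen {listen_port};
        proxy_pass upstream-1;
    }}
}}
"""

def render_stream_config(listen_port: int, upstream_host: str,
                         upstream_port: int) -> str:
    # No traffic introspection: packets arriving on 'listen_port' ("port x")
    # are passed through to the single endpoint ("upstream-1") as-is.
    return STREAM_TEMPLATE.format(listen_port=listen_port,
                                  upstream_host=upstream_host,
                                  upstream_port=upstream_port)

print(render_stream_config(8080, "10.0.12.34", 8080))
```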
Feature draft
Stage 1: Installation & keep original Pods around to serve unmatched traffic
When a compatible service is bridged, we need the original workloads to serve any traffic that is not matched by a user bridge. Consider the following example: a compatible workload `<Y>` is selected by a Kubernetes service object. This workload consists of 3 Pods.

Once a user bridge is requested, Gefyra's Operator replicates all essential components (most importantly, the Pods and the service) by cloning and modifying them. Pod `<Y1'>` is modified on the fly so that it is selected by service `<Y'>`. The Pods `<Y1'>`, `<Y2'>` and `<Y3'>` must not be selected by service `<Y>`. Most other parameters, such as mounts, ports, probes, etc., should remain unchanged. The cloned workload infrastructure remains active as long as at least one Gefyra user bridge is active.
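To make the cloning step more tangible, here is a rough sketch using the official Kubernetes Python client; the clone's name suffix and the selector label are assumptions for illustration:

```python
from kubernetes import client, config

config.load_incluster_config()  # the Operator runs inside the cluster
core = client.CoreV1Api()

def clone_service(name: str, namespace: str) -> client.V1Service:
    """Create service <Y'> as a modified copy of service <Y>."""
    original = core.read_namespaced_service(name=name, namespace=namespace)
    clone = client.V1Service(
        metadata=client.V1ObjectMeta(name=f"{name}-gefyra",  # service <Y'>
                                     namespace=namespace),
        spec=original.spec,
    )
    # Drop cluster-assigned networking fields so the API server accepts it.
    clone.spec.cluster_ip = None
    clone.spec.cluster_i_ps = None
    # Select only the cloned Pods <Y1'>..<Y3'>; the clones carry this label
    # and must no longer match service <Y>'s original selector.
    clone.spec.selector = {"gefyra.dev/clone-of": name}  # label is an assumption
    return core.create_namespaced_service(namespace=namespace, body=clone)
```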
The Gefyra Operator installs Carrier into the target Pods (`<Y1>`, `<Y2>` and `<Y3>`) and dynamically configures them to send all unmatched traffic to the cloned infrastructure `<Y'>`. This setup ensures:

(diagram: ReplicationSet `<Y>`)
Of course, if there is a different replication factor or another deployment scenario (e.g., Pod only), the Gefyra Operator adapts accordingly. I hope the idea makes sense.
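For completeness, here is a sketch of how installing Carrier into a running Pod could look, exploiting the fact that a Pod's container image is mutable; the image tag and function shape are assumptions:

```python
# Sketch only: install Carrier by swapping the target container's image.
from kubernetes import client

CARRIER_IMAGE = "quay.io/gefyra/carrier:latest"  # assumption: real tag differs

def install_carrier(core: client.CoreV1Api, pod: str, namespace: str,
                    container: str) -> None:
    # A Pod's container image can be patched in place, so Carrier replaces
    # the original container without touching the owning Deployment.
    # Deleting the Pod later restores the original state (see the remark
    # on Gefyra's premises above).
    patch = {"spec": {"containers": [{"name": container,
                                      "image": CARRIER_IMAGE}]}}
    core.patch_namespaced_pod(name=pod, namespace=namespace, body=patch)
```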
Stage 2: Add a local upstream & redirect matching traffic
The Carrier component will require significant changes as we shift from a `stream`-based proxy to a more advanced proxy ruleset, incorporating path and header matching for HTTP, along with routing rules for other protocols in the future. Fortunately, the required changes in the Gefyra Operator are not as extensive as those in Carrier: several interfaces already support creating different routes within the connection provider ("Stowaway") and bridge provider abstractions.
Interface reference for connection providers (Stowaway):
https://github.com/gefyrahq/gefyra/blob/9fcbf7ec167b5a8bf470f710d8c3f6444f9253be/operator/gefyra/connection/abstract.py#L64-L103
Interface reference for bridge providers (Carrier, Carrier2):
https://github.com/gefyrahq/gefyra/blob/9fcbf7ec167b5a8bf470f710d8c3f6444f9253be/operator/gefyra/bridge/abstract.py#L40-L61
Rules
The `GefyraBridge` CRD already supports an arbitrary set of additional configuration parameters for the bridge provider: https://github.com/gefyrahq/gefyra/blob/9fcbf7ec167b5a8bf470f710d8c3f6444f9253be/operator/gefyra/resources/crds.py#L17-L25

For HTTP traffic, the routing parameters appear to be quite obvious (see the sketch after this list):
- path matching (e.g., `/api/objects/5`)
- header matching (e.g., `owner: john`)
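A hypothetical shape for these parameters on a GefyraBridge object could be the following; the key names are assumptions for illustration, not a finalized schema:

```python
# Hypothetical bridge provider parameters for one GefyraBridge (user bridge).
# Key names are illustrative; the CRD accepts arbitrary provider parameters.
bridge_provider_parameters = {
    "rules": [
        {
            "match": {
                "pathPrefix": "/api/objects",  # path matching
                "headers": {"owner": "john"},  # header matching
            },
            # Traffic not matching any rule keeps flowing to the cloned
            # in-cluster service <Y'>.
        }
    ],
}
```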
Each user bridge adds a new entry to the upstream servers for Carrier, along with an additional (verified) matching rule. The operator's validating webhook should implement matching-rule validation to catch common mistakes (e.g., a rule already applied by another user, or a rule that never fires because another bridge captures all traffic). If a matching rule is invalid, the creation of the GefyraBridge is rejected immediately.
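A minimal sketch of that webhook check, assuming the rule shape from the example above and covering only the two mistakes mentioned:

```python
def validate_new_rule(new_rule: dict, existing_rules: list[dict]) -> None:
    """Reject rules that duplicate another user's rule or that can never
    fire because an earlier bridge already captures all traffic."""
    for rule in existing_rules:
        if rule.get("match") == new_rule.get("match"):
            raise ValueError("rule already applied by another user bridge")
        if not rule.get("match"):
            # A rule without constraints is a catch-all: every rule added
            # after it would never fire.
            raise ValueError("an existing bridge captures all traffic; "
                             "this rule would never fire")
```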
Stage 3: Remove a user bridge
Removing a bridge is a two-phase process: 1) The deletion request from the Gefyra Client prompts the Operator to initiate the bridge removal. 2) Both the bridge provider and the connection provider are called upon to delete their respective routing configurations.
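A minimal sketch of this flow, with placeholder method names standing in for the abstract provider interfaces linked in Stage 2:

```python
from typing import Protocol

class BridgeProvider(Protocol):
    def remove_proxy_route(self, bridge_name: str) -> None: ...

class ConnectionProvider(Protocol):
    def remove_destination(self, bridge_name: str) -> None: ...

def remove_user_bridge(bridge_name: str,
                       bridge_provider: BridgeProvider,
                       connection_provider: ConnectionProvider) -> None:
    # Phase 1 happens before this call: the Gefyra Client's deletion
    # request prompts the Operator to initiate the removal.
    # Phase 2: both providers delete their routing configuration.
    bridge_provider.remove_proxy_route(bridge_name)
    connection_provider.remove_destination(bridge_name)
```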
Remove the last user bridge & clean up
If Stage 3 removes the last active bridge for a Pod, the uninstallation procedure is triggered. This process includes resetting the patched Pods (`<Y1>`, `<Y2>` and `<Y3>`) to their original configuration and removing the cloned infrastructure (Pods `<Y1'>`, `<Y2'>`, `<Y3'>` and service `<Y'>`).

Closing remarks
Gefyra users will still be able to run `gefyra bridge ... --global` to enable the global bridge with its current behavior.

This feature is currently in the ideation phase. I would appreciate any external feedback on how to make this as useful and robust as possible. If you want to talk to me about this (or Gefyra in general), please find me on our Discord server: https://discord.gg/Gb2MSRpChJ
I am also looking for a co-sponsor of this feature. If you or your team want to support this development, please contact me.