gefyrahq / gefyra

Blazingly-fast :rocket:, rock-solid, local application development :arrow_right: with Kubernetes.
https://gefyra.dev
Apache License 2.0

[Feature] User-specific routable Gefyra bridge ("user bridge") #733

Open · Schille opened this issue 3 weeks ago

Schille commented 3 weeks ago

Intro

Gefyra currently supports a "global bridge" only. See https://gefyra.dev/docs/run_vs_bridge/#bridge-operation to learn more. In short, a container within a running Pod (or multiple replicas) is replaced by a Gefyra component called Carrier. This allows Gefyra, with some constraints, to route traffic that was originally targeted at the specified Pod within the cluster to a local container.

[Diagram: gefyra-bridge-action.drawio]

This capability helps debug local containers using real traffic from the cluster rather than synthetic local traffic. However, the bridge currently affects all traffic directed to the bridged Pod, which is sometimes undesirable: only one bridge can exist per Pod, so only one user can bridge a Pod at a time. With this feature proposal, we aim to lift that limitation in a flexible yet robust way.

This feature addresses the following issues:

Remark: Remember that one of Gefyra's fundamental premises is not to interfere with the Kubernetes objects of your workloads. The proposed feature draft does not involve modifying existing deployment artifacts. Why? If something goes wrong (as things often do), we want Gefyra users to be able to restore the original state simply by deleting Pods. Residual objects or other additions must never disrupt the operation of the development cluster; Gefyra treats any such disruption as a bug and aims to minimize that risk.

What is the new feature about?

Gefyra's bridge operation will support specific routing configurations to intercept only matching traffic, allowing all unmatched traffic to be served from within the cluster. Multiple users will be able to intercept different traffic simultaneously, receiving it on their local container (started by gefyra run ...) to serve it with local code.

[Diagram: gefyra-personal-bridge.drawio]

Departure

The main components involved in establishing a Gefyra bridge are the Gefyra Operator, the connection provider ("Stowaway") and the bridge provider ("Carrier").

Remark: Gefyra's cluster component architecture consists of different interfaces. The connection provider and bridge provider are two abstract concepts with defined interfaces. "Stowaway" and "Carrier" are the current concrete implementations of these interfaces. However, depending on the results of this implementation, I expect at least the latter to be replaced by a new component (perhaps Carrier2?). For consistency, I will continue to use these component names.

Overview

[Diagram: gefyra-personal-bridge1.drawio]

Carrier

[Diagram: gefyra-personal-bridge2.drawio]

Currently, Carrier is installed into 1 to N Pods. Each instance upstreams any incoming traffic ("port x") to a single target endpoint ("upstream-1"). This process does not involve traffic introspection: IP packets come in and are sent out as-is. This setup is simple and fast. Carrier is based on the Nginx server and thus is configured using the stream directive: https://nginx.org/en/docs/stream/ngx_stream_core_module.html#stream
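To illustrate the current behavior, here is a minimal sketch of such a pass-through stream configuration, rendered from Python. The nginx directives are real, but the template helper and the concrete port/endpoint values are assumptions for illustration only.

```python
# Minimal sketch of the pass-through stream configuration Carrier applies today.
# The nginx directives are real (ngx_stream_core_module / ngx_stream_proxy_module);
# the templating helper and the example values are illustrative assumptions.
CARRIER_STREAM_TEMPLATE = """
stream {{
    upstream upstream-1 {{
        server {upstream};        # the single target endpoint ("upstream-1")
    }}
    server {{
        listen {port_x};          # "port x" of the bridged container
        proxy_pass upstream-1;    # packets are forwarded as-is, no introspection
    }}
}}
"""


def render_carrier_config(port_x: int, upstream: str) -> str:
    """Render a pass-through config for one bridged port."""
    return CARRIER_STREAM_TEMPLATE.format(port_x=port_x, upstream=upstream)


print(render_carrier_config(8080, "192.168.99.1:8080"))
```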

Feature draft

Stage 1: Installation & keep original Pods around to serve unmatched traffic

When a compatible service is bridged, we need the original workloads to keep serving any traffic that is not matched by a user bridge. Consider the following example: a compatible workload <Y> is selected by a Kubernetes service object. This workload consists of 3 Pods.

[Diagram: gefyra-personal-bridge4.drawio]

Once a user bridge is requested, Gefyra's Operator replicates all essential components (most importantly, the Pods and the service) by cloning and modifying them. Pod <Y1'> is modified on the fly so that it is selected by service <Y'>. The Pods <Y1'>, <Y2'> and <Y3'> must not be selected by service <Y>. Most other parameters - such as mounts, ports, probes, etc. - should remain unchanged.

[Diagram: gefyra-personal-bridge5.drawio]

The cloned workload infrastructure remains active as long as at least one Gefyra user bridge is active.
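To make the cloning step more concrete, here is a rough sketch using the Kubernetes Python client. The clone name suffix, the assumed `app` selector label and the `gefyra.dev/shadow-of` label are purely illustrative, not the actual implementation.

```python
# Hedged sketch: clone Pod <Y1> into <Y1'> so that it is selected by the shadow
# service <Y'> but no longer by the original service <Y>. The name suffix, the
# assumed "app" selector label and the "gefyra.dev/shadow-of" label are illustrative.
import copy

from kubernetes import client, config


def clone_pod_for_shadow_service(namespace: str, pod_name: str, shadow_id: str) -> client.V1Pod:
    core = client.CoreV1Api()
    original = core.read_namespaced_pod(name=pod_name, namespace=namespace)

    clone = copy.deepcopy(original)
    # Fresh metadata: drop server-populated fields (uid, resourceVersion, ...).
    clone.metadata = client.V1ObjectMeta(
        name=f"{pod_name}-gefyra",
        namespace=namespace,
        labels=dict(original.metadata.labels or {}),
    )
    clone.status = None

    # Swap the selector-relevant label: service <Y> must not select the clone,
    # while the shadow service <Y'> will select it via the new label.
    clone.metadata.labels.pop("app", None)
    clone.metadata.labels["gefyra.dev/shadow-of"] = shadow_id

    # Mounts, ports, probes, etc. remain unchanged, as described above.
    return core.create_namespaced_pod(namespace=namespace, body=clone)


if __name__ == "__main__":
    config.load_kube_config()
    clone_pod_for_shadow_service("default", "y-1", shadow_id="y")
```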

The Gefyra Operator installs Carrier into the target Pods (<Y1>, <Y2> and <Y3>) and dynamically configures them to send all unmatched traffic to the cloned infrastructure <Y'>. This ensures that traffic not claimed by any user bridge is still served from within the cluster by the original application code, now running in the cloned Pods.

[Diagram: gefyra-personal-bridge7.drawio]

Of course, if there is a different replication factor or another deployment scenario (e.g., a standalone Pod), the Gefyra Operator adapts accordingly. I hope the idea makes sense.
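A matching sketch for the cloned service <Y'>, which also yields the in-cluster address Carrier could use as the default upstream for unmatched traffic (again with illustrative names):

```python
# Hedged sketch: create the shadow service <Y'> that selects only the cloned Pods
# and derive the in-cluster address Carrier could use as the default upstream for
# all unmatched traffic. The "-gefyra" suffix and the selector label are assumptions.
from kubernetes import client, config


def create_shadow_service(namespace: str, service_name: str, shadow_id: str) -> str:
    core = client.CoreV1Api()
    original = core.read_namespaced_service(name=service_name, namespace=namespace)

    shadow = client.V1Service(
        metadata=client.V1ObjectMeta(name=f"{service_name}-gefyra", namespace=namespace),
        spec=client.V1ServiceSpec(
            selector={"gefyra.dev/shadow-of": shadow_id},  # only <Y1'>, <Y2'>, <Y3'>
            ports=original.spec.ports,                     # keep the ports of <Y>
        ),
    )
    core.create_namespaced_service(namespace=namespace, body=shadow)

    # Default upstream for unmatched traffic, e.g. "y-gefyra.default.svc.cluster.local:80"
    port = original.spec.ports[0].port
    return f"{shadow.metadata.name}.{namespace}.svc.cluster.local:{port}"


if __name__ == "__main__":
    config.load_kube_config()
    print(create_shadow_service("default", "y", shadow_id="y"))
```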

Stage 2: Add a local upstream & redirect matching traffic

The Carrier component will require significant changes as we shift from a “stream”-based proxy to a more advanced proxy ruleset, incorporating path and header matching for HTTP, along with routing rules for other protocols in the future. Fortunately, the required changes in the Gefyra Operator are not as extensive as those in Carrier. Several interfaces already support creating different routes within the connection provider ("Stowaway") and bridge provider abstractions.

[Diagram: gefyra-personal-bridge8.drawio]
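As a hypothetical sketch of the per-request decision the new Carrier would have to make (the rule shape and field names are assumptions, not the final interface):

```python
# Hypothetical per-request routing decision for the new Carrier (Carrier2?):
# forward a request to a user's local upstream if their rule matches, otherwise
# fall back to the shadow service. Rule shape and field names are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class BridgeRule:
    upstream: str                                   # the user's local container
    path_prefix: Optional[str] = None               # e.g. "/api/experimental"
    headers: Dict[str, str] = field(default_factory=dict)  # e.g. {"x-gefyra-user": "alice"}

    def matches(self, path: str, headers: Dict[str, str]) -> bool:
        if self.path_prefix and not path.startswith(self.path_prefix):
            return False
        return all(headers.get(k) == v for k, v in self.headers.items())


def select_upstream(path: str, headers: Dict[str, str],
                    rules: List[BridgeRule], default_upstream: str) -> str:
    """The first matching user bridge wins; unmatched traffic stays in the cluster."""
    for rule in rules:
        if rule.matches(path, headers):
            return rule.upstream
    return default_upstream


# Two users bridging the same Pod with non-overlapping rules:
rules = [
    BridgeRule(upstream="10.244.1.10:8080", headers={"x-gefyra-user": "alice"}),
    BridgeRule(upstream="10.244.2.20:8080", path_prefix="/api/experimental"),
]
assert select_upstream("/", {"x-gefyra-user": "alice"}, rules, "y-gefyra:80") == "10.244.1.10:8080"
assert select_upstream("/", {}, rules, "y-gefyra:80") == "y-gefyra:80"
```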

Interface reference for connection providers (Stowaway):

https://github.com/gefyrahq/gefyra/blob/9fcbf7ec167b5a8bf470f710d8c3f6444f9253be/operator/gefyra/connection/abstract.py#L64-L103

Interface reference for bridge providers (Carrier, Carrier2):

https://github.com/gefyrahq/gefyra/blob/9fcbf7ec167b5a8bf470f710d8c3f6444f9253be/operator/gefyra/bridge/abstract.py#L40-L61

Rules

The GefyraBridge CRD already supports an arbitrary set of additional configuration parameters for the bridge provider. https://github.com/gefyrahq/gefyra/blob/9fcbf7ec167b5a8bf470f710d8c3f6444f9253be/operator/gefyra/resources/crds.py#L17-L25

For HTTP traffic, the routing parameters appear to be quite obvious: path- and header-based matching rules, as introduced in Stage 2.
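A purely illustrative shape of such parameters (the field names are assumptions, not the actual GefyraBridge schema):

```python
# Purely illustrative shape of the additional bridge-provider parameters a user
# bridge could carry; the field names ("rules", "pathPrefix", "headers") are
# assumptions, not the actual GefyraBridge schema.
provider_parameters = {
    "rules": [
        {
            "pathPrefix": "/api/orders",              # intercept only this path
            "headers": {"x-gefyra-user": "alice"},    # and only Alice's requests
        }
    ],
}
```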

Each user bridge adds a new entry to the upstream servers for Carrier, along with an additional (verified) matching rule. The operator's validating webhook should implement matching rule validation to catch common mistakes (e.g., a rule already applied by another user or a rule that never fires due to another bridge capturing all traffic). If a matching rule is invalid, the creation of the GefyraBridge is halted immediately.
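A sketch of the conflict checks such a webhook could run, following the hypothetical rule shape from above:

```python
# Hypothetical conflict checks the operator's validating webhook could run before
# admitting a new GefyraBridge. The rule shape mirrors the sketch above and is
# not the actual CRD schema.
from typing import List


def validate_new_rule(new_rule: dict, existing_rules: List[dict]) -> None:
    for rule in existing_rules:
        # A rule already applied by another user must be rejected.
        if rule == new_rule:
            raise ValueError("this matching rule is already in use by another bridge")
        # A new rule can never fire if an existing bridge already captures all traffic.
        if not rule.get("pathPrefix") and not rule.get("headers"):
            raise ValueError("an existing bridge already captures all traffic")


# Example: adding a header-based rule next to an existing path-based one is fine.
existing = [{"pathPrefix": "/api/experimental", "headers": None}]
validate_new_rule({"pathPrefix": None, "headers": {"x-gefyra-user": "alice"}}, existing)
```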

Stage 3: Remove a user bridge

Removing a bridge is a two-phase process: 1) The deletion request from the Gefyra Client prompts the Operator to initiate the bridge removal. 2) Both the bridge provider and the connection provider are called upon to delete their respective routing configurations.

Remove the last user bridge & clean up

If Stage 3 removes the last active bridge for a Pod, the uninstallation procedure is triggered. This process includes resetting the patched Pods (<Y1>, <Y2> and <Y3>) to their original configuration and removing the cloned infrastructure (Pod <Y1'>, <Y2'>, <Y3'> and service <Y'>).
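A hedged sketch of how this teardown could be orchestrated; all method names are placeholders, not the actual provider interfaces linked above:

```python
# Hedged sketch of the removal flow: the deletion request triggers the operator,
# which asks the bridge provider and the connection provider to drop their routing
# configuration; if it was the last bridge for the Pod, the patched Pods are reset
# and the cloned infrastructure is removed. All method names are placeholders.
def remove_user_bridge(bridge, bridge_provider, connection_provider, cluster_state) -> None:
    # Phase 2a: the bridge provider removes the user's matching rule and upstream.
    bridge_provider.remove_user_route(bridge)          # placeholder name
    # Phase 2b: the connection provider drops the route to the local container.
    connection_provider.remove_destination(bridge)     # placeholder name

    if not cluster_state.active_bridges_for(bridge.target_pod):
        # Last bridge gone: reset <Y1>..<Y3> to their original configuration and
        # delete the cloned Pods <Y1'>..<Y3'> and service <Y'>.
        bridge_provider.uninstall(bridge.target_pod)   # placeholder name
        cluster_state.delete_shadow_infrastructure(bridge.target_pod)
```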

Closing remarks

This feature is currently in the ideation phase. I would appreciate any external feedback on how to make this as useful and robust as possible. If you want to talk to me about this (or Gefyra in general), please find me on our Discord server: https://discord.gg/Gb2MSRpChJ

I am also looking for a co-sponsor of this feature. If you or your team want to support this development, please contact me.

liquidiert commented 3 weeks ago

First off: great RFC, @Schille! Just one quick question: what happens to the shadow infrastructure when the original changes while a bridge is active? I'm sure there's already handling for this case with the global bridge, but what is the appropriate procedure here?

Schille commented 3 weeks ago

@liquidiert Gotcha! A rollout of the original workloads would render the bridge useless, since Gefyra's patch would be reset. The operator should reconcile all bridges, detect that situation, and take appropriate action (patching again and setting the user bridges up to work again), or declare the existing user bridges stale and remove them.

liquidiert commented 3 weeks ago

@Schille that sounds like a good reconciliation tactic, thanks!

crkurz commented 2 weeks ago

This looks terrific, @Schille! Thanks a lot!

Please allow me to add some questions:

  1. Do we need to call out that multiple users can bridge multiple services?
  2. Nit: Terminology: does it make sense to change "and the removal of the phantom infrastructure..." to "and the removal of the cloned infrastructure"? (just to avoid an extra name)
  3. Are there any limitations which apply to infra cloning? Things a pod/service configuration must or must not have, e.g. node or other affinity, or special session handling/routing? Should I try to get Anton's/Rohit's thoughts here, e.g. around special handling for WebSockets with their need for cross-user session handling?
  4. Are there chances of any impact on the validity of server certificates due to the traffic redirection?
  5. How long do we expect the setup (or tear-down) of the cloned infra to take, and for how much of this time do we expect the regular service to be non-responsive? In case this takes a bit more time, do we need an option to preserve the phantom infra even after removal of the last bridge? Or even an option to explicitly install the phantom infra independent of bridge setup?

Again, great feature! Thank you, @Schille

Schille commented 2 weeks ago

@crkurz Thank you.

To your questions:

  1. You should already be able to bridge multiple services simultaneously. If that's unclear, we must add that bit to the docs.
  2. You are right. I changed it.
  3. I don't see more limitations than mentioned. Since we'll clone the pods with all attributes (except for the selector-relevant labels), I don't expect affinity issues. But the more people who join the party, the better: I would welcome it if you took up Anton's/Rohit's thoughts on this.
  4. Yes, that's not 100% clear as of now. We must find a solution to tell Carrier which certificates to use to introspect SSL traffic and decide on the route.
  5. That depends. Small apps - short setup time. Java - huge setup time. =) I thought about that too, and I am tempted to agree to a concept that represents the bare installation of a bridge without actually having a single user to match traffic.