envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0

Modifying EnvoyProxy podLabels causes reconciliation errors #1818

Closed (wondersd closed this issue 3 weeks ago)

wondersd commented 1 year ago

Description: This PR added support for customizing labels on Envoy Proxy pods. However, it went a step further and used those user-specified labels as the label selector for the Deployment object. This is problematic because Deployment.spec.selector is immutable, so changing it requires deleting the Deployment object so it can be recreated with the new selector. Labels on pods may serve many mundane purposes such as ownership tracking, cost tracking, log metadata, and version/release metadata; they can be rather fluid, and I don't think modifying them should cause reconciler errors or require manual intervention.

https://github.com/envoyproxy/gateway/blob/main/internal/infrastructure/kubernetes/proxy/resource_provider.go#L178

I believe this line should go back to `selector := resource.GetSelector(labels)`
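
To illustrate the requested behavior, here is a minimal standalone Go sketch (not the actual resource_provider.go code; `stableSelectorLabels` and `buildDeployment` are hypothetical helpers, and the label keys are taken from the error log below). The Deployment selector is built only from controller-owned labels, while user-supplied pod labels are merged into the pod template only:

    package main

    import (
        "fmt"

        appsv1 "k8s.io/api/apps/v1"
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // stableSelectorLabels returns labels owned by the controller that never change
    // for a given Gateway (hypothetical helper, mirroring the labels in the error log).
    func stableSelectorLabels(gatewayNS, gatewayName string) map[string]string {
        return map[string]string{
            "app.kubernetes.io/name":       "envoy",
            "app.kubernetes.io/component":  "proxy",
            "app.kubernetes.io/managed-by": "envoy-gateway",
            "gateway.envoyproxy.io/owning-gateway-namespace": gatewayNS,
            "gateway.envoyproxy.io/owning-gateway-name":      gatewayName,
        }
    }

    // buildDeployment merges user-supplied pod labels into the pod template only;
    // the selector stays the same across reconciles even when userPodLabels change.
    func buildDeployment(gatewayNS, gatewayName string, userPodLabels map[string]string) *appsv1.Deployment {
        selector := stableSelectorLabels(gatewayNS, gatewayName)

        podLabels := map[string]string{}
        for k, v := range userPodLabels {
            podLabels[k] = v
        }
        // Controller-owned labels take precedence so the selector always matches the template.
        for k, v := range selector {
            podLabels[k] = v
        }

        return &appsv1.Deployment{
            Spec: appsv1.DeploymentSpec{
                Selector: &metav1.LabelSelector{MatchLabels: selector},
                Template: corev1.PodTemplateSpec{
                    ObjectMeta: metav1.ObjectMeta{Labels: podLabels},
                },
            },
        }
    }

    func main() {
        d := buildDeployment("default", "eg", map[string]string{"app.kubernetes.io/instance": "foobar"})
        fmt.Println(d.Spec.Selector.MatchLabels) // stable; never includes the user label
        fmt.Println(d.Spec.Template.Labels)      // includes "app.kubernetes.io/instance": "foobar"
    }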

Repro steps:

  1. Set up the GatewayClass, EnvoyProxy, and Gateway resources as normal, with no custom configuration
  2. Update EnvoyProxy .spec.provider.kubernetes.pod.labels."app.kubernetes.io/instance" = "foobar"

Environment: eg version 0.5.0

Logs:

failed to create or update deployment envoy-gateway/envoy-...-72616262: for Update: Deployment.apps \"envoy-...-ga-72616262\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app.kubernetes.io/component\":\"proxy\", \"app.kubernetes.io/instance\":\"envoy\", \"app.kubernetes.io/managed-by\":\"envoy-gateway\", \"app.kubernetes.io/name\":\"envoy\", \"gateway.envoyproxy.io/owning-gateway-name\":\"...\", \"gateway.envoyproxy.io/owning-gateway-namespace\":\"...\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable", "errorVerbose": "Deployment.apps \"envoy-...-ga-72616262\" is invalid: spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"app.kubernetes.io/component\":\"proxy\", \"app.kubernetes.io/instance\":\"envoy\", \"app.kubernetes.io/managed-by\":\"envoy-gateway\", \"app.kubernetes.io/name\":\"envoy\", \"gateway.envoyproxy.io/owning-gateway-name\":\"...\", \"gateway.envoyproxy.io/owning-gateway-namespace\":\"...\"
shawnh2 commented 1 year ago

The initial thought was to allow users to customize labels for the envoy-proxy pod.

The label field is indeed very fluid, but it's preferable for it to be settled at startup rather than changed often at runtime.

cc @arkodg

arkodg commented 1 year ago

Is the fix here to recreate the Envoy Proxy deployment rather than update it?

shawnh2 commented 1 year ago

> Is the fix here to recreate the Envoy Proxy deployment rather than update it?

Sure. Kubernetes generally discourages label selector updates and suggests planning your selectors up front.

If one insists on changing it, the better practice would be to use helm upgrade instead.

arkodg commented 1 year ago

This feels similar to https://github.com/envoyproxy/gateway/pull/1840, where we will need to enhance the docs to highlight that edits to specific fields don't cause an update, but a delete + create instead.

In case the above option causes unintended consequences (if users aren't reading the docs), we can define EnvoyProxy as a template spec rather than a real-time spec, i.e. it is only used to define an infrastructure, not update it. I fear this will introduce another set of operational problems if we go with that approach.

wondersd commented 1 year ago

From a user perspective, it would be best if pod labels only modified the pod labels and not the selector.

While it is possible to delete/create the deployment (sometimes even with cascade=orphan to lessen the impact, as sketched after this comment), this makes modifying labels an overly burdensome process with many caveats.

I would prefer to see the selector reverted, with a one-time manual action to delete existing deployments (for those of us who have used this so far, myself included), at which point any modifications to pod labels are purely for the pods and not the selector.

If modifying the selector is desired from a user perspective (and I have a hard time coming up with a good reason to actually want that), there should be a separate field for it, so as not to couple modifying selector labels (and their caveats) with pod labels.
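
For reference, a minimal standalone sketch (not gateway code; the deployment name and namespace are hypothetical) of the delete-with-cascade=orphan step mentioned above, using the controller-runtime client. The Deployment object is removed while its ReplicaSets and Pods are orphaned rather than deleted, so the existing proxy pods keep running while the Deployment is recreated with the new selector:

    package main

    import (
        "context"

        appsv1 "k8s.io/api/apps/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/client/config"
    )

    func main() {
        c, err := client.New(config.GetConfigOrDie(), client.Options{})
        if err != nil {
            panic(err)
        }

        // Hypothetical Deployment name/namespace, for illustration only.
        dep := &appsv1.Deployment{
            ObjectMeta: metav1.ObjectMeta{
                Name:      "envoy-example",
                Namespace: "envoy-gateway",
            },
        }

        // Equivalent of `kubectl delete deployment envoy-example --cascade=orphan`:
        // the Deployment is deleted, but its ReplicaSets and Pods are left behind.
        if err := c.Delete(context.Background(), dep,
            client.PropagationPolicy(metav1.DeletePropagationOrphan)); err != nil {
            panic(err)
        }
    }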

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

sanposhiho commented 2 months ago

/assign

It's closely related to #3666.

@arkodg

> From a user perspective, it would be best if pod labels only modified the pod labels and not the selector.

I agree with the last comment from @wondersd. Given that implementing a safe recreation of deployments would be troublesome, it's better to only modify pod labels and leave the selector alone (unless we have a strong reason to modify the selector too).

sanposhiho commented 2 months ago

One downside is that it would be a breaking change for existing users: existing deployments on their clusters already have a customized selector. If we simply stopped deriving the deployment's selector from custom labels, reconciliation for such users could fail when updating deployments after upgrading Envoy Gateway, because the controller would try to change the selector away from the custom one. But I believe we can easily work around this issue.

sanposhiho commented 2 months ago

@arkodg If you give me the go-ahead, I can create a PR for a proposed solution.

arkodg commented 2 months ago

Hey @sanposhiho, thanks for picking this one up. wdyt about the proposed solution, @envoyproxy/gateway-maintainers?

arkodg commented 1 month ago

Hey @sanposhiho, thanks for digging into this one. +1 to your proposed solution of not using the user-defined labels, which can change often, for the pod selectors. As you've highlighted, this will be an issue in version upgrades; how do we mitigate this?

sanposhiho commented 1 month ago

The simplest workaround would be:

// https://github.com/envoyproxy/gateway/blob/f4c53f4899b543a6461e7bf666a9bfa3b476e6b4/internal/infrastructure/kubernetes/infra_resource.go#L119
// ... change the selector not to be generated from the custom labels.

    // Fetch the existing Deployment (if any) so its selector can be compared.
    old := &appsv1.Deployment{}
    err = i.Client.Get(ctx, types.NamespacedName{Name: deployment.Name, Namespace: deployment.Namespace}, old)
    if err != nil {
        if apierrors.IsNotFound(err) {
            // It's the deployment creation.
            return i.Client.ServerSideApply(ctx, deployment)
        }
        return err
    }

    if !equality.Semantic.DeepEqual(old.Spec.Selector, deployment.Spec.Selector) {
        // Note: a Deployment created by the old gateway controller may have a different selector generated based on the custom label feature,
        // which caused the issue that the gateway controller cannot update the Deployment when users change the custom labels.
        // Therefore, we changed the gateway to always use the same selector, independent of the custom labels -
        // https://github.com/envoyproxy/gateway/issues/1818
        //
        // But the change could break an existing Deployment with custom labels created by the old gateway controller
        // because the selector would be different.
        //
        // Here, as a workaround, we always copy the selector from the old Deployment to the new Deployment
        // so that the update can always be applied successfully.
        deployment.Spec.Selector = old.Spec.Selector

        // We have to check whether the overwritten selector is still valid.
        selector, err := metav1.LabelSelectorAsSelector(deployment.Spec.Selector)
        if err != nil {
            return err
        }

        if !selector.Matches(labels.Set(deployment.Spec.Template.Labels)) {
            // If the selector no longer matches the labels of the pod template, return an error.
            // It could happen, for example, when users change the custom label from {"foo": "bar"} to {"foo": "barv2"},
            // because the pod's labels have {"foo": "barv2"} while the selector keeps {"foo": "bar"}.
            // We cannot help this case, and just error it out.
            // Users should recreate the envoy proxy with the new custom label, instead of upgrading it.
            // Once they recreate the envoy proxy, the envoy gateway of this version doesn't generate the selector based on the custom label,
            // and the issue won't happen again, even if they change the custom label again.
            return fmt.Errorf(`An illegal change in a custom label of envoy proxy is detected.
The deployment %s has a custom label that is not allowed to be changed.
All the custom labels of an envoy proxy that was initialized with the envoy gateway of v1.0.x or earlier are immutable.
Please recreate an envoy proxy with a new custom label if you need to change the custom label.
This issue won't happen with an envoy proxy resource initialized by the envoy gateway of v1.0.y or later.`, deployment.Name)
        }
    }

    return i.Client.ServerSideApply(ctx, deployment)
}

Then all new deployments will get a selector independent of the custom label configuration, while all existing deployments will keep their current selector, which may or may not have been created based on the custom label configuration. So the version upgrade won't break reconciliation for existing users with custom labels.

But it also means we only help future users; we don't help existing users who already have a deployment created with a custom label and want to mutate that label. They should recreate the EnvoyProxy themselves to recreate the underlying deployment, as the error in the pseudo-implementation above says.
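
To make that failure mode concrete, here is a small standalone sketch (not part of the gateway codebase; the label values "bar" and "barv2" come from the comment in the pseudo-implementation above) showing why the copied selector no longer matches once the custom label has been mutated:

    package main

    import (
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/labels"
    )

    func main() {
        // Selector copied from the existing Deployment, created when the custom label was "bar".
        oldSelector := &metav1.LabelSelector{MatchLabels: map[string]string{"foo": "bar"}}

        // Pod template labels after the user changed the custom label to "barv2".
        newPodLabels := labels.Set{"foo": "barv2"}

        sel, err := metav1.LabelSelectorAsSelector(oldSelector)
        if err != nil {
            panic(err)
        }

        // Prints "match: false": the update has to be rejected, and the EnvoyProxy
        // (and hence the Deployment) must be recreated with the new label instead.
        fmt.Println("match:", sel.Matches(newPodLabels))
    }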

arkodg commented 1 month ago

This sounds reasonable, thanks for handling this case!

sanposhiho commented 1 month ago

There we go :) https://github.com/envoyproxy/gateway/pull/3995

wondersd commented 3 weeks ago

@sanposhiho @arkodg Thanks!