envoyproxy / gateway

Manages Envoy Proxy as a Standalone or Kubernetes-based Application Gateway
https://gateway.envoyproxy.io
Apache License 2.0
1.54k stars 328 forks source link

Upgrade: disable xDS translation during the upgrade process to avoid client request failures #3992

Open zhaohuabing opened 1 month ago

zhaohuabing commented 1 month ago

Description: The upgrade of EG may cause clients to experience request failures or a temporary loss of connectivity to the Envoy

For example, while upgrading from v1.0.2 to v1.1.0 following steps in the upgrade guide, the EG fails to reconcile the BackendTLSPolicy CRs as the CRD is deleted and recreated during the upgrade process.

Before the upgrade, a BackendTLSPolicy CRD is created to configure the TLS settings for the Backend service, following the steps in the Backend TLS: Gateway to Backend task.

Using egctl to get the generanted xDS cluster, you can see the TLS configuration for the Backend Cluster:

       {
          "cluster": {
             ... omitted for brevity
            "name": "httproute/default/backend/rule/0",
            "outlierDetection": {},
            "perConnectionBufferLimitBytes": 32768,
            "transportSocketMatches": [
              {
                "match": {
                  "name": "httproute/default/backend/rule/0/tls/0"
                },
                "name": "httproute/default/backend/rule/0/tls/0",
                "transportSocket": {
                  "name": "envoy.transport_sockets.tls",
                  "typedConfig": {
                    "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
                    "commonTlsContext": {
                      "combinedValidationContext": {
                        "defaultValidationContext": {
                          "matchTypedSubjectAltNames": [
                            {
                              "matcher": {
                                "exact": "www.example.com"
                              },
                              "sanType": "DNS"
                            }
                          ]
                        },
                        "validationContextSdsSecretConfig": {
                          "name": "enable-backend-tls/default-ca",
                          "sdsConfig": {
                            "ads": {},
                            "resourceApiVersion": "V3"
                          }
                        }
                      }
                    },
                    "sni": "www.example.com"
                  }
                }
              }
            ],
            "type": "EDS"
          },
        }

Delete the BackendTLSPolicy CRD, as the upgrade guide suggests:

kubectl delete crd backendtlspolicies.gateway.networking.k8s.io

EG reports the following error message in the logs:

E0802 09:30:12.542305       1 reflector.go:147] pkg/mod/k8s.io/client-go@v0.29.2/tools/cache/reflector.go:229: Failed to watch *v1alpha2.BackendTLSPolicy: failed to list *v1alpha2.BackendTLSPolicy: the server could not find the requested resource (get backendtlspolicies.gateway.networking.k8s.io)

The generated xDS does not contain the TLS configuration for the Backend Cluster, and the client requests fails.

Failed request example, as the error message suggests, the envoy sends an HTTP request to the backend service, which is an HTTPS server:

 curl -v -HHost:www.example.com --resolve "www.example.com:80:172.18.255.200" \
http://www.example.com:80/get
* Added www.example.com:80:172.18.255.200 to DNS cache
* Hostname www.example.com was found in DNS cache
*   Trying 172.18.255.200:80...
* Connected to www.example.com (172.18.255.200) port 80
> GET /get HTTP/1.1
> Host:www.example.com
> User-Agent: curl/8.8.0-DEV
> Accept: */*
> 
* Request completely sent off
< HTTP/1.1 400 Bad Request
< date: Fri, 02 Aug 2024 09:48:47 GMT
< transfer-encoding: chunked
< 
Client sent an HTTP request to an HTTPS server.

The xDS cluster does not contain the TLS configuration for the Backend Cluster:

        {
          "cluster": {
            "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
            "circuitBreakers": {
              "thresholds": [
                {
                  "maxRetries": 1024
                }
              ]
            },
            "commonLbConfig": {
              "localityWeightedLbConfig": {}
            },
            "connectTimeout": "10s",
            "dnsLookupFamily": "V4_ONLY",
            "edsClusterConfig": {
              "edsConfig": {
                "ads": {},
                "resourceApiVersion": "V3"
              },
              "serviceName": "httproute/default/backend/rule/0"
            },
            "lbPolicy": "LEAST_REQUEST",
            "name": "httproute/default/backend/rule/0",
            "outlierDetection": {},
            "perConnectionBufferLimitBytes": 32768,
            "type": "EDS"
          },
          "lastUpdated": "2024-08-02T08:50:06.487Z",
          "versionInfo": "3519b134757f0bea49bbf0397de781693ab58f27b315c5ca77d642ee24a03b26"
        }

Similarly, the client request will also fail if any of the Gateway CRDs or EG CRDs have breaking changes and need to be deleted and recreated during the upgrade process. If any of the CRs need to be recreated/modified due to breaking changes, the same issue will occur as well.

Disable xDS translation during the upgrade process

One way to avoid this issue is to disable the xDS translation during the upgrade process. This way, the EG will not generate the xDS configuration with the broken CRs or any temporary middle state during the upgrade process, and the Envoy will continue to use the existing xDS configuration generated with the correct version of CRDs before the upgrade. Once the upgrade is complete, the xDS translation can be enabled, and the EG will generate the xDS configuration with the updated CRDs

Alternative:

Disable xDS server while upgrading.

[optional Relevant Links:]

Any extra documentation required to understand the issue.

cc @envoyproxy/gateway-maintainers

arkodg commented 1 month ago

thanks for raising this issue @zhaohuabing

the issue here is specific to the fact that the below operations are not atomic

kubectl delete crd backendtlspolicies.gateway.networking.k8s.io # delete v1alpha2 BackendTLSPolicy
kubectl apply -f ./gateway-helm/crds/gatewayapi-crds.yaml # install v1alpha3 BackendTLSPolicy

which may cause downtime for a short duration.

rather than adding complexity into control plane for this, I'd prefer if we able to instrument envoy to use stale xds contents (disable connecting to the xds server) for the short duration while the CRDs and control planes were upgraded

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days.