Open zhaohuabing opened 1 month ago
thanks for raising this issue @zhaohuabing
the issue here is specific to the fact that the below operations are not atomic
kubectl delete crd backendtlspolicies.gateway.networking.k8s.io # delete v1alpha2 BackendTLSPolicy
kubectl apply -f ./gateway-helm/crds/gatewayapi-crds.yaml # install v1alpha3 BackendTLSPolicy
which may cause downtime for a short duration.
rather than adding complexity to the control plane for this, I'd prefer that we instrument Envoy to keep using its stale xDS contents (by disabling its connection to the xDS server) for the short duration while the CRDs and control plane are being upgraded
Description: Upgrading EG may cause clients to experience request failures or a temporary loss of connectivity to Envoy.
For example, while upgrading from v1.0.2 to v1.1.0 following the steps in the upgrade guide, EG fails to reconcile the BackendTLSPolicy CRs because the CRD is deleted and recreated during the upgrade process.
Before the upgrade, a BackendTLSPolicy CR is created to configure the TLS settings for the Backend service, following the steps in the Backend TLS: Gateway to Backend task. Using egctl to get the generated xDS cluster, you can see the TLS configuration for the Backend Cluster:
Delete the BackendTLSPolicy CRD, as the upgrade guide suggests:
EG reports the following error message in the logs:
The generated xDS no longer contains the TLS configuration for the Backend Cluster, and client requests fail.
Failed request example. As the error message suggests, Envoy sends a plain HTTP request to the backend service, which is an HTTPS server:
The xDS cluster does not contain the TLS configuration for the Backend Cluster:
Similarly, the client request will also fail if any of the Gateway CRDs or EG CRDs have breaking changes and need to be deleted and recreated during the upgrade process. If any of the CRs need to be recreated/modified due to breaking changes, the same issue will occur as well.
Disable xDS translation during the upgrade process
One way to avoid this issue is to disable xDS translation during the upgrade process. This way, EG will not generate xDS configuration from broken CRs or from any temporary intermediate state during the upgrade, and Envoy will continue to use the existing xDS configuration that was generated with the correct version of the CRDs before the upgrade. Once the upgrade is complete, xDS translation can be re-enabled, and EG will generate the xDS configuration with the updated CRDs.
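To make the proposal concrete, here is a minimal Go sketch of the idea: a gate that keeps serving the last snapshot published before the upgrade while translation is paused. It is an illustration only, not EG's implementation. `Snapshot`, `TranslationGate`, `Pause`, `Resume`, and `Publish` are hypothetical names, and the `Clusters` map is a simplified stand-in for real xDS resources.

```go
package main

import (
	"fmt"
	"sync"
)

// Snapshot is a hypothetical stand-in for a generated xDS configuration.
type Snapshot struct {
	Version  string
	Clusters map[string]bool // cluster name -> whether upstream TLS is configured
}

// TranslationGate (hypothetical) remembers the last published snapshot
// and can be paused for the duration of an upgrade.
type TranslationGate struct {
	mu       sync.Mutex
	paused   bool
	lastGood *Snapshot
}

// Pause stops new snapshots from being published, e.g. before CRDs are deleted.
func (g *TranslationGate) Pause() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.paused = true
}

// Resume allows publishing again, e.g. after the new CRDs are installed.
func (g *TranslationGate) Resume() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.paused = false
}

// Publish returns the snapshot that should be sent to Envoy. While the gate
// is paused and a previous snapshot exists, the candidate is discarded and
// the previous snapshot is returned instead.
func (g *TranslationGate) Publish(candidate *Snapshot) *Snapshot {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.paused && g.lastGood != nil {
		return g.lastGood
	}
	g.lastGood = candidate
	return candidate
}

func main() {
	gate := &TranslationGate{}

	// Before the upgrade: the backend cluster has TLS configured.
	gate.Publish(&Snapshot{Version: "v1", Clusters: map[string]bool{"backend": true}})

	// During the upgrade: the CRD is missing, so a fresh translation would drop TLS.
	gate.Pause()
	out := gate.Publish(&Snapshot{Version: "v2", Clusters: map[string]bool{"backend": false}})
	fmt.Println(out.Version, out.Clusters["backend"])

	// After the upgrade: translation resumes with the recreated CRD.
	gate.Resume()
	out = gate.Publish(&Snapshot{Version: "v3", Clusters: map[string]bool{"backend": true}})
	fmt.Println(out.Version, out.Clusters["backend"])
}
```

In this sketch the broken intermediate snapshot is never delivered, so the data plane keeps the pre-upgrade TLS settings until translation is resumed. How such a pause would be triggered in practice (a flag, an annotation, or something else) is left open here.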
Alternative:
Disable xDS server while upgrading.
cc @envoyproxy/gateway-maintainers