This is actually not a bug in Argo CD, but posting it here for anyone else that might be hitting the same issue.
Background:
GKE attached clusters use an agent, gke-connect, for connectivity from GCP to the kube-apiserver.
Issue:
When the fronting Connect Gateway loses connectivity with the gke-connect agent inside the cluster, it responds with an HTTP 404 to all API server requests. This is especially problematic for Argo CD's watch requests, because gitops-engine stops watching a resource when its watch request returns 404. If the connectivity issue persists for an extended period, Argo CD ends up with no watches for the target cluster, and its Redis cache for the apps on that cluster is emptied.
Possible workaround:
Deploy a reverse proxy in front of the Connect Gateway that rewrites 404 responses to a more appropriate response code whenever the response body contains "cannot connect to agent or connection unexpectedly terminated for (project: xxx, membership: xxx)".
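A minimal sketch of such a proxy, using only the Python standard library. Several things here are assumptions rather than anything we confirmed: the replacement status (503), the upstream URL, and the names GatewayProxy, rewrite_status and serve are all made up for illustration. It handles GET only and does no TLS termination, so treat it as a starting point, not a production setup. Whether gitops-engine keeps its watches alive on a 503 should be verified separately.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Assumption: placeholder upstream, replace with the real Connect Gateway base URL.
UPSTREAM = "https://connectgateway.googleapis.com"

# Fixed part of the error text; the (project: ..., membership: ...) suffix varies.
AGENT_DOWN = b"cannot connect to agent or connection unexpectedly terminated"

# Request headers that must not be forwarded as-is. Dropping Accept-Encoding
# keeps the upstream body uncompressed so the marker can be found in it.
SKIP_HEADERS = {"host", "accept-encoding", "connection"}


def rewrite_status(status: int, body: bytes) -> int:
    """Map a 404 caused by a disconnected agent to 503 (an assumed choice)."""
    if status == 404 and AGENT_DOWN in body:
        return 503
    return status


class GatewayProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        headers = {k: v for k, v in self.headers.items()
                   if k.lower() not in SKIP_HEADERS}
        req = urllib.request.Request(UPSTREAM + self.path, headers=headers)
        try:
            resp = urllib.request.urlopen(req)
        except urllib.error.HTTPError as err:
            # Error bodies are small, so reading them fully is fine.
            body = err.read()
            self.send_response(rewrite_status(err.code, body))
            self.send_header("Content-Type",
                             err.headers.get("Content-Type", "text/plain"))
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
            return
        with resp:
            self.send_response(resp.status)
            self.send_header("Content-Type",
                             resp.headers.get("Content-Type", "application/json"))
            self.end_headers()
            # Stream chunk by chunk so long-lived watch responses keep flowing.
            while chunk := resp.read1(65536):
                self.wfile.write(chunk)
                self.wfile.flush()


def serve(port: int = 8080) -> None:
    """Run the proxy on localhost; blocks until interrupted."""
    ThreadingHTTPServer(("127.0.0.1", port), GatewayProxy).serve_forever()
```

Calling serve() starts the proxy on 127.0.0.1:8080, and the cluster's API server URL in Argo CD would then point at the proxy instead of the Connect Gateway. Genuine 404s (for example, a resource that really does not exist) pass through unchanged because their body does not contain the agent error text.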
We have reported the issue to the team working on Connect Gateway, but given that the product is already GA, changing the response code for unhealthy backends might be a breaking change that is not easy to make.
Sanitized log line from connectgateway.googleapis.com:
From the Argo CD side: