argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Missing resource tree for apps destined for connect gateway / GKE attached clusters #17900

Closed torfjor closed 4 months ago

torfjor commented 6 months ago

This is actually not a bug in Argo CD, but posting it here for anyone else that might be hitting the same issue.

Background: GKE attached clusters use an agent for connectivity to kube-apiserver from GCP, gke-connect.

Issue: When the fronting Connect Gateway loses connectivity with the gke-connect agent inside the cluster, it responds with an HTTP 404 to all API server requests. This is especially problematic for Argo CD's watch requests, as gitops-engine will then stop watching the affected resource. If the connectivity issue persists for an extended period, Argo CD ends up with no watches for the target cluster, and its Redis cache for the apps on that cluster is emptied.

Possible workaround: Deploy a reverse proxy in front of the Connect Gateway that rewrites 404 responses to a more appropriate response code whenever the response body contains the message "cannot connect to agent or connection unexpectedly terminated for (project: xxx, membership: xxx)".

We have reported the issue to the team working on Connect Gateway, but given that the product is already GA, changing the response code for unhealthy backends might be a breaking change that is not easy to make.

Sanitized log line from connectgateway.googleapis.com:

{
  "protoPayload": {
    "@type": "type.googleapis.com/google.cloud.audit.AuditLog",
    "status": {
      "code": 5,
      "message": "cannot connect to agent or connection unexpectedly terminated for (project: xxx, membership: xxx)"
    },
    "serviceName": "connectgateway.googleapis.com",
    "methodName": "google.cloud.gkeconnect.gateway.v1.GatewayService.GetResource",
    "authorizationInfo": [
      {
        "permission": "gkehub.gateway.get",
        "granted": true,
        "resourceAttributes": {},
        "permissionType": "ADMIN_READ"
      }
    ],
    "request": {
      "@type": "type.googleapis.com/google.api.HttpBody"
    }
  },
  "resource": {
    "type": "audited_resource",
    "labels": {
      "service": "connectgateway.googleapis.com",
      "method": "google.cloud.gkeconnect.gateway.v1.GatewayService.GetResource"
    }
  },
  "timestamp": "2024-04-18T19:11:35.858116698Z",
  "severity": "ERROR",
  "labels": {
    "k8s-request-path": "api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=5121570&watch=true"
  }
}
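To check whether a cluster is affected, audit log entries like the one above can be filtered for this condition. The following Go sketch assumes entries are available as JSON with the same shape as the sample; the type and function names are hypothetical. Status code 5 is `NOT_FOUND` in `google.rpc.Code`, which is what the gateway surfaces as HTTP 404.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// rpcNotFound is google.rpc.Code NOT_FOUND.
const rpcNotFound = 5

// auditEntry models only the fields of the log entry that are needed here.
type auditEntry struct {
	ProtoPayload struct {
		Status struct {
			Code    int    `json:"code"`
			Message string `json:"message"`
		} `json:"status"`
		ServiceName string `json:"serviceName"`
	} `json:"protoPayload"`
	Labels map[string]string `json:"labels"`
}

// agentDisconnect reports whether raw is a Connect Gateway entry caused by an
// unreachable agent, and returns the Kubernetes request path that failed.
func agentDisconnect(raw []byte) (bool, string, error) {
	var e auditEntry
	if err := json.Unmarshal(raw, &e); err != nil {
		return false, "", err
	}
	hit := e.ProtoPayload.ServiceName == "connectgateway.googleapis.com" &&
		e.ProtoPayload.Status.Code == rpcNotFound &&
		strings.Contains(e.ProtoPayload.Status.Message, "cannot connect to agent")
	return hit, e.Labels["k8s-request-path"], nil
}

func main() {
	// Trimmed copy of the sanitized entry from this issue.
	sample := []byte(`{
	  "protoPayload": {
	    "status": {"code": 5, "message": "cannot connect to agent or connection unexpectedly terminated for (project: xxx, membership: xxx)"},
	    "serviceName": "connectgateway.googleapis.com"
	  },
	  "labels": {"k8s-request-path": "api/v1/configmaps?allowWatchBookmarks=true&resourceVersion=5121570&watch=true"}
	}`)
	hit, path, err := agentDisconnect(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(hit, path)
}
```

A hit on a path containing `watch=true` corresponds to a watch that Argo CD will treat as "not found" and stop.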

From the Argo CD side:

time="2024-04-18T21:12:57Z" level=info msg="Stop watching: ConfigMap not found" server="https://us-central1-connectgateway.googleapis.com/v1/projects/xxx/locations/us-central1/memberships/xxx"

torfjor commented 4 months ago

Small update on this: requests that fail because the agent is unavailable should now return an error code other than 404 (still in the 4xx range).

torfjor commented 4 months ago

Marking as closed