argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

`argocd app sync` CLI command takes a long time to sync and seems to lose gRPC events. #12214

Open alionski opened 1 year ago

alionski commented 1 year ago

Describe the bug

This issue is very similar to another one I reported, https://github.com/argoproj/argo-cd/issues/12211, and is related to https://github.com/argoproj/argo-cd/issues/7284 and https://github.com/argoproj/argo-cd/issues/8910.

When syncing an application via the CLI with `argocd app sync`, the status reported by the CLI lags behind the updates in the application CR, and the last updates in the CR are either missed by the CLI client or never sent by the server. Adding the --grpc-web flag makes no difference.

I was looking into why the CLI is so slow in our CI even when there are no updates to the application and thus no syncing is needed (last time it took over 1 hour). The `argocd app sync` command basically hangs until the timeout is reached. To figure out what the client was seeing vs. what was happening in the cluster, I added the following (quite aggressive and chatty) logs to the CLI client built from master:

// check if resource health, sync and operation statuses matches watch options
func checkResourceStatus(watch watchOpts, healthStatus string, syncStatus string, operationStatus *argoappv1.Operation) bool {
       < ... omitted code ... >
    synced := !watch.sync || syncStatus == string(argoappv1.SyncStatusCodeSynced)
    operational := !watch.operation || operationStatus == nil
    now := time.Now()
    log.Errorf("[%s] checkResourceStatus(), synced: %v", now, synced)
    log.Errorf("[%s] checkResourceStatus(), value of watch.operation: %v", now, watch.operation)
    log.Errorf("[%s] checkResourceStatus(), operation status: %v", now, operationStatus)
    log.Errorf("[%s] checkResourceStatus(), operational: %v", now, operational)
    log.Errorf("[%s] checkResourceStatus(), health check passed: %v", now, healthCheckPassed)
    return synced && healthCheckPassed && operational
}
         < ... omitted code ... >
    for appEvent := range appEventCh {
        log.Errorf("[%s] waitOnApplicationStatus(), got an event", time.Now())
        app = &appEvent.Application

        operationInProgress := false
        // consider the operation is in progress
        if app.Operation != nil {
            // if it just got requested
            operationInProgress = true
            if !app.Operation.DryRun() {
                refresh = true
            }
        } else if app.Status.OperationState != nil {
            if app.Status.OperationState.FinishedAt == nil {
                // if it is not finished yet
                log.Errorf("[%s] waitOnApplicationStatus(), operation not finished yet", time.Now())
                operationInProgress = true
            } else if !app.Status.OperationState.Operation.DryRun() && (app.Status.ReconciledAt == nil || app.Status.ReconciledAt.Before(app.Status.OperationState.FinishedAt)) {
                // if it is just finished and we need to wait for controller to reconcile app once after syncing
                operationInProgress = true
                log.Errorf("[%s] waitOnApplicationStatus(), operation just finished, refreshing", time.Now())
            }
        }

        var selectedResourcesAreReady bool

        // If selected resources are included, wait only on those resources, otherwise wait on the application as a whole.
        if len(selectedResources) > 0 {
            selectedResourcesAreReady = true
            for _, state := range getResourceStates(app, selectedResources) {
                resourceIsReady := checkResourceStatus(watch, state.Health, state.Status, appEvent.Application.Operation)
                if !resourceIsReady {
                    selectedResourcesAreReady = false
                    break
                }
            }
        } else {
            // Wait on the application as a whole
            selectedResourcesAreReady = checkResourceStatus(watch, string(app.Status.Health.Status), string(app.Status.Sync.Status), appEvent.Application.Operation)
        }

        log.Errorf("[%s] waitOnApplicationStatus(), resourceIsReady value: %v", time.Now(), selectedResourcesAreReady)

        if selectedResourcesAreReady && (!operationInProgress || !watch.operation) {
            app = printFinalStatus(app)
            return app, nil
        }
                < ... omitted code ... >
            < ... omitted code ... >
            if err == nil {
                var wc applicationpkg.ApplicationService_WatchClient
                wc, err = appIf.Watch(ctx, &applicationpkg.ApplicationQuery{
                    Name:            &appName,
                    AppNamespace:    &appNs,
                    ResourceVersion: &revision,
                })
                if err == nil {
                    for {
                        var appEvent *v1alpha1.ApplicationWatchEvent
                        appEvent, err = wc.Recv()
                        log.Errorf("[%s] WatchApplicationWithRetry(), app event: %v", time.Now(), appEvent)
                        if err != nil {
                            log.Errorf("[%s] WatchApplicationWithRetry(), recv failed: %s", time.Now(), err)
                            break
                        }
                        revision = appEvent.Application.ResourceVersion
                        appEventsCh <- appEvent
                    }
                }
            }
            if err != nil {
                log.Errorf("[%s] WatchApplicationWithRetry(), failed to create client: %s", time.Now(), err)
                if isCanceledContextErr(err) {
                    cancelled = true
                    log.Errorf("[%s] WatchApplicationWithRetry(), context cancelled", time.Now())
                } else {
                    time.Sleep(1 * time.Second)
                    log.Errorf("[%s] WatchApplicationWithRetry(): sleeping for 1 sec", time.Now())
                }
            }
            if conn != nil {
                _ = conn.Close()
            }
             < ... omitted code ... >

Then I started watching the application CR, deleted the ConfigMap redacted-cm-name that belongs to it, and executed `argocd app sync`. Output of the watch:

$ kubectl get -w app redacted-app-name -ojson | jq -r '.status.reconciledAt, .operation // "No operation in progress", .status.health.status, .status.sync.status'
2023-01-30T13:07:14Z
No operation in progress
Healthy
Synced
2023-01-30T13:07:14Z    <---- me deleting the config map 
No operation in progress
Healthy
OutOfSync   
2023-01-30T13:07:14Z
{
  "initiatedBy": {
    "username": "ci"
  },
  "retry": {},
  "sync": {
    "revision": "77d094415869d18dcb54b7245a109086386b3b3a",
    "syncStrategy": {
      "hook": {}
    }
  }
}
Healthy
OutOfSync
2023-01-30T13:07:14Z
{
  "initiatedBy": {
    "username": "ci"
  },
  "retry": {},
  "sync": {
    "revision": "77d094415869d18dcb54b7245a109086386b3b3a",
    "syncStrategy": {
      "hook": {}
    }
  }
}
Healthy
OutOfSync
2023-01-30T13:10:14Z
{
  "initiatedBy": {
    "username": "ci"
  },
  "retry": {},
  "sync": {
    "revision": "77d094415869d18dcb54b7245a109086386b3b3a",
    "syncStrategy": {
      "hook": {}
    }
  }
}
Healthy
OutOfSync
2023-01-30T13:10:14Z
{
  "initiatedBy": {
    "username": "ci"
  },
  "retry": {},
  "sync": {
    "revision": "77d094415869d18dcb54b7245a109086386b3b3a",
    "syncStrategy": {
      "hook": {}
    }
  }
}
Healthy
Synced
2023-01-30T13:10:14Z
{
  "initiatedBy": {
    "username": "ci"
  },
  "retry": {},
  "sync": {
    "revision": "77d094415869d18dcb54b7245a109086386b3b3a",
    "syncStrategy": {
      "hook": {}
    }
  }
}
Healthy
Synced
2023-01-30T13:10:14Z
No operation in progress
Healthy
Synced
2023-01-30T13:10:17Z
No operation in progress
Healthy
Synced

and the output of `argocd app sync` from the CLI client built from master with my logs (prepended with ----------------):

$ /home/alenahramyka/code/argo-cd/dist/argocd app sync -l component=redacted-component
----------------ERRO[0004] [2023-01-30 14:10:17.333326614 +0100 CET m=+4.666303920] WatchApplicationWithRetry(), app event: &ApplicationWatchEvent{Type:ADDED, <the rest of the ApplicationWatchEvent object>} 
----------------ERRO[0004] [2023-01-30 14:10:17.334400462 +0100 CET m=+4.667377759] waitOnApplicationStatus(), got an event 
----------------ERRO[0004] [2023-01-30 14:10:17.334510074 +0100 CET m=+4.667487385] checkResourceStatus(), synced: true 
----------------ERRO[0004] [2023-01-30 14:10:17.334510074 +0100 CET m=+4.667487385] checkResourceStatus(), value of watch.operation: true 
----------------ERRO[0004] [2023-01-30 14:10:17.334510074 +0100 CET m=+4.667487385] checkResourceStatus(), operation status: &Operation{Sync:&SyncOperation{Revision:<revision>,Prune:false,DryRun:false,SyncStrategy:&SyncStrategy{Apply:nil,Hook:&SyncStrategyHook{SyncStrategyApply:SyncStrategyApply{Force:false,},},},Resources:[]SyncOperationResource{},Source:nil,Manifests:[],SyncOptions:[],Sources:[]ApplicationSource{},Revisions:[],},InitiatedBy:OperationInitiator{Username:ci,Automated:false,},Info:[]*Info{},Retry:RetryStrategy{Limit:0,Backoff:nil,},} 
----------------ERRO[0004] [2023-01-30 14:10:17.334510074 +0100 CET m=+4.667487385] checkResourceStatus(), operational: false 
----------------ERRO[0004] [2023-01-30 14:10:17.334510074 +0100 CET m=+4.667487385] checkResourceStatus(), health check passed: true 
----------------ERRO[0004] [2023-01-30 14:10:17.334782176 +0100 CET m=+4.667759472] waitOnApplicationStatus(), resourceIsReady value: false 
TIMESTAMP                  GROUP                            KIND           NAMESPACE                  NAME            STATUS    HEALTH        HOOK  MESSAGE
2023-01-30T14:10:17+01:00  rbac.authorization.k8s.io  ClusterRoleBinding              neo4j-database-transformer      Synced                        
2023-01-30T14:10:17+01:00  rbac.authorization.k8s.io  ClusterRole                   
----------------ERRO[0004] [2023-01-30 14:10:17.334494936 +0100 CET m=+4.667472269] WatchApplicationWithRetry(), app event: &ApplicationWatchEvent{Type:MODIFIED, <the rest of the ApplicationWatchEvent object>} 
     redacted-cr-name            Synced                        
2023-01-30T14:10:17+01:00                                Service             default     redacted-svc-name            Synced   Healthy              
2023-01-30T14:10:17+01:00                             ServiceAccount         default     redacted-sa-name             Synced                        
2023-01-30T14:10:17+01:00                             ServiceAccount         default     redacted-sa-name             Synced                        
2023-01-30T14:10:17+01:00   apps                      StatefulSet            default     redacted-sts-name            Synced   Healthy              
2023-01-30T14:10:17+01:00  rbac.authorization.k8s.io  ClusterRole                        redacted-cr-name             Synced                        
2023-01-30T14:10:17+01:00  rbac.authorization.k8s.io  ClusterRoleBinding                 redacted-crb-name            Synced                        
2023-01-30T14:10:17+01:00                              ConfigMap             default     redacted-cm-name             OutOfSync  Missing              
----------------ERRO[0004] [2023-01-30 14:10:17.336098753 +0100 CET m=+4.669076072] waitOnApplicationStatus(), got an event 
----------------ERRO[0004] [2023-01-30 14:10:17.336144753 +0100 CET m=+4.669122053] checkResourceStatus(), synced: true 
----------------ERRO[0004] [2023-01-30 14:10:17.336144753 +0100 CET m=+4.669122053] checkResourceStatus(), value of watch.operation: true 
----------------ERRO[0004] [2023-01-30 14:10:17.336144753 +0100 CET m=+4.669122053] checkResourceStatus(), operation status: &Operation{Sync:&SyncOperation{Revision:77d094415869d18dcb54b7245a109086386b3b3a,Prune:false,DryRun:false,SyncStrategy:&SyncStrategy{Apply:nil,Hook:&SyncStrategyHook{SyncStrategyApply:SyncStrategyApply{Force:false,},},},Resources:[]SyncOperationResource{},Source:nil,Manifests:[],SyncOptions:[],Sources:[]ApplicationSource{},Revisions:[],},InitiatedBy:OperationInitiator{Username:ci,Automated:false,},Info:[]*Info{},Retry:RetryStrategy{Limit:0,Backoff:nil,},} 
----------------ERRO[0004] [2023-01-30 14:10:17.336144753 +0100 CET m=+4.669122053] checkResourceStatus(), operational: false 
----------------ERRO[0004] [2023-01-30 14:10:17.336144753 +0100 CET m=+4.669122053] checkResourceStatus(), health check passed: true 
----------------ERRO[0004] [2023-01-30 14:10:17.336388763 +0100 CET m=+4.669366054] WatchApplicationWithRetry(), app event: &ApplicationWatchEvent{Type:MODIFIED, <the rest of the ApplicationWatchEvent object>} 
----------------ERRO[0004] [2023-01-30 14:10:17.355148645 +0100 CET m=+4.688125891] waitOnApplicationStatus(), resourceIsReady value: false 
2023-01-30T14:10:17+01:00          ConfigMap     default  redacted-cm-name    Synced  Missing              
----------------ERRO[0004] [2023-01-30 14:10:17.36098947 +0100 CET m=+4.693966714] waitOnApplicationStatus(), got an event 
----------------ERRO[0004] [2023-01-30 14:10:17.361006127 +0100 CET m=+4.693983371] checkResourceStatus(), synced: true 
----------------ERRO[0004] [2023-01-30 14:10:17.361006127 +0100 CET m=+4.693983371] checkResourceStatus(), value of watch.operation: true 
----------------ERRO[0004] [2023-01-30 14:10:17.361006127 +0100 CET m=+4.693983371] checkResourceStatus(), operation status: &Operation{Sync:&SyncOperation{Revision:77d094415869d18dcb54b7245a109086386b3b3a,Prune:false,DryRun:false,SyncStrategy:&SyncStrategy{Apply:nil,Hook:&SyncStrategyHook{SyncStrategyApply:SyncStrategyApply{Force:false,},},},Resources:[]SyncOperationResource{},Source:nil,Manifests:[],SyncOptions:[],Sources:[]ApplicationSource{},Revisions:[],},InitiatedBy:OperationInitiator{Username:ci,Automated:false,},Info:[]*Info{},Retry:RetryStrategy{Limit:0,Backoff:nil,},} 
----------------ERRO[0004] [2023-01-30 14:10:17.361006127 +0100 CET m=+4.693983371] checkResourceStatus(), operational: false 
----------------ERRO[0004] [2023-01-30 14:10:17.361006127 +0100 CET m=+4.693983371] checkResourceStatus(), health check passed: true 
----------------ERRO[0004] [2023-01-30 14:10:17.361068657 +0100 CET m=+4.694045892] waitOnApplicationStatus(), resourceIsReady value: false 
----------------ERRO[0064] [2023-01-30 14:11:17.351898199 +0100 CET m=+64.684875502] WatchApplicationWithRetry(), app event: nil 
----------------ERRO[0064] [2023-01-30 14:11:17.351987523 +0100 CET m=+64.684964857] WatchApplicationWithRetry(), recv failed: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0064] [2023-01-30 14:11:17.352056007 +0100 CET m=+64.685033311] WatchApplicationWithRetry(), failed to create client: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0065] [2023-01-30 14:11:18.352451972 +0100 CET m=+65.685429229] WatchApplicationWithRetry(): sleeping for 1 sec 
----------------ERRO[0145] [2023-01-30 14:12:38.398583165 +0100 CET m=+145.731560467] WatchApplicationWithRetry(), app event: nil 
----------------ERRO[0145] [2023-01-30 14:12:38.398665858 +0100 CET m=+145.731643158] WatchApplicationWithRetry(), recv failed: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0145] [2023-01-30 14:12:38.398707924 +0100 CET m=+145.731685209] WatchApplicationWithRetry(), failed to create client: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0146] [2023-01-30 14:12:39.399033445 +0100 CET m=+146.732010692] WatchApplicationWithRetry(): sleeping for 1 sec 
----------------ERRO[0226] [2023-01-30 14:13:59.456123928 +0100 CET m=+226.789101176] WatchApplicationWithRetry(), app event: nil 
----------------ERRO[0226] [2023-01-30 14:13:59.456151803 +0100 CET m=+226.789129051] WatchApplicationWithRetry(), recv failed: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0226] [2023-01-30 14:13:59.456165416 +0100 CET m=+226.789142664] WatchApplicationWithRetry(), failed to create client: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0227] [2023-01-30 14:14:00.456208064 +0100 CET m=+227.789185310] WatchApplicationWithRetry(): sleeping for 1 sec 
----------------ERRO[0307] [2023-01-30 14:15:20.481889341 +0100 CET m=+307.814866646] WatchApplicationWithRetry(), app event: nil 
----------------ERRO[0307] [2023-01-30 14:15:20.481981925 +0100 CET m=+307.814959209] WatchApplicationWithRetry(), recv failed: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0307] [2023-01-30 14:15:20.482028343 +0100 CET m=+307.815005629] WatchApplicationWithRetry(), failed to create client: rpc error: code = Unavailable desc = closing transport due to: connection error: desc = "error reading from server: EOF", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: "too_many_pings" 
----------------ERRO[0308] [2023-01-30 14:15:21.482490742 +0100 CET m=+308.815468037] WatchApplicationWithRetry(): sleeping for 1 sec

<<<< These `Unavailable desc` errors continue until the timeout is reached >>>>>

As you can see, the sync finished very quickly and the Operation disappears from the CR at 2023-01-30T13:10:14Z (note that the time in the logs is CET, not Zulu). However, the CLI was receiving messages with a delay: the last message the client received still contained the Operation, and it never got a message indicating that the application had reached the desired state (I assume it has to be an event of type DELETED, since the Operation is gone from the CR). The result is that the sync hangs until the timeout is reached instead of exiting once the app is synced and healthy. At the same time, the UI followed the changes in the CR without delay and showed the correct status.
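For context on the `ENHANCE_YOUR_CALM` / `too_many_pings` GOAWAY lines in the output above: a gRPC server sends this when a client emits HTTP/2 keepalive pings more often than the server's keepalive enforcement policy allows, after which any in-flight stream fails with code Unavailable, exactly as in the logs. Below is a minimal grpc-go sketch of that mismatch; it is not Argo CD's actual wiring, and the address, intervals, and function names are made up for illustration.

package main

import (
    "time"

    "google.golang.org/grpc"
    "google.golang.org/grpc/keepalive"
)

// dialWithAggressivePings configures a client that sends a keepalive ping
// every 10 seconds. If the server's EnforcementPolicy.MinTime is larger
// (grpc-go defaults to 5 minutes), the server eventually answers with
// GOAWAY(ENHANCE_YOUR_CALM, "too_many_pings") and Recv() on any open stream
// returns an Unavailable error like the ones above.
func dialWithAggressivePings(addr string) (*grpc.ClientConn, error) {
    return grpc.Dial(addr,
        grpc.WithInsecure(), // plaintext only to keep the sketch short
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:                10 * time.Second, // interval between pings
            Timeout:             20 * time.Second, // how long to wait for a ping ack
            PermitWithoutStream: true,             // ping even when no RPC is active
        }),
    )
}

// newStrictServer shows the server-side knob: pings arriving more often than
// MinTime count as "bad" pings and eventually trigger the GOAWAY.
func newStrictServer() *grpc.Server {
    return grpc.NewServer(
        grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
            MinTime:             5 * time.Minute, // anything more frequent is "too many pings"
            PermitWithoutStream: false,
        }),
    )
}

func main() {
    _, _ = dialWithAggressivePings("localhost:8080") // hypothetical address
    _ = newStrictServer()
}

Whether the Argo CD CLI and API server actually hit these particular defaults is not something I can confirm from the logs alone, but the error signature is the one gRPC produces for this keepalive mismatch, and it would explain why the watch connection keeps getting torn down and re-established roughly every 80 seconds in the output above.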

To Reproduce

1. Create an application with resources that take a very short time to create/sync, e.g. a ConfigMap.
2. Sync the application and wait for it to become Synced and Healthy.
3. Delete the ConfigMap in the cluster and wait for the app to switch to the OutOfSync state.
4. Execute `argocd app sync`, compare the sync status reported by the UI and the CLI, and also watch the application CR in the cluster.

Expected behavior

`argocd app sync` takes the same time as syncing from the UI and doesn't lose gRPC events.

Screenshots

N/A

Version

argocd: v2.5.6+unknown
  BuildDate: 2023-01-12T11:11:45Z
  GitCommit: 
  GitTreeState: 
  GitTag: 2.5.6
  GoVersion: go1.19.4
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.5.2+148d8da
  BuildDate: 2022-11-07T16:42:47Z
  GitCommit: 148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7
  GitTreeState: clean
  GoVersion: go1.18.8
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v4.5.7 2022-08-02T16:35:54Z
  Helm Version: v3.10.1+g9f88ccb
  Kubectl Version: v0.24.2
  Jsonnet Version: v0.18.0

Logs

See above. I didn't find anything helpful in the server logs.

mfabricanti commented 1 year ago

Happening here too in v2.6.0. (argocd: v2.6.0+acc554f)

We use this command: /usr/local/bin/argocd --server **** --grpc-web --auth-token **** app get --refresh --show-operation <appname>

and got: time="2023-02-14T18:50:47Z" level=fatal msg="rpc error: code = Unavailable desc = closing transport due to: connection error: desc = \"error reading from server: EOF\", received prior goaway: code: ENHANCE_YOUR_CALM, debug data: \"too_many_pings\""

SREhero100 commented 1 year ago

I'm having this problem as well. I tried it on v2.5.6 of argocd and earlier versions. Still no dice.

renperez-cpi commented 1 year ago

I am having the same issue as well with the latest version of the argocd CLI.

danielweil commented 1 year ago

Also have this issue, any solution?

miguelvr commented 1 year ago

I'm having the same issue as well. This has been happening for a few versions already.

We are currently on v2.6.7

zepeng811 commented 1 year ago

FYI we upgraded our server to the latest version, v2.7.8, and noticed the `argocd app sync` CLI was significantly faster than before.

The same sync that previously took 10+ minutes now finishes in under 2 minutes.

maxbrunet commented 9 months ago

Fixed by #15708 I believe

deponian commented 6 months ago

Still broken :( My workaround:

kubectl wait --for=jsonpath='{.status.sync.status}'=Synced -n argocd applications.argoproj.io/kuvaq --timeout 300s
kubectl wait --for=jsonpath='{.status.sync.revision}'=commit-hash-here -n argocd applications.argoproj.io/kuvaq --timeout 300s
kubectl wait --for=jsonpath='{.status.operationState.phase}'=Succeeded -n argocd applications.argoproj.io/kuvaq --timeout 300s
kubectl wait --for=jsonpath='{.status.health.status}'=Healthy -n argocd applications.argoproj.io/kuvaq --timeout 300s
lebe-dev commented 6 months ago

Suuuuuper slow

argocd: v2.11.0+unknown
  BuildDate: 2024-05-08T06:39:46Z
  GitCommit:
  GitTreeState:
  GitTag: 2.11.0
  GoVersion: go1.22.2
  Compiler: gc
  Platform: linux/amd64