argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

CLI app sync --retry-limit does not work #4505

Open mmckane opened 4 years ago

mmckane commented 4 years ago

Describe the bug

argocd app sync myapp --retry-limit 5 does not work as expected. It does not seem to retry at all if another sync or operation is in progress. If another sync is already running, whether started by autosync or manually in the UI, the CLI errors out and exits without completing the sync.

To Reproduce

Start a manual sync of an app in the UI. Shortly after the UI sync has started, sync the same app with the command argocd app sync myapp --retry-limit 5. The CLI does not retry the sync; it exits with code 20 and prints the following:

time="2020-10-07T17:51:09-05:00" level=fatal msg="rpc error: code = FailedPrecondition desc = another operation is already in progress"

Expected behavior

CLI retries the sync properly with backoffs and does not exit with an error code.

Version

argocd: v1.7.7+33c93ae
  BuildDate: 2020-09-29T04:59:10Z
  GitCommit: 33c93aea0b9ee3d02fb9703cd82cecce3540e954
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: windows/amd64
argocd-server: v1.7.7+33c93ae
  BuildDate: 2020-09-29T04:56:23Z
  GitCommit: 33c93aea0b9ee3d02fb9703cd82cecce3540e954
  GitTreeState: clean
  GoVersion: go1.14.1
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: {Version:kustomize/v3.6.1 GitCommit:c97fa946d576eb6ed559f17f2ac43b3b5a8d5dbd BuildDate:2020-05-27T20:47:35Z GoOs:linux GoArch:amd64}
  Helm Version: version.BuildInfo{Version:"v3.2.0", GitCommit:"e11b7ce3b12db2941e90399e874513fbd24bcb71", GitTreeState:"clean", GoVersion:"go1.13.10"}
  Kubectl Version: v1.17.8

mmckane commented 4 years ago

I'm not familiar enough with the code base yet to make a PR to fix this, but to help someone else triage it: it appears that the CLI calls this function, which always exits with an error if an operation is in progress.

jessesuen commented 4 years ago

This is actually unrelated to --retry-limit. The error is "another operation is already in progress", which is the current design of Argo CD (we do not allow two operations to happen at once, nor do we allow operations to queue up).

The workaround for this is:

argocd app wait APPNAME --operation && argocd app sync APPNAME

mmckane commented 4 years ago

We are using that workaround but still see errors 10-20% of the time in our pipeline. I was hoping this switch would help. Are there any plans to add queuing or retry logic to the CLI for the case where a pipeline hits this error?

Is the solution to just turn off autosync so our deploy pipeline doesn't hit this error, or are there other operations that could also result in this error?

boolafish commented 3 years ago

+1 for this. Seeing this error too.

cbl315 commented 3 years ago

Any update about this issue? Argocd version: v1.8.4

mmckane commented 3 years ago

We ended up turning off autosync, and it seems to have mitigated the issue a bit. We still have problems with app-of-apps setups that can be synced by multiple deployments: they fail because another deploy/sync is happening at the same time.

cbl315 commented 3 years ago

We ended up turning off autosync, and it seems to have mitigated the issue a bit. We still have problems with app-of-apps setups that can be synced by multiple deployments: they fail because another deploy/sync is happening at the same time.

Turning off autosync does not seem elegant; I hope there is a better way.

robermar23 commented 3 years ago

We have the same issue.

We are running version 2.

Multiple microservices handled by the same ArgoCD Application.

Builds start on commit. Each build calls: argocd app wait APPNAME --operation && argocd app sync APPNAME

If multiple builds are "waiting", once the first sync completes, the rest attempt to kick off their own sync at the same time, returning the same "operation already in progress" error.

Is selective sync our only solution here?

andrewm-aero commented 3 years ago

+1, getting bit by this too.

Given that ArgoCD is founded on the concept of declarative management, it seems bewildering that there's no single operation that says "Wait until synced to the latest, do whatever you need to ensure that happens, only fail if that is impossible or takes too long". In order to get pipelines which don't spuriously fail, we've been reduced to scraping the log output for that particular error message, and log scraping is generally a sign that something has gone wrong at a fundamental level.
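The log-scraping approach described above can be sketched as a small shell wrapper. This is only an illustration, not an Argo CD feature: the function name retry_while_busy, the attempt limit, and the doubling delay are assumptions, and the only detail taken from this thread is the error text "another operation is already in progress". It retries a command only when that text appears in its output and returns any other failure immediately.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the argocd CLI).
# Usage: retry_while_busy MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry_while_busy() {
  local max_attempts=$1 delay=$2
  shift 2
  local attempt=1 output status
  while true; do
    # Capture stdout and stderr together so the error text can be inspected.
    if output=$("$@" 2>&1); then
      status=0
    else
      status=$?
    fi
    if [ -n "$output" ]; then
      printf '%s\n' "$output"
    fi
    if [ "$status" -eq 0 ]; then
      return 0
    fi
    # Retry only on the "operation in progress" error; fail fast otherwise.
    case $output in
      *"another operation is already in progress"*) ;;
      *) return "$status" ;;
    esac
    if [ "$attempt" -ge "$max_attempts" ]; then
      return "$status"
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # simple exponential backoff
  done
}
```

A pipeline step could then call, for example, retry_while_busy 5 2 argocd app sync myapp. Matching on message text is fragile if the wording ever changes, which is the concern raised above.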

tdongsi commented 3 years ago

+1, getting this problem.

We want to have parallel pipelines to do argocd app sync and, optionally, argocd app rollback if there is a problem.

argocd app wait APPNAME --operation only helps if there are at most two active parallel pipelines. With more than that, we hit the same problem as described by @robermar23 above.

If multiple builds are "waiting", once the first sync completes, the rest attempt to kick off their own sync at the same time, returning the same "operation already in progress" error.

tdongsi commented 3 years ago

The workaround for me is to discard argocd app wait entirely and coordinate ArgoCD access (i.e., any argocd app commands) with a lock service in the CI system.

For example, my CI system happens to be Jenkins, so the Jenkins-specific solution looks like this in a Jenkinsfile:

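    def jobs = [:]
    for (String app : apps) {
      // Bind a per-iteration copy: closures that capture the loop variable
      // directly can all end up seeing its final value.
      def appName = app
      jobs[appName] = {
        // lock() comes from the Lockable Resources plugin; one holder at a time.
        lock('service/argocd') {
          sh "argocd app sync ${appName}"
        }
      }
    }

    // Branches run in parallel; the lock serializes the argocd calls.
    parallel jobs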

In this case, multiple pipelines (and their forks due to parallel steps) wait for and obtain the lock named service/argocd before proceeding with the argocd app sync command.
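For CI systems without a built-in lock step, a similar effect can be sketched with flock(1) from util-linux. This assumes that all pipeline jobs run on the same host and can see the same lock file, which is a much narrower setup than a Jenkins-wide lock. The function name with_argocd_lock, the ARGOCD_LOCK_FILE variable, the default path, and the 600-second timeout are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command while holding an exclusive file lock.
# Only serializes jobs that share the same host and the same lock file.
with_argocd_lock() {
  local lock_file=${ARGOCD_LOCK_FILE:-/tmp/argocd-sync.lock}
  (
    # Wait up to 600 seconds for the lock on file descriptor 9.
    flock -w 600 9 || exit 1
    "$@"
  ) 9>"$lock_file"
}
```

A job would then run with_argocd_lock argocd app sync myapp. The lock is released when the subshell exits, whether or not the command succeeded.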