RamenDR / ramen

Fix argocd create app idempotency #1468

Closed nirs closed 2 weeks ago

nirs commented 2 weeks ago

If the argocd test fails midway (e.g. a timeout waiting for the test app), the next attempt fails instead of continuing the test.

I reproduced the issue using this change:

diff --git a/test/addons/argocd/test b/test/addons/argocd/test
index b41e7877..9273f805 100755
--- a/test/addons/argocd/test
+++ b/test/addons/argocd/test
@@ -111,10 +111,12 @@ if len(sys.argv) != 4:
 hub, *clusters = sys.argv[1:]

 for cluster in clusters:
     deploy_guestbook(hub, cluster)

+sys.exit("simulating failure")
+
 for cluster in clusters:
     wait_until_guestbook_is_healthy(hub, cluster)

 for cluster in clusters:
     undeploy_guestbook(hub, cluster)

Example run reproducing the issue:

$ addons/argocd/test hub dr1 dr2
Deploying application guestbook-dr1 in namespace argocd-test on cluster dr1
application 'guestbook-dr1' created
Deploying application guestbook-dr2 in namespace argocd-test on cluster dr2
application 'guestbook-dr2' created
simulating failure

$ addons/argocd/test hub dr1 dr2
Deploying application guestbook-dr1 in namespace argocd-test on cluster dr1
Traceback (most recent call last):
  File "/home/nsoffer/src/ramen/test/addons/argocd/test", line 114, in <module>
    deploy_guestbook(hub, cluster)
  File "/home/nsoffer/src/ramen/test/addons/argocd/test", line 30, in deploy_guestbook
    for line in commands.watch(
  File "/home/nsoffer/src/ramen/test/drenv/commands.py", line 190, in watch
    raise Error(args, error, exitcode=p.returncode)
drenv.commands.Error: Command failed:
   command: ('argocd', 'app', 'create', 'guestbook-dr1',
       '--repo=https://github.com/argoproj/argocd-example-apps.git', '--path=guestbook',
       '--dest-name=dr1', '--dest-namespace=argocd-test', '--sync-option=CreateNamespace=true',
       '--sync-policy=automated')
   exitcode: 20
   error:
      time="2024-06-21T17:08:53+03:00" level=fatal msg="rpc error: code = InvalidArgument desc =
      existing application spec is different, use upsert flag to force update"

Adding the --upsert flag fixes the issue.
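
For illustration, a minimal sketch of what the create command looks like with the flag added. The arguments are copied from the failing command in the traceback above; the helper name `create_app_command` and its parameters are hypothetical and are not taken from the test script, which may build the command differently.

```python
def create_app_command(name, cluster, namespace="argocd-test"):
    """
    Build an `argocd app create` command line that can be re-run safely.

    Hypothetical helper for illustration only; the real test script is
    not shown here and may be structured differently.
    """
    return [
        "argocd",
        "app",
        "create",
        name,
        "--repo=https://github.com/argoproj/argocd-example-apps.git",
        "--path=guestbook",
        f"--dest-name={cluster}",
        f"--dest-namespace={namespace}",
        "--sync-option=CreateNamespace=true",
        "--sync-policy=automated",
        # Update the existing application instead of failing with
        # "existing application spec is different" on a second run.
        "--upsert",
    ]
```

The resulting list can be passed to whatever runs the command; with `--upsert`, a second run after a partial failure updates the existing application instead of exiting with an error.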

The command should be idempotent if the desired spec is the same as the existing spec, which should be the case in the test, but for some reason the specs differ. We need to open an argocd issue for this.

nirs commented 2 weeks ago

Added another commit to increase the argocd test wait timeout, since it fails consistently now.

nirs commented 2 weeks ago

The CI run succeeded on the first try with a 120-second timeout: https://github.com/RamenDR/ramen/actions/runs/9616066710

With a 60-second timeout, the first try failed and the second succeeded: https://github.com/RamenDR/ramen/actions/runs/9615087975/job/26523802354

I think the issue is that the test runs at the same time we deploy the ceph cluster and pool, which is very CPU intensive for some reason.
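
To make the timeout trade-off concrete, here is a generic sketch of a polling wait with a deadline. It is not the code used by the test; `wait_until`, `check`, and the injectable `clock` and `sleep` parameters are assumptions made for the example. A longer timeout only costs time when the check keeps failing, so raising it from 60 to 120 seconds does not slow down runs on an idle machine.

```python
import time


def wait_until(check, timeout=120.0, delay=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """
    Poll check() until it returns True or timeout seconds have passed.

    Hypothetical helper; clock and sleep are injectable so the function
    can be tested without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(delay)
```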