argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

CMP sidecar reverting to default 60s manifest generation time on create/upsert but not sync #9695

Open notfromstatefarm opened 2 years ago

notfromstatefarm commented 2 years ago

Describe the bug

Our manifests take a little bit over a minute to generate. We've increased the timeouts as follows:

server.repo.server.timeout.seconds: "600"
controller.repo.server.timeout.seconds: "600"

However, the deadline randomly reverts to the default 60s.

To Reproduce

Increase timeout, generate manifests.

Expected behavior

Timeout is always respected

Version

argocd-server: v2.3.4+ac8b7df
  BuildDate: 2022-05-18T11:41:37Z
  GitCommit: ac8b7df9467ffcc0920b826c62c4b603a7bfed24
  GitTreeState: clean
  GoVersion: go1.17.10
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

Logs

cobalt-infra time="2022-06-16T14:08:35Z" level=info msg="Generating manifests with deadline 9m58.051820106s from now"
cobalt-infra time="2022-06-16T14:09:03Z" level=info msg="Generating manifests with deadline 9m58.137818149s from now"
cobalt-infra time="2022-06-16T14:12:19Z" level=info msg="Generating manifests with deadline 58.197873097s from now"
cobalt-infra time="2022-06-16T14:13:14Z" level=info msg="Generating manifests with deadline 9m58.286776556s from now"
crenshaw-dev commented 2 years ago

Do you have any way of inferring the source of the repo server GenerateManifests call which is using the too-short timeout?

i.e. can you correlate it to controller and API server logs to determine which is sending the wrong timeout? (The plugin respects whatever timeout the caller specifies, adding a small buffer so it can send a response before the caller quits listening.)

notfromstatefarm commented 2 years ago

I just realized that the 60s timeout is occurring when we call argocd app create --upsert and the correct timeout is when we subsequently call argocd app sync.

Basically every time we deploy from our CI, we upsert the application and then trigger a sync. I didn't realize manifest generation occurred with syncs as well. Maybe we shouldn't be syncing after an upsert. Nonetheless, the timeout for the create step isn't being respected.

As far as I can tell the only thing making calls to argocd-repo-server is argocd-server. Not sure where to look, exactly.

crenshaw-dev commented 2 years ago

Looking at the API server code, I don't see any direct calls to the repo server on Create/Upsert. So I think it's triggering some behavior in the controller to call the repo server (a different path from what happens on sync).

Unfortunately, I don't have much knowledge of the controller code. Would you be able to bump the controller log level to debug and post, say, the 5s of logs before the incorrect timeout is logged?

notfromstatefarm commented 2 years ago

I had a theory it might be because the Helm chart I'm using only configures the --repo-server-timeout arg on the application controller.. but then I tried configuring the argocd-cmd-params-cm configmap manually, restarted all the pods, and still got the same result.

I logged application-controller at debug level and there was nothing logged until the manifests returned successfully and the application was created. The "Generating manifests with deadline 59.068003034s from now" line was printed long before application-controller showed anything.

notfromstatefarm commented 2 years ago

Maybe I did something wrong with the configmap and the server's repo-server timeout isn't being applied, while the application controller's is, thanks to the arg?

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  labels:
    app.kubernetes.io/name: argocd-cmd-params-cm
    app.kubernetes.io/part-of: argocd
data:
  server.repo.server.timeout.seconds: "600"
  controller.repo.server.timeout.seconds: "600"
crenshaw-dev commented 2 years ago

Bummer that the controller logs aren't more verbose. Or maybe it's not actually involved.

Could you bump the API logs to debug and see if there are any messages that seem related to the repo-server call?

notfromstatefarm commented 2 years ago

Nothing there either, I'm afraid. Here's the entirety of the logs from my ArgoCD namespace when I run the argocd app create command, up to the timeout being displayed. This is with debug mode enabled on argocd-server.

[argo-argocd-server-67cd44d6b-pscsw] time="2022-06-16T23:53:56Z" level=info msg="received unary call /application.ApplicationService/Create" grpc.method=Create grpc.request.claims="redacted" grpc.service=application.ApplicationService grpc.start_time="2022-06-16T23:53:56Z" span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=TestRepository grpc.request.deadline="2022-06-16T23:54:57Z" grpc.service=repository.RepoServerService grpc.start_time="2022-06-16T23:53:57Z" grpc.time_ms=1169.986 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="app details cache miss: e8cfde99f2fda24744617c820eb99b764317f289/&ApplicationSource{RepoURL:redacted,Path:.,TargetRevision:deploy-to-jake-220609125106,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:,redacted"
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git fetch origin --tags --force" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=d66f2
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git fetch origin --tags --force]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=614.714568
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git checkout --force e8cfde99f2fda24744617c820eb99b764317f289" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=1e603
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git checkout --force e8cfde99f2fda24744617c820eb99b764317f289]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=7.252842
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git clean -fdx" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=2d00e
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git clean -fdx]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=6.954976
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git rev-parse HEAD" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=b1498
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git rev-parse HEAD]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=2.348054
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="app details cache miss: e8cfde99f2fda24744617c820eb99b764317f289/&ApplicationSource{RepoURL:redacted,Path:.,TargetRevision:deploy-to-jake-220609125106,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{edactedr"
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetAppDetails grpc.request.deadline="2022-06-16T23:54:58Z" grpc.service=repository.RepoServerService grpc.start_time="2022-06-16T23:53:58Z" grpc.time_ms=802.348 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg="manifest cache miss: &ApplicationSource{redacted,Path:.,TargetRevision:deploy-to-jake-220609125106,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{redacted},},},Chart:,}/e8cfde99f2fda24744617c820eb99b764317f289"
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg="git fetch origin --tags --force" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=36796
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="sh -c date >> /tmp/plugin-discover-has-run; echo hello; exit 0" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=a7e65
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg=Trace args="[git fetch origin --tags --force]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=627.6216089999999
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=MatchRepository grpc.request.deadline="2022-06-16T23:54:59Z" grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2022-06-16T23:53:59Z" grpc.time_ms=19.279 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg="git checkout --force e8cfde99f2fda24744617c820eb99b764317f289" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=c17fa
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetPluginConfig grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2022-06-16T23:53:59Z" grpc.time_ms=0.053 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg=Trace args="[git checkout --force e8cfde99f2fda24744617c820eb99b764317f289]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=9.422208
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="Generating manifests with deadline 59.056128331s from now"
notfromstatefarm commented 2 years ago

Totally shooting from the hip here, but looking for references to GenerateManifest in the repo, there are only a few calls. One uses context.Background() instead of passing along an existing context. Could this be the missing link?

https://github.com/argoproj/argo-cd/blob/master/controller/state.go#L173

notfromstatefarm commented 2 years ago

Okay, I did some more debugging and realized the problem is that the API server timeout wasn't being set.

There are two timeout values - one for the application controller, and one for the API server. It appears the API server timeout is used on creates and the application controller timeout is used on syncs.

Bear with me; I'll try to make this as clear as possible.

I was originally setting the two values in argocd-cm as described at the bottom of https://argo-cd.readthedocs.io/en/stable/user-guide/config-management-plugins/

I also was setting --repo-server-timeout-seconds on the application controller via the Helm chart values. I mistakenly thought this did the trick.

However, it appears only the command line argument is taking effect. The values in argocd-cm aren't being used at all.

I found the documentation that shows --repo-server-timeout-seconds is also a parameter on the API server. Once I set that, everything is working as expected.

So, long story short: the values in argocd-cm aren't used at all; the timeout has to be set via --repo-server-timeout-seconds (or argocd-cmd-params-cm) on both the API server and the application controller.
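In sketch form, the fix means setting the flag on both components. This fragment is illustrative only: resource names follow the upstream install manifests, and a Helm chart may wire the args differently (e.g. via argocd-cmd-params-cm).

```yaml
# Sketch only: the flag must be set on BOTH components.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-server
spec:
  template:
    spec:
      containers:
        - name: argocd-server
          args:
            - argocd-server
            - --repo-server-timeout-seconds=600
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: argocd-application-controller
spec:
  template:
    spec:
      containers:
        - name: argocd-application-controller
          args:
            - argocd-application-controller
            - --repo-server-timeout-seconds=600
```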

crenshaw-dev commented 2 years ago

Dang. Incredible research @notfromstatefarm. I don't have time at the moment to put up a patch for the non-working ConfigMap, but I'm setting a reminder for myself to come back to this. Thanks for digging in!

sidewinder12s commented 2 years ago

I'm using the kustomize manifests with version 2.3.5 and I don't think it's a config issue.

I have:

server.repo.server.timeout.seconds: '180'

Set in my argocd-cmd-params-cm, and I can see on the argocd-server that it appears to be pulled in and set correctly:

            - name: ARGOCD_SERVER_REPO_SERVER_TIMEOUT_SECONDS
              valueFrom:
                configMapKeyRef:
                  key: server.repo.server.timeout.seconds
                  name: argocd-cmd-params-cm
                  optional: true

Also tried it directly with the flag set and it still doesn't appear to be respecting it.

            - '--repo-server-timeout-seconds'
            - '180'
{"error":"rpc error: code = InvalidArgument desc = application spec for app-name is invalid: InvalidSpecError: Unable to generate manifests in my/app/location: rpc error: code = Canceled desc = context canceled","grpc.code":"InvalidArgument","grpc.method":"Create","grpc.service":"application.ApplicationService","grpc.start_time":"2022-07-13T00:00:12Z","grpc.time_ms":60000.55,"level":"info","msg":"finished unary call with code InvalidArgument","span.kind":"server","system":"grpc","time":"2022-07-13T00:01:12Z"}

Interestingly, this was the output when I had argocd-cmd-params-cm set and also tried the CLI flag but got the formatting wrong. It indicates the binary registers the parameter at init but doesn't appear to be plumbing it everywhere it needs to be:

Error: unknown flag: --repo-server-timeout-seconds 180
Usage:
  argocd-server [flags]
  argocd-server [command]

unknown flag: --repo-server-timeout-seconds 180
Available Commands:
  completion  generate the autocompletion script for the specified shell
  help        Help about any command
  version     Print version information

Flags:
<omitted>
      --repo-server-timeout-seconds int               Repo server RPC call timeout seconds. (default 180)
notfromstatefarm commented 2 years ago

@sidewinder12s The issue in my case was that I was correctly setting --repo-server-timeout-seconds on the argocd-server, but not on the argocd-application-controller. You need to set both server.repo.server.timeout.seconds and controller.repo.server.timeout.seconds in argocd-cmd-params-cm.

I was getting confused because I was using a Helm chart that wasn't using this ConfigMap, and the documentation makes it sound like you should be using argocd-cm and not argocd-cmd-params-cm.

sidewinder12s commented 2 years ago

Ok. It actually appears that I was hitting the AWS ALB idle timeout of 60s.

Once I bumped that, the 504s/context-canceled errors stopped. It's a bit odd that it propagated through like that, but I guess the server holds the downstream connection open until it gets a success from the repo-server.

robertLichtnow commented 1 year ago

I was having the same issue with a stack of EKS, ALB Ingress Controller and this ArgoCD Helm Chart. Here's what I had to do to make plugins that took a little too long to run work:

  1. Add the value idle_timeout.timeout_seconds=360 to the alb.ingress.kubernetes.io/load-balancer-attributes annotation in the ArgoCD Ingress resource.
  2. Set the controller.repo.server.timeout.seconds and server.repo.server.timeout.seconds values to 300 in the argocd-cmd-params-cm ConfigMap.
  3. Add --repo-server-timeout-seconds=300 and --request-timeout=300s to ArgoCD's server initialization args (this was for the server, not the repo-server).
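The steps above can be sketched as manifest fragments. Values are illustrative; the annotation syntax is the AWS Load Balancer Controller's, and step 3 is applied to the argocd-server container args as shown earlier in the thread.

```yaml
# Step 1: raise the ALB idle timeout above the longest manifest generation.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server
  annotations:
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=360
---
# Step 2: raise both repo-server timeouts (step 3 adds the matching
# --repo-server-timeout-seconds=300 and --request-timeout=300s args on
# the argocd-server container).
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
data:
  controller.repo.server.timeout.seconds: "300"
  server.repo.server.timeout.seconds: "300"
```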