notfromstatefarm opened this issue 2 years ago
Do you have any way of inferring the source of the repo-server GenerateManifests call which is using the too-short timeout? i.e. can you correlate it with the controller and API server logs to determine which one is sending the wrong timeout? (The plugin respects whatever timeout the caller specifies, adding a small buffer so it can send a response before the caller quits listening.)
I just realized that the 60s timeout occurs when we call argocd app create --upsert, and the correct timeout is used when we subsequently call argocd app sync.

Basically, every time we deploy from our CI, we upsert the application and then trigger a sync. I didn't realize manifest generation happened on syncs as well; maybe we shouldn't be syncing after an upsert. Nonetheless, the timeout for the create step isn't being respected.
As far as I can tell the only thing making calls to argocd-repo-server is argocd-server. Not sure where to look, exactly.
Looking at the API server code, I don't see any direct calls to the repo server on Create/Upsert. So I think it's triggering some behavior in the controller that calls the repo server (a different path from what happens on sync).
Unfortunately, I don't have much knowledge of the controller code. Would you be able to bump the controller log level to debug and post, say, the 5 seconds of logs before the incorrect timeout is logged?
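(If it helps, I believe the controller log level can be bumped via argocd-cmd-params-cm followed by a controller restart; the key name below is from memory, so please double-check it against the argocd-cmd-params-cm reference for your version.)

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
data:
  # assumed key name; verify against the argocd-cmd-params-cm reference for your ArgoCD version
  controller.log.level: "debug"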
I had a theory it might be because the Helm chart I'm using only configures the --repo-server-timeout arg on the application controller, but then I tried configuring the argocd-cmd-params-cm ConfigMap manually, restarted all the pods, and still got the same result.
I logged the application controller at debug level and nothing was logged until the manifests returned successfully and the application was created. The "Generating manifests with deadline 59.068003034s from now" line was printed long before the application controller showed anything.

Maybe I did something wrong with the ConfigMap and the server's repo-server timeout isn't being applied, but the application controller's is, thanks to the arg?
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  labels:
    app.kubernetes.io/name: argocd-cmd-params-cm
    app.kubernetes.io/part-of: argocd
data:
  server.repo.server.timeout.seconds: "600"
  controller.repo.server.timeout.seconds: "600"
Bummer that the controller logs aren't more verbose. Or maybe it's not actually involved.
Could you bump the API server logs to debug and see if there are any messages that seem related to the repo-server call?
Nothing there either, I'm afraid. Here's the entirety of the logs from my ArgoCD namespace when I run the argocd app create command, up to the point where the timeout is displayed. This is with debug mode enabled on argocd-server.
[argo-argocd-server-67cd44d6b-pscsw] time="2022-06-16T23:53:56Z" level=info msg="received unary call /application.ApplicationService/Create" grpc.method=Create grpc.request.claims="redacted" grpc.service=application.ApplicationService grpc.start_time="2022-06-16T23:53:56Z" span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=TestRepository grpc.request.deadline="2022-06-16T23:54:57Z" grpc.service=repository.RepoServerService grpc.start_time="2022-06-16T23:53:57Z" grpc.time_ms=1169.986 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="app details cache miss: e8cfde99f2fda24744617c820eb99b764317f289/&ApplicationSource{RepoURL:redacted,Path:.,TargetRevision:deploy-to-jake-220609125106,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{Name:,redacted"
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git fetch origin --tags --force" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=d66f2
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git fetch origin --tags --force]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=614.714568
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git checkout --force e8cfde99f2fda24744617c820eb99b764317f289" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=1e603
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git checkout --force e8cfde99f2fda24744617c820eb99b764317f289]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=7.252842
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git clean -fdx" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=2d00e
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git clean -fdx]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=6.954976
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="git rev-parse HEAD" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=b1498
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg=Trace args="[git rev-parse HEAD]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=2.348054
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="app details cache miss: e8cfde99f2fda24744617c820eb99b764317f289/&ApplicationSource{RepoURL:redacted,Path:.,TargetRevision:deploy-to-jake-220609125106,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{edactedr"
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:58Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetAppDetails grpc.request.deadline="2022-06-16T23:54:58Z" grpc.service=repository.RepoServerService grpc.start_time="2022-06-16T23:53:58Z" grpc.time_ms=802.348 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg="manifest cache miss: &ApplicationSource{redacted,Path:.,TargetRevision:deploy-to-jake-220609125106,Helm:nil,Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:&ApplicationSourcePlugin{redacted},},},Chart:,}/e8cfde99f2fda24744617c820eb99b764317f289"
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg="git fetch origin --tags --force" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=36796
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="sh -c date >> /tmp/plugin-discover-has-run; echo hello; exit 0" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=a7e65
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg=Trace args="[git fetch origin --tags --force]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=627.6216089999999
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=MatchRepository grpc.request.deadline="2022-06-16T23:54:59Z" grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2022-06-16T23:53:59Z" grpc.time_ms=19.279 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg="git checkout --force e8cfde99f2fda24744617c820eb99b764317f289" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b execID=c17fa
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=GetPluginConfig grpc.service=plugin.ConfigManagementPluginService grpc.start_time="2022-06-16T23:53:59Z" grpc.time_ms=0.053 span.kind=server system=grpc
[argo-argocd-repo-server-64dfc86bb4-d9fjq repo-server] time="2022-06-16T23:53:59Z" level=info msg=Trace args="[git checkout --force e8cfde99f2fda24744617c820eb99b764317f289]" dir=/tmp/_argocd-repo/3598c3f8-2355-472e-873f-dec5bbe58d9b operation_name="exec git" time_ms=9.422208
[argo-argocd-repo-server-64dfc86bb4-d9fjq cobalt-infra] time="2022-06-16T23:53:59Z" level=info msg="Generating manifests with deadline 59.056128331s from now"
Totally shooting from the hip here, but looking for references to GenerateManifest in the repo, there are only a few calls. One uses context.Background() instead of passing along an existing context. Could this be the missing link?
https://github.com/argoproj/argo-cd/blob/master/controller/state.go#L173
Okay, I did some more debugging and realized the problem is that the API server timeout wasn't being set.
There are two timeout values - one for the application controller, and one for the API server. It appears the API server timeout is used on creates and the application controller timeout is used on syncs.
Bear with me; I'll try to make this as clear as possible.
I was originally setting the two values in argocd-cm, as described at the bottom of https://argo-cd.readthedocs.io/en/stable/user-guide/config-management-plugins/. I was also setting --repo-server-timeout-seconds on the application controller via the Helm chart values. I mistakenly thought this did the trick.

However, it appears only the command line argument is taking effect. The values in argocd-cm aren't being used at all.

I then found documentation showing that --repo-server-timeout-seconds is also a parameter on the API server. Once I set that, everything works as expected.
SO.. Long story short:
1. The documentation says to set these values in argocd-cm: this does not appear to be the case. I'm pretty sure it's a typo for argocd-cmd-params-cm, since these same values are in there.
2. The Helm chart doesn't provide a way to set --repo-server-timeout-seconds for the API server. This obviously is an issue for the helm chart repo and not this one, so I'll open that tomorrow. Workaround is using extraArgs (see the sketch below).

Dang. Incredible research @notfromstatefarm. I don't have time at the moment to put up a patch for the non-working ConfigMap, but I'm setting a reminder for myself to come back to this. Thanks for digging in!
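For anyone hitting this before the chart is patched, here is a minimal sketch of the extraArgs workaround mentioned above. It assumes the community argo-cd Helm chart exposes a server.extraArgs list, so check your chart version's values.yaml:

# values.yaml for the argo-cd Helm chart (field names assumed; adjust for your chart version)
server:
  extraArgs:
    - --repo-server-timeout-seconds=600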
I'm using the kustomize manifests with version 2.3.5 and I don't think it's a config issue.
I have:
server.repo.server.timeout.seconds: '180'
set in my argocd-cmd-params-cm, and I can see on the argocd-server deployment that it appears to be pulled in and set correctly:
- name: ARGOCD_SERVER_REPO_SERVER_TIMEOUT_SECONDS
  valueFrom:
    configMapKeyRef:
      key: server.repo.server.timeout.seconds
      name: argocd-cmd-params-cm
      optional: true
I also tried setting the flag directly, and it still doesn't appear to be respected:
- '--repo-server-timeout-seconds'
- '180'
{"error":"rpc error: code = InvalidArgument desc = application spec for app-name is invalid: InvalidSpecError: Unable to generate manifests in my/app/location: rpc error: code = Canceled desc = context canceled","grpc.code":"InvalidArgument","grpc.method":"Create","grpc.service":"application.ApplicationService","grpc.start_time":"2022-07-13T00:00:12Z","grpc.time_ms":60000.55,"level":"info","msg":"finished unary call with code InvalidArgument","span.kind":"server","system":"grpc","time":"2022-07-13T00:01:12Z"}
Interestingly, this was the output from argocd-server when I had argocd-cmd-params-cm set, also tried to set it with the CLI flag, and got the flag formatting wrong. It indicates the binary is picking up the parameter correctly on init but doesn't appear to be plumbing it through everywhere it needs to go:
Error: unknown flag: --repo-server-timeout-seconds 180
Usage:
argocd-server [flags]
argocd-server [command]
unknown flag: --repo-server-timeout-seconds 180
Available Commands:
completion generate the autocompletion script for the specified shell
help Help about any command
version Print version information
Flags:
<omitted>
--repo-server-timeout-seconds int Repo server RPC call timeout seconds. (default 180)
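(A note for anyone copying the above: that error comes from the flag and its value being passed as a single argument. A minimal sketch of the two working forms in the argocd-server container spec, with the surrounding Deployment fields omitted:)

containers:
  - name: argocd-server
    args:
      # either pass the flag and value as separate list items...
      - --repo-server-timeout-seconds
      - "180"
      # ...or use the single-argument '=' form
      # - --repo-server-timeout-seconds=180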
@sidewinder12s The issue in my case was that I was correctly setting --repo-server-timeout-seconds on the argocd-server, but not on the argocd-application-controller. You need to set both server.repo.server.timeout.seconds and controller.repo.server.timeout.seconds in argocd-cmd-params-cm.

I was getting confused because I was using a Helm chart that wasn't using this ConfigMap, and the documentation makes it sound like you should be using argocd-cm and not argocd-cmd-params-cm.
Ok. It actually appears that I was hitting the AWS ALB idle timeout of 60s. Once I bumped that, I got no more 504s/context-canceled errors. It's kind of weird that it propagated through like that, but I guess the server holds the downstream connection open until it gets a success from the repo-server.
I was having the same issue with a stack of EKS, the ALB Ingress Controller, and this ArgoCD Helm chart. Here's what I had to do to make plugins that take a little too long to run work:
1. Added idle_timeout.timeout_seconds=360 to the alb.ingress.kubernetes.io/load-balancer-attributes annotation on the ArgoCD Ingress resource (see the sketch after this list).
2. Set the controller.repo.server.timeout.seconds and server.repo.server.timeout.seconds values to 300 in the argocd-cmd-params-cm ConfigMap.
3. Added --repo-server-timeout-seconds=300 and --request-timeout=300s to ArgoCD's server initialization args.
   a. This was for the server, not the repo-server.
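For reference, here is a minimal sketch of the Ingress change from step 1; the resource name is a placeholder, and the annotation key and value are the ones used above:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: argocd-server   # placeholder; use your existing ArgoCD Ingress
  annotations:
    # keep the ALB idle timeout above the longest expected manifest generation time
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=360
# spec omitted; only the annotation needs to change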
Describe the bug
Our manifests take a little over a minute to generate. We've increased the timeouts as follows:
However, the deadline will randomly be the default 60s.
To Reproduce
Increase the timeouts, generate manifests.
Expected behavior
The timeout is always respected.
Version
Logs