sethgupton-mastery opened this issue 4 months ago
Which component was generating those errors? Was it the repo-server?

Also, do you have details about your App Repos? Are they GitHub or some kind of private Git server?
> Which component was generating those errors? Was it the repo-server?

application-controller

> Also, do you have details about your App Repos? Are they GitHub or some kind of private Git server?

GitHub
We are using the monorepo pattern with webhooks. We did notice last week that the webhooks have been timing out with 504s, but I have not had any time to dig into that yet.
Yeah, the server you are calling in `Error while dialing: dial tcp 10.0.253.32:8081` is the repo server (8081 is its gRPC port).
So it looks like the application controller is having trouble reaching the repo server when trying to get the manifests for an application.
That makes me think there is an issue communicating with GitHub. Wondering if you can also get logs from your repo server.
Also, you might set the log level to debug for a short time (this might require restarting the repo server). In the `argocd-cmd-params-cm` ConfigMap, set `reposerver.log.level: "debug"`.
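A minimal sketch of that change, assuming the stock ConfigMap name and the `argocd` namespace:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Temporary: drop back to "info" once you have captured enough logs.
  reposerver.log.level: "debug"
```

After applying it, restarting the repo-server deployment (e.g. a rollout restart) picks up the new level.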
So it looks like GH webhooks only wait 10 seconds for a 2XX response before closing the connection. We see a lot of 499 errors in our nginx logs, and we believe the bulk of these are attributable to the webhook closing its connection before receiving a response from Argo.
We've seen significant improvement after making the following change:

- `reconciliation.timeout` from 180s to 600s

The RPC errors have essentially gone away, but we still have some concerns we're addressing. There is A LOT of compute needed for these repo-servers. We continue to see 499 errors in our ingress-nginx logs that seem to come from a combination of the GH webhook, client interaction with the UI, and, we think, the application controller. We're still digging into that. Our next round of updates aimed at improving performance includes (see the sketch after this list):

- `ARGOCD_K8SCLIENT_RETRY_MAX` from 3 to 9
- `ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF` from 100ms (default) to 250ms
- `reconciliation.timeout` from 600s to 1200s with a 300s jitter
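A hedged sketch of where those knobs live in a stock install. Note the `argocd-cm` key is spelled `timeout.reconciliation` (with `timeout.reconciliation.jitter` for the jitter), and the retry env vars have assumed ConfigMap counterparts in `argocd-cmd-params-cm`; key names vary between Argo CD versions, so verify against the docs for yours:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  timeout.reconciliation: "600s"        # default is 180s; the poster later raised it to 1200s
  timeout.reconciliation.jitter: "300s" # spreads out reconciliation spikes
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Assumed ConfigMap forms of ARGOCD_K8SCLIENT_RETRY_MAX and
  # ARGOCD_K8SCLIENT_RETRY_BASE_BACKOFF (backoff is in milliseconds).
  controller.k8sclient.retry.max: "9"
  controller.k8sclient.retry.base.backoff: "250"
```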
We are also seeing similar issues very often within our infrastructure. Unfortunately we cannot share many details, but is anyone looking into this at the moment? Are there any expectations on a resolution?
The original problem had to do with the repo-server being too busy to service requests, in combination with GitHub webhook load. It seems they have made some configuration changes to lower the amount of load being placed on the repo-server and decrease these errors. You will need to make similar changes if you have the exact same problem.
Checklist:

- [ ] I've pasted the output of `argocd version`.

Describe the bug
We are seeing a lot of "error reading from server: EOF" errors; in a single hour we saw 1.92k of them. We have a large instance with 9000+ Applications, which may be a contributing factor. Argo CD runs from one cluster and manages 17 other clusters, and we run 17 controllers. We have `server.k8sclient.retry.max: "3"` set.
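For reference, that setting as it would sit in `argocd-cmd-params-cm` (a sketch; the key and value are quoted verbatim from above, the ConfigMap location is an assumption based on the stock manifests):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cmd-params-cm
  namespace: argocd
data:
  # Quoted verbatim from the description above; governs the API server's
  # Kubernetes client retries.
  server.k8sclient.retry.max: "3"
```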
Full error message
ComparisonError: `Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = error reading from server: EOF`
Since we thought it might be related to scaling I stopped by the SIG Scalability meeting and Andrew Lee suggested I create an issue to track this.
He also thought it might be the control plane being overloaded, but our infrastructure team took a look and said the control plane of the ArgoCD cluster looked fine.
To Reproduce
Expected behavior
Have fewer errors.
Screenshots
Graph of the errors over the last month. 😬

![image](https://github.com/argoproj/argo-cd/assets/77691887/a3a1964d-4f0d-439a-b677-f5cbd26065e7)
Error as shown in the UI:

![image](https://github.com/argoproj/argo-cd/assets/77691887/281d4b29-2906-4662-9f53-07a9b7f24c93)
Version
Logs