argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Getting a lot of "error reading from server: EOF" #17546

Open sethgupton-mastery opened 4 months ago

sethgupton-mastery commented 4 months ago

Describe the bug

We are seeing a lot of "error reading from server: EOF" errors: 1.92k in a single hour. We have a large instance with 9000+ Applications, which may be a contributing factor. ArgoCD runs in one cluster and controls 17 other clusters. We have 17 controllers. We have server.k8sclient.retry.max: "3" set.

Full error message: ComparisonError: Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = error reading from server: EOF

Since we thought it might be related to scaling I stopped by the SIG Scalability meeting and Andrew Lee suggested I create an issue to track this.

He also thought it might be the control plane being overloaded but infrastructure took a look and said the control plane of the ArgoCD cluster looked fine.

To Reproduce

Expected behavior

Have fewer errors.

Screenshots

Graph of the errors over the last month. 😬 [image]

Error as shown in the UI: [image]

Version

argocd-server: v2.10.1+a79e0ea
  BuildDate: 2024-02-14T17:37:43Z
  GitCommit: a79e0eaca415461dc36615470cecc25d6d38cefb
  GitTreeState: clean
  GoVersion: go1.21.3
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.2.1 2023-10-19T20:13:51Z
  Helm Version: v3.14.0+g3fc9f4b
  Kubectl Version: v0.26.11
  Jsonnet Version: v0.20.0

Logs


11:00:08.380
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T15:57:38Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = error reading from server: EOF\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 092f2d18799cea1017442cddba67471acdb42d0f052aba26447b1c909dc5787c cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: dc39a721c7cd15e0d21abe2ef44b983470fbddc21293d99e555edf798a156a55 host: aks-generalb-37778483-vmss00000h labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-79b958f585 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-2 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 7bbd7a70-ef05-41a9-a98b-ef56ae2e605f pod_name: argocd-application-controller-2 stream: stderr time: 2024-03-15T16:00:08.380704895Z timestamp: 1710518408380

11:00:08.381
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T16:00:08Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.0.253.32:8081: connect: connection refused\\\"\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 092f2d18799cea1017442cddba67471acdb42d0f052aba26447b1c909dc5787c cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: 02b66a64fef3c748aa76a317617e93323e205193eada0e4bd4af1034f9e1895c host: aks-generalb-37778483-vmss000001 labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-79b958f585 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-9 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 0da033c0-8bec-41e5-969f-e894c99353ed pod_name: argocd-application-controller-9 stream: stderr time: 2024-03-15T16:00:08.3814418Z timestamp: 1710518408381

11:00:08.394
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T15:59:13Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = error reading from server: EOF\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 404e231545d2b5ddeeec5c5db3ef9230f1e6ca442d95dd2788c1f5ebb084582e cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: f6be1e20747983eca459b66832dcef4dcb9564600cbf196e18d50f533c197298 host: aks-systemd-31993041-vmss000000 labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-ddbbd89c5 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-13 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 4bd600cd-c276-4670-a605-4f3ad265fad8 pod_name: argocd-application-controller-13 stream: stderr time: 2024-03-15T16:00:08.394942606Z timestamp: 1710518408394

11:00:08.401
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T16:00:07Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.0.253.32:8081: connect: connection refused\\\"\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 092f2d18799cea1017442cddba67471acdb42d0f052aba26447b1c909dc5787c cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: dc39a721c7cd15e0d21abe2ef44b983470fbddc21293d99e555edf798a156a55 host: aks-generalb-37778483-vmss00000h labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-79b958f585 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-2 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 7bbd7a70-ef05-41a9-a98b-ef56ae2e605f pod_name: argocd-application-controller-2 stream: stderr time: 2024-03-15T16:00:08.401300838Z timestamp: 1710518408401

11:00:08.411
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T15:58:44Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = error reading from server: EOF\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 404e231545d2b5ddeeec5c5db3ef9230f1e6ca442d95dd2788c1f5ebb084582e cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: f6be1e20747983eca459b66832dcef4dcb9564600cbf196e18d50f533c197298 host: aks-systemd-31993041-vmss000000 labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-ddbbd89c5 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-13 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 4bd600cd-c276-4670-a605-4f3ad265fad8 pod_name: argocd-application-controller-13 stream: stderr time: 2024-03-15T16:00:08.411189232Z timestamp: 1710518408411

11:00:08.433
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T16:00:08Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.0.253.32:8081: connect: connection refused\\\"\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 092f2d18799cea1017442cddba67471acdb42d0f052aba26447b1c909dc5787c cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: 3e6156ed03b25135d36f7b047d7daa90e3afc3de2b18d4d2dc25bbea3c3f3f9d host: aks-generalb-37778483-vmss000001 labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-79b958f585 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-8 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 154a813d-6786-4b40-8ae0-05e817637ba5 pod_name: argocd-application-controller-8 stream: stderr time: 2024-03-15T16:00:08.43332946Z timestamp: 1710518408433

11:00:08.434
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T16:00:07Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.0.253.32:8081: connect: connection refused\\\"\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: 092f2d18799cea1017442cddba67471acdb42d0f052aba26447b1c909dc5787c cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: 9b168c2a588ba5bcc8203c34d67f9f58a4beb5ab5b1c7218f6a67cab9f9e1a12 host: aks-generalb-37778483-vmss00000h labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-79b958f585 labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-3 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 327c563c-706a-49ef-b7c8-163a4e80a0fb pod_name: argocd-application-controller-3 stream: stderr time: 2024-03-15T16:00:08.434385725Z timestamp: 1710518408434

11:00:08.531
time="2024-03-15T16:00:08Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-03-15T16:00:08Z\",\"message\":\"Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unavailable desc = connection error: desc = \\\"transport: Error while dialing: dial tcp 10.0.253.32:8081: connect: connection refused\\\"\",\"type\":\"ComparisonError\"}]}}" application=<snip>
_p: F annotations.checksum/cm: 26439947f5da3e5307e31fe8e9c0b4ab7213e087c5524273e4d3a2a527ce0f71 annotations.checksum/cmd-params: b83df4e1a4a70678015c4cb9d2a0aed37be0484994fda145e22693026b7d595a cluster_name: aks-moy23-prod container_hash: quay.io/argoproj/argocd@sha256:5f1de1b4d959868c1e006e08d46361c8f019d9730e74bc1feeab8c7b413f1187 container_image: quay.io/argoproj/argocd:v2.10.1 container_name: application-controller docker_id: 82d624292518d223432d71b024bcaf4d00273418bab4bd4689b37af91bcbec92 host: aks-thanossgb-42326671-vmss000000 labels.app.kubernetes.io/component: application-controller labels.app.kubernetes.io/instance: argocd labels.app.kubernetes.io/managed-by: Helm labels.app.kubernetes.io/name: argocd-application-controller labels.app.kubernetes.io/part-of: argocd labels.app.kubernetes.io/version: v2.10.1 labels.controller-revision-hash: argocd-application-controller-c98d4dcdd labels.helm.sh/chart: argo-cd-6.0.14 labels.statefulset.kubernetes.io/pod-name: argocd-application-controller-15 namespace_name: argocd newrelic.logPattern: nr.DID_NOT_MATCH newrelic.source: api.logs plugin.source: kubernetes plugin.type: fluent-bit plugin.version: 1.19.0 pod_id: 1385b5c3-6806-4f95-8f3a-279b91b4069c pod_name: argocd-application-controller-15 stream: stderr time: 2024-03-15T16:00:08.531596107Z timestamp: 1710518408531
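
The excerpt above mixes two different failures that both surface as "code = Unavailable": "error reading from server: EOF" and "connection refused" when dialing 10.0.253.32:8081. To see how they split, a minimal Python sketch along these lines could tally them from exported controller log lines. The function names, the category names, and the assumption that every real log entry starts with "time=" are illustrative only and not part of Argo CD.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Substrings taken from the log excerpt; the category names are made up for this sketch.
PATTERNS = {
    "eof": re.compile(r"error reading from server: EOF"),
    "connection_refused": re.compile(r"dial tcp \S+: connect: connection refused"),
}


def classify(line: str) -> Optional[str]:
    """Return the failure category reported by one log line, or None."""
    for name, pattern in PATTERNS.items():
        if pattern.search(line):
            return name
    return None


def tally(lines: Iterable[str]) -> Counter:
    """Count failures per category, looking only at lines that start with time=."""
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith("time="):
            continue  # skip log-viewer timestamps and attribute lines
        kind = classify(line)
        if kind is not None:
            counts[kind] += 1
    return counts
```

Passing an open log file to tally() returns a Counter keyed by category. For the eight entries pasted above, that would be three EOF errors and five connection-refused errors.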

andklee commented 4 months ago

Which component was generating those errors?

Was it the repo-server?

andklee commented 4 months ago

Also, do you have details about your app repos? Are they on GitHub or some kind of private Git server?

sethgupton-mastery commented 4 months ago

> Which component was generating those errors?
>
> Was it the repo-server?

application-controller

> Also, do you have details about your app repos? Are they on GitHub or some kind of private Git server?

GitHub

We are using a monorepo pattern with webhooks, although we did notice last week that the webhooks have been timing out with 504s. I have not had any time to dig into that yet.

andklee commented 4 months ago

Yeah, port 8081 on the address you are dialing ("Error while dialing: dial tcp 10.0.253.32:8081") is the repo server.

So it looks like the app controller is having trouble calling the repo server when trying to get the manifests of an application.
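
Since "connection refused" means nothing accepted the TCP connection on that port at that moment, a quick reachability probe from inside the cluster can help tell a repo server that is down or restarting from one that is up but slow. A minimal Python sketch, assuming it runs somewhere that can reach the repo server; the address and port are the ones from the log line, and the function name is hypothetical:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts alike.
        return False
```

For example, can_connect("10.0.253.32", 8081) probes the address from the error. A successful connect only shows that something is listening; it would not catch the EOF case, where the connection is opened and then dropped before a response arrives.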

That makes me think there is an issue communicating with GitHub. Can you also get logs from your repo server?

Also you might set the log level to debug for a short time (might require restarting the repo server):

Config map: argocd-cmd-params-cm
reposerver.log.level: "debug"

Brett-Nugent commented 4 months ago

So it looks like GH webhooks only wait 10 seconds for a 2XX response before closing the connection. We see a lot of 499 errors in our nginx logs, and we believe the bulk of these are caused by the webhook closing its connection before receiving a response from Argo.
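
nginx logs status 499 when the client closes the connection before nginx has sent a response, which fits a webhook sender giving up after its timeout. To check how many of the 499s hit the webhook endpoint rather than the UI or API, a rough Python sketch could group them by request path. It assumes access-log lines that contain the quoted request followed by the status code, as in nginx's combined format; the function name is hypothetical, and the pattern would need adjusting for a custom ingress log format.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the quoted request line and the status code that follows it,
# e.g. "POST /api/webhook HTTP/1.1" 499
REQUEST_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_499_by_path(lines: Iterable[str]) -> Counter:
    """Count client-closed (499) requests per path, ignoring query strings."""
    counts: Counter = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "499":
            path = match.group("path").split("?", 1)[0]
            counts[path] += 1
    return counts
```

Calling count_499_by_path() on an open access-log file and then most_common() on the result lists the paths with the most client-closed requests first.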

Brett-Nugent commented 4 months ago

We've seen significant improvement after making the following changes:

The RPC errors have essentially gone away, but we still have some concerns we're addressing. These repo-servers need A LOT of compute. We continue to see 499 errors in our ingress-nginx logs that seem to come from a combination of the GH webhook, client interaction with the UI, and, we think, the application controller. We're still digging into that. Our next round of updates aimed at improving performance includes:

miguelofbc commented 1 month ago

We are also seeing similar issues very often within our infrastructure. Unfortunately we cannot share many details, but is anyone looking into this at the moment? Is there an expected timeline for a resolution?

Enclavet commented 1 month ago

> We are also seeing similar issues very often within our infrastructure. Unfortunately we cannot share many details, but is anyone looking into this at the moment? Is there an expected timeline for a resolution?

The original problem came down to the repo-server being too busy to service requests triggered from GitHub (the webhooks, per the earlier comments). It seems they made some configuration changes to lower the load on the repo-server, which reduced these errors. You will need to make similar changes if you have the same problem.