argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
16.99k stars 5.17k forks

i/o timeout errors in redis and argocd-repo-server PODs #4174

Open Cloud-Mak opened 3 years ago

Cloud-Mak commented 3 years ago

Hi All,

I'm exploring Argo CD. It's quite a neat project. I have deployed Argo CD on a Kubernetes 1.17 cluster (1 master, 2 workers) running on 3 LXD containers. Other components such as MetalLB, ingress, and Rancher work fine on this cluster.

For some reason, my Argo CD isn't working the expected way. I was able to get the Argo CD UI login working by using the bypass method in bug 4148 that I reported earlier.

Here are the services in the argocd namespace:

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/argocd-dex-server       ClusterIP   10.102.198.189   <none>        5556/TCP,5557/TCP,5558/TCP   25h
service/argocd-metrics          ClusterIP   10.104.80.68     <none>        8082/TCP                     25h
service/argocd-redis            ClusterIP   10.105.201.92    <none>        6379/TCP                     25h
service/argocd-repo-server      ClusterIP   10.98.76.94      <none>        8081/TCP,8084/TCP            25h
service/argocd-server           NodePort    10.101.169.46    <none>        80:32046/TCP,443:31275/TCP   25h
service/argocd-server-metrics   ClusterIP   10.107.61.179    <none>        8083/TCP                     25h

After I got the UI, I tried creating a new sample project from the GUI, and it failed. Below are the logs from argocd-server during that time:

time="2020-08-27T09:22:21Z" level=info msg="received unary call /repository.RepositoryService/List" grpc.method=List grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content= grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:22:21Z" span.kind=server system=grpc
time="2020-08-27T09:22:21Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:22:21Z" grpc.time_ms=0.318 span.kind=server system=grpc
time="2020-08-27T09:22:21Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=project.ProjectService grpc.start_time="2020-08-27T09:22:21Z" grpc.time_ms=3.441 span.kind=server system=grpc
time="2020-08-27T09:23:52Z" level=info msg="received unary call /repository.RepositoryService/ListApps" grpc.method=ListApps grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="repo:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" revision:\"HEAD\" " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:23:52Z" span.kind=server system=grpc
time="2020-08-27T09:26:39Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout\"" grpc.code=Unavailable grpc.method=ListApps grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:23:52Z" grpc.time_ms=167124.3 span.kind=server system=grpc
time="2020-08-27T09:28:16Z" level=info msg="Alloc=10005 TotalAlloc=1978587 Sys=71760 NumGC=257 Goroutines=158"
time="2020-08-27T09:28:31Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"y\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:28:31Z" span.kind=server system=grpc
2020/08/27 09:28:48 proto: tag has too few fields: "-"
time="2020-08-27T09:28:48Z" level=info msg="received unary call /application.ApplicationService/Create" grpc.method=Create grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="application:<TypeMeta:<kind:\"\" apiVersion:\"\" > metadata:<name:\"app1\" generateName:\"\" namespace:\"\" selfLink:\"\" uid:\"\" resourceVersion:\"\" generation:0 creationTimestamp:<0001-01-01T00:00:00Z> clusterName:\"\" > spec:<source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"yamls\" targetRevision:\"HEAD\" chart:\"\" > destination:<server:\"https://kubernetes.default.svc\" namespace:\"default\" > project:\"default\" > status:<sync:<status:\"\" comparedTo:<source:<repoURL:\"\" path:\"\" targetRevision:\"\" chart:\"\" > destination:<server:\"\" namespace:\"\" > > revision:\"\" > health:<status:\"\" message:\"\" > sourceType:\"\" summary:<> > > " grpc.service=application.ApplicationService grpc.start_time="2020-08-27T09:28:48Z" span.kind=server system=grpc
time="2020-08-27T09:31:11Z" level=info msg="received unary call /repository.RepositoryService/ListApps" grpc.method=ListApps grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="repo:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" revision:\"HEAD\" " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" span.kind=server system=grpc
time="2020-08-27T09:31:11Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"yamls\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" span.kind=server system=grpc
time="2020-08-27T09:31:29Z" level=info msg="finished unary call with code InvalidArgument" error="rpc error: code = InvalidArgument desc = application spec is invalid: InvalidSpecError: Unable to get app details: rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=InvalidArgument grpc.method=Create grpc.service=application.ApplicationService grpc.start_time="2020-08-27T09:28:48Z" grpc.time_ms=161011.11 span.kind=server system=grpc
time="2020-08-27T09:31:33Z" level=warning msg="finished unary call with code DeadlineExceeded" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=DeadlineExceeded grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:28:31Z" grpc.time_ms=182001.84 span.kind=server system=grpc
time="2020-08-27T09:31:33Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"y\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:33Z" span.kind=server system=grpc
time="2020-08-27T09:33:31Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout\"" grpc.code=Unavailable grpc.method=ListApps grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" grpc.time_ms=140004.14 span.kind=server system=grpc

I even tried creating the app the declarative way. I created a YAML file and applied the manifest with kubectl apply -f. This created an app visible in the GUI, but it was never deployed. The health status eventually became healthy, but the sync status remained unknown.

From the GUI, I can see the errors below under application conditions, one after another:

ComparisonError
rpc error: code = DeadlineExceeded desc = context deadline exceeded

ComparisonError
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout"

When I tried deleting the app from the GUI, it was stuck deleting, with the error below visible under events in the GUI:

DeletionError
dial tcp 10.105.201.92:6379: i/o timeout
Unable to load data: dial tcp 10.105.201.92:6379: i/o timeout
Unable to delete application resources: dial tcp 10.105.201.92:6379: i/o timeout

As of now, nothing is working for me in Argo CD. I am clueless as to what to do next.

jessesuen commented 3 years ago

Unable to delete application resources: dial tcp 10.105.201.92:6379: i/o timeout

This is an indication that Argo CD cannot talk to the k8s API server and I think this may be environmental. Can you confirm the application controller is able to reach the managed cluster's API server?

Cloud-Mak commented 3 years ago

Can you confirm the application controller is able to reach the managed cluster's API server?

Hi, thanks for the reply. Can you tell me how exactly to do that?

Cloud-Mak commented 3 years ago

Just asking for a method to verify, because I could kubectl exec into the application-controller pod. It uses a non-root user ("argocd") inside the pod. It's a Debian Buster container, where I can't install ping or even sudo (to install ping). The plan was to ping the kube API server IP (which is the K8s master IP) to see if there is communication between the two.

argocd@argocd-application-controller-d9d496bdc-hcv7t:~$ cat /etc/os-release 
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
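Since the pod image has no ping and no sudo, one workaround is bash's built-in /dev/tcp redirection, which needs no extra tools. This is only a sketch: the commented service IPs are examples taken from this thread, and it assumes bash is present in the image (it is in the Debian-based Argo CD containers shown above):

```shell
# TCP reachability probe for minimal containers (no ping/curl/nc needed).
# Relies on bash's /dev/tcp pseudo-device; 3-second timeout per attempt.
check_tcp() {
  local host="$1" port="$2"
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "OK: ${host}:${port} is reachable"
  else
    echo "FAIL: ${host}:${port} is not reachable"
    return 1
  fi
}

# Example targets (hypothetical; use the ClusterIPs from your own cluster):
# check_tcp 10.98.76.94 8081    # argocd-repo-server
# check_tcp 10.105.201.92 6379  # argocd-redis
```

Run it after `kubectl exec -it <pod> -- bash`; the same check works against the API server address shown by `kubectl get endpoints kubernetes`.
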

Leke-Ariyo commented 3 years ago

Hello, I have this same issue. How were you able to fix it?

XinCai commented 3 years ago

I have this same issue; how were you able to fix it?

vikas027 commented 3 years ago

I saw the same issue in v2.0.1 as well; restarting all the pods fixed it, but I am not sure what the cause is.

gtriggiano commented 3 years ago

I'm facing the same issue.

I tried to delete ns argocd && kubectl apply ... many times, trying with versions 2.0.0, 2.0.1, 2.0.2 and 2.0.3.

The result is always the same. It seems that the application-controller cannot connect to redis:

time="2021-06-08T22:51:22Z" level=info msg="Processing all cluster shards"
time="2021-06-08T22:51:22Z" level=info msg="appResyncPeriod=3m0s"
time="2021-06-08T22:51:22Z" level=info msg="Application Controller (version: v2.0.2+9a7b0bc, built: 2021-05-20T19:30:25Z) starting (namespace: argocd)"
time="2021-06-08T22:51:22Z" level=info msg="Starting configmap/secret informers"
time="2021-06-08T22:51:22Z" level=info msg="Configmap/secret informer synced"
time="2021-06-08T22:51:22Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-08T22:51:22Z" level=info msg="0xc000186a20 subscribed to settings updates"
time="2021-06-08T22:51:22Z" level=info msg="Starting clusterSecretInformer informers"
time="2021-06-08T22:51:23Z" level=info msg="Notifying 1 settings subscribers: [0xc000186a20]"
time="2021-06-08T22:51:23Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-08T22:51:42Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.70.220:6379: i/o timeout"
time="2021-06-08T22:52:13Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.70.220:6379: i/o timeout"
time="2021-06-08T22:52:33Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.70.220:6379: i/o timeout"
time="2021-06-08T22:52:53Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.70.220:6379: i/o timeout"
time="2021-06-08T22:53:13Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.70.220:6379: i/o timeout"
time="2021-06-08T22:53:33Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.70.220:6379: i/o timeout"

Restarting containers does not help.

I can confirm that I installed just the provided manifests for the mentioned versions. Moreover, argocd-redis:6379 is reachable from everywhere in the cluster (given you provide ingress to it with a NetworkPolicy) and works fine.

The apps are stuck in an unknown state, although some of them eventually become healthy.

I'm totally clueless about the possible root cause.

Until this afternoon I had a working installation of v2.0.2, which I deployed a few days ago as the last step of a migration path that started from v1.5.

I noticed in the last few days (through Prometheus/Grafana) that the application-controller was requiring 3x the RAM and CPU compared to what it used to require before I switched to v2. Initially I thought it could be legitimate behavior, possibly due to refactoring or new features. I eventually became suspicious when I noticed the longer refresh times for apps, looked into redis (which is deployed as an ephemeral container), and realized it was empty. Then I discovered the aforementioned logs in the application-controller.

I then decided to deploy v2.0.3, hoping it would solve the issue, but from that point onward Argo definitely ceased to work correctly.

Please help. Thanks.

gtriggiano commented 3 years ago

All applications have this error shown in the UI

(screenshot taken 2021-06-09 at 01:35:53, showing the error on every application)

I also tried to set --repo-server-timeout-seconds to values like 420 or 600, but had no success.

jrhoward commented 3 years ago

I had the same errors as @gtriggiano. I replaced the image tag for the redis deployment with 6.2.4 in the Helm chart (note: without the -alpine suffix), and those errors disappeared.

redis:
  image:
    tag: '6.2.4'

gtriggiano commented 3 years ago

Thanks for the hint @jrhoward, but unfortunately this did not solve the issue for me 😞

It seems that none of the Argo services can talk to redis.

argocd-server has logs like:

time="2021-06-09T09:42:15Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.248.24:6379: i/o timeout"

These are the startup logs of argocd-application-controller:

time="2021-06-09T09:40:06Z" level=info msg="Processing all cluster shards"
time="2021-06-09T09:40:06Z" level=info msg="appResyncPeriod=3m0s"
time="2021-06-09T09:40:06Z" level=info msg="Application Controller (version: v2.0.3+8d2b13d, built: 2021-05-27T17:38:37Z) starting (namespace: argocd)"
time="2021-06-09T09:40:06Z" level=info msg="Starting configmap/secret informers"
time="2021-06-09T09:40:06Z" level=info msg="Configmap/secret informer synced"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="0xc00097b1a0 subscribed to settings updates"
time="2021-06-09T09:40:06Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Starting clusterSecretInformer informers"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: drone)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
W0609 09:40:06.719086       1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0609 09:40:06.815954       1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-06-09T09:40:26Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:40:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:36Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:37Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
...

It also seems that the services cannot talk to each other. I found this in the argocd-server logs:

time="2021-06-09T08:58:15Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.43.55.98:8081: connect: connection refused\"" grpc.code=Unavailable grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2021-06-09T08:58:13Z" grpc.time_ms=2007.533 span.kind=server system=grpc

where 10.43.55.98 is the ClusterIP of the argocd-repo-server service.

I'm very puzzled.

jrhoward commented 3 years ago

I spoke too soon. The errors are back

jrhoward commented 3 years ago

OK, on delving deeper into my issue, it was actually an SDN issue. I'm running on bare metal. Machines could not reach CoreDNS if they were not on the same machine; if they were on the same machine, they couldn't reach the Redis server when it was on another machine. So it was a mixture of DNS lookup failures and network connectivity problems to Redis.

irizzant commented 3 years ago

In my case DNS was working fine and I was able to ping the Redis master; I'm also running k8s on bare metal with Kubespray.

I had no luck restarting the pods either, and since my ArgoCD is managed by ArgoCD itself, I decided to follow this procedure (read carefully before doing anything):

  1. Read very carefully https://argoproj.github.io/argo-cd/operator-manual/disaster_recovery/ and create a backup of ArgoCD (just in case)
  2. Delete the ArgoCD statefulsets and deployments with: kubectl -n argocd delete deployments,statefulsets --all
  3. Recreate the missing ArgoCD resources using the GitOps repo of the cluster
  4. Restore the ArgoCD status using the procedure described at point 1.

That brought back ArgoCD and the applications were left intact in my cluster.
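The backup/restore steps in that procedure can be sketched with the CLI's disaster-recovery commands. This assumes an Argo CD v2.x CLI, where the `argocd admin export`/`import` subcommands exist (on v1.x the equivalents lived in `argocd-util`), and the default `argocd` namespace:

```shell
# Sketch of the disaster-recovery backup/restore referenced above.
# Assumes the v2.x CLI and the default "argocd" install namespace.
backup_argocd() {
  # Export all Argo CD state (applications, projects, settings) to a file.
  argocd admin export -n argocd > argocd-backup.yaml
}

restore_argocd() {
  # Import the previously exported state back into the cluster.
  argocd admin import -n argocd - < argocd-backup.yaml
}
```

Verify the exact flags against the disaster-recovery page linked in step 1 for your version.
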

Regarding the error Failed to save clusters info, I have no clue about what the cause could be.

rexhsu1968 commented 3 years ago

I got the same error on v2.0.x too, and tried several versions, all with the same error. Finally I removed all of the NetworkPolicies, and everything started working without any error. My guess is that the NetworkPolicies restrict traffic by podSelector, but the controller and server connect to redis via the service IP on port 6379.
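For anyone wanting to reproduce this diagnostic: the policies shipped in the upstream install manifests can be listed and temporarily removed as below. The policy names are the ones reported elsewhere in this thread; deleting them opens traffic up, so treat this as a test, not a fix:

```shell
# Temporarily remove the Argo CD NetworkPolicies to test whether they are
# blocking redis/repo-server traffic. Diagnostic only: this opens traffic up.
drop_argocd_netpols() {
  kubectl -n argocd get networkpolicy
  kubectl -n argocd delete networkpolicy \
    argocd-repo-server-network-policy \
    argocd-server-network-policy \
    argocd-redis-network-policy
}
```

If connectivity returns afterwards, the CNI is (mis)enforcing the policies; reinstate them by re-applying the install manifest.
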

erkerb4 commented 3 years ago

I've experienced a similar issue, where removing the NetworkPolicy for redis temporarily restored the connectivity. I restored the NetworkPolicy, then restarted the CNI agents on nodes running redis and argocd-server (cilium in my case), and connectivity was restored. I'd be cautious before restarting the cni agents. There was a blip in service communication (as expected). Proceed with caution.

jmaaks commented 2 years ago

I had been struggling with similar problems this week getting ArgoCD working in a small Vagrant/VirtualBox environment. I switched from Flannel to Calico and everything just magically started working.

IgorGee commented 2 years ago

Yep, same here. Went from K3s Flannel to Calico and all issues are gone.

matteodamico commented 2 years ago

I have the same issue even though I have no NetworkPolicy running in the cluster. I'm running Argo in a minikube k8s installation.

logileifs commented 2 years ago

I had the same issue and the cause of the problem was network policies

nlamirault commented 2 years ago

I'm on a k3s ARM64 cluster and I've got the same error.

Like @jrhoward, I changed the Redis image:

redis:
  image:
    tag: '6.2.4'

Connection to the cluster works fine now.

azamafzaal commented 2 years ago

Hi, I am also facing the same problem. Any solution yet?

jrhoward commented 2 years ago

Given that there were so many different root causes, I don't believe the problem is with ArgoCD itself.

suseendare commented 2 years ago

Hi, I am also facing the same problem. Any new workarounds?

gimpiron commented 2 years ago

I would like to add, regarding argocd: I installed argocd with Helm, and this issue occurred when I misused the values file with this section:

server:
  extraArgs:
    - --insecure

After removing it, all my problems were gone. That's super weird, and impossible to figure out based on the redis error message. Hope this helps some of you.

sebandgo commented 2 years ago

Removing the Network Policies helped in my case. I'm using Weave with no other policies; the ArgoCD ones were the only NetworkPolicies there.

IT-Luka commented 2 years ago

In my case the problem was that terraform was overriding the default AWS EKS security groups (allow all), and so the server pod couldn't communicate with the redis pod. When I added the correct security groups everything started to work as expected.

To help diagnose this problem, use "kubectl logs pod/argocd-server...". Using that, I could see that the server pod was timing out when trying to connect to the redis pod, which helped me narrow it down.
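A quick way to scan for the timeout patterns described in this thread (a sketch; the deployment name matches the default install, so adjust for your setup):

```shell
# Scan recent argocd-server logs for the redis / repo-server timeout errors
# discussed in this thread.
check_argocd_timeouts() {
  kubectl -n argocd logs deploy/argocd-server --tail=500 \
    | grep -E "i/o timeout|TransientFailure|DeadlineExceeded"
}
```

Repeat against `deploy/argocd-repo-server` and the application-controller to see which component is affected.
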

przemolb commented 2 years ago

We don't use Network Policies (yet), but this error occurs in our installation, and we haven't installed ArgoCD with Helm (just a simple kubectl apply ...). Any idea how to fix this?

EdwinWalela commented 2 years ago

Our cluster uses Weave Net as its CNI. I resolved the issue by deleting and reapplying the Weave Net CRDs:

delete

kubectl delete -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

apply

kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"

We recently reconfigured our local DNS, and I believe that could have been the cause of the problem.

ruben-herold commented 2 years ago

running into the same problem on a fresh cluster:

export KUBECONFIG=/etc/kubernetes/admin.conf

Create an IPPool for Argo:

apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: argcd-ipv6-ippool
spec:
  allowedUses:
    - Workload
    - Tunnel
  blockSize: 122
  cidr: 2a00:a000:1002:12::/64
  ipipMode: Never
  nodeSelector: all()
  vxlanMode: Never

kubectl create namespace argocd
kubectl annotate namespace argocd cni.projectcalico.org/ipv6pools='["argcd-ipv6-ippool"]'

then:

kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.3.3/manifests/install.yaml

kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d; echo ""

after this I could login into argocd but the cluster was not connected so I did:

Install the argocd CLI, then:

export KUBECONFIG=/etc/kubernetes/admin.conf
argocd --insecure login argocd-server.argocd.svc.k8s01.rcs-networks.com:443
argocd cluster add kubernetes-admin@kubernetes --in-cluster --upsert -y

After this the cluster is connected and shows the version number and so on, but if I try to add an application:

export KUBECONFIG=/etc/kubernetes/admin.conf
./argocd app create guestbook --repo https://github.com/argoproj/argocd-example-apps.git --path guestbook --dest-namespace default --dest-server https://kubernetes.default.svc/ --directory-recurse

I get:

FATA[0060] rpc error: code = InvalidArgument desc = application spec for guestbook is invalid: InvalidSpecError: repository not accessible: rpc error: code = DeadlineExceeded desc = context deadline exceeded

The repo is reachable. If I do:

kubectl -n argocd exec --stdin --tty argocd-repo-server-5569c7b657-2sj98 -- /bin/sh
cd /tmp
git clone https://github.com/argoproj/argocd-example-apps.git

it works

After deleting all NetworkPolicies from the argocd namespace, it was running fine...

ruben-herold commented 2 years ago

Hi, I did some testing, deleting all the policies step by step. The policy whose deletion resolved my problem was argocd-repo-server-network-policy.

ihatemodels commented 2 years ago

Hi, I did some testing, deleting all the policies step by step. The policy whose deletion resolved my problem was argocd-repo-server-network-policy.

This works for me as well. Thank you!

cmstack commented 2 years ago

I had to delete 2 network policies for this to work on my cluster:

  1. argocd-repo-server-network-policy
  2. argocd-server-network-policy

The first resolved this error: Unable to create application: application spec for nginx-web is invalid: InvalidSpecError: repository not accessible: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.107.52.220:8081: i/o timeout"

The second resolved this error: Unable to create application: application spec for nginx-web is invalid: InvalidSpecError: Unable to get app details: rpc error: code = DeadlineExceeded desc = context deadline exceeded

ZeroDeth commented 2 years ago

In my case the problem was that terraform was overriding the default AWS EKS security groups (allow all), and so the server pod couldn't communicate with the redis pod. When I added the correct security groups everything started to work as expected.

To help yourself diagnose this problem use "kubectl logs pod/argocd-server...". Using that I could see that the server pod was timing out when trying to connect to the redis pod, and that helped me narrow it down.

Solved it. Thanks

glyhood commented 2 years ago

I've experienced a similar issue, where removing the NetworkPolicy for redis temporarily restored the connectivity. I restored the NetworkPolicy, then restarted the CNI agents on nodes running redis and argocd-server (cilium in my case), and connectivity was restored. I'd be cautious before restarting the cni agents. There was a blip in service communication (as expected). Proceed with caution.

This worked for me.


wbarnard81 commented 2 years ago

My setup: HA K3s cluster v1.23.6+k3s1 (3 servers & 12 workers), MetalLB v0.12.1, Calico CNI, Longhorn.

I did a new deploy of ArgoCD but was unable to add my repositories. I'd been struggling with this for the past 2 days, and then I came across this post. I checked the logs and I am also getting the timeout errors:

time="2022-05-12T09:17:35Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.231.95:6379: i/o timeout"
time="2022-05-12T09:18:35Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.231.95:6379: i/o timeout"

I have tried restarting the pods and moving them to other workers, but nothing worked.

When I delete this network policy

argocd-repo-server-network-policy

Then my repos connect, but I still cannot add an application in the UI. I am also still getting the timeout errors on argocd-server...

time="2022-05-12T09:32:56Z" level=warning msg="getConnectionState cache set error git@xxxxx:xxxxx/guestbook.git: dial tcp 10.43.231.95:6379: i/o timeout"
time="2022-05-12T09:32:56Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=repository.RepositoryService grpc.start_time="2022-05-12T09:32:53Z" grpc.time_ms=2997.756 span.kind=server system=grpc
time="2022-05-12T09:33:36Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.231.95:6379: i/o timeout"

Deleting the network policy:

argocd-server-network-policy

Allows me now to create an application through the UI. Also no more timeout errors.

Edit: Had to delete this policy as well, for the UI to show the application correctly.

argocd-redis-network-policy

VamKon commented 2 years ago

I've got the same issue on AKS v1.21.7. I had to remove all the Argo NetworkPolicies for it to work. I would like to keep them, though; further research is needed.

mauricemojito commented 2 years ago

In my case the problem was that terraform was overriding the default AWS EKS security groups (allow all), and so the server pod couldn't communicate with the redis pod. When I added the correct security groups everything started to work as expected.

To help yourself diagnose this problem use "kubectl logs pod/argocd-server...". Using that I could see that the server pod was timing out when trying to connect to the redis pod, and that helped me narrow it down.

You saved my day :)

jeehunseo commented 2 years ago

In my case, something was wrong in the k8s network. The Kubernetes network (Calico IPIP) doesn't use only TCP/UDP; IPIP is its own IP protocol. If you use AWS, check whether your security group allows all protocols.

mars64 commented 2 years ago

In my case, there is some wrong in k8s network. Kubernetes network(calico:ipip) doesn't only use tcp/udp. If you use AWS, check if your security-group allow all-protocol.

This got me on the right path, thank you so much!

We're using the terraform-aws-eks module which only configures security groups for control plane by default. By adding the basic rules as per the complete example, I was able to resolve this issue.

aug70 commented 1 year ago

Unable to delete application resources: dial tcp 10.105.201.92:6379: i/o timeout

This is an indication that Argo CD cannot talk to the k8s API server and I think this may be environmental. Can you confirm the application controller is able to reach the managed cluster's API server?

@jessesuen I have the flannel CNI on an on-prem cluster, and I am still having the same issue, even after deleting the NetworkPolicies. I used openssl on the controller pod to reach the API server, and it connects just fine. So there is connectivity after all, but the problem still persists.

$ openssl s_client -connect 10.24.0.106:6443
CONNECTED(00000003)

or

$ openssl s_client -connect 10.24.0.106:10250
CONNECTED(00000003)

eagergirl2010 commented 1 year ago

I am still getting the error below after deleting all the argocd network policies. I have deployed argocd on a minikube cluster; it seems ArgoCD does not work well with minikube: Unable to connect SSH repository: connection error: desc = "transport: Error while dialing dial tcp: lookup argocd-repo-server: i/o timeout"

Can anyone help, please? It is very annoying.

xlanor commented 1 year ago

I replaced flannel with calico. That worked magically. No idea how or why.

tianomagdaong commented 1 year ago

In my case, there is some wrong in k8s network. Kubernetes network(calico:ipip) doesn't only use tcp/udp. If you use AWS, check if your security-group allow all-protocol.

This got me on the right path, thank you so much!

We're using the terraform-aws-eks module which only configures security groups for control plane by default. By adding the basic rules as per the complete example, I was able to resolve this issue.

Thank you. Solved by adding the SGs for basic rules as mentioned here.

cirulls commented 1 year ago

By adding the basic rules as per the complete example, I was able to resolve this issue.

This was the solution that worked for me too.

pablo-de commented 1 year ago

Another one here who could fix it by adding cluster_security_group_additional_rules & node_security_group_additional_rules.

darthale commented 1 year ago

In my case, there is some wrong in k8s network. Kubernetes network(calico:ipip) doesn't only use tcp/udp. If you use AWS, check if your security-group allow all-protocol.

This got me on the right path, thank you so much!

We're using the terraform-aws-eks module which only configures security groups for control plane by default. By adding the basic rules as per the complete example, I was able to resolve this issue.

Just to reiterate, this was the fix for me as well. Specifically, adding the below helped:

  # Extend cluster security group rules
  cluster_security_group_additional_rules = {
    egress_nodes_ephemeral_ports_tcp = {
      description                = "To node 1025-65535"
      protocol                   = "tcp"
      from_port                  = 1025
      to_port                    = 65535
      type                       = "egress"
      source_node_security_group = true
    }
  }

  # Extend node-to-node security group rules
  node_security_group_additional_rules = {
    ingress_self_all = {
      description = "Node to node all ports/protocols"
      protocol    = "-1"
      from_port   = 0
      to_port     = 0
      type        = "ingress"
      self        = true
    }
    egress_all = {
      description      = "Node all egress"
      protocol         = "-1"
      from_port        = 0
      to_port          = 0
      type             = "egress"
      cidr_blocks      = ["0.0.0.0/0"]
      ipv6_cidr_blocks = ["::/0"]
    }
  }

mustafa89 commented 1 year ago

After restarting a bunch of deployments, nothing worked. I checked the nodes on my cluster, and there were not enough spot instances due to capacity issues. Just an FYI to check that as well if someone struggles with this.

saddique164 commented 1 year ago

I am facing the following issue.

Unable to create application: application spec for app1 is invalid: InvalidSpecError: repository not accessible: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup argocd-repo-server: i/o timeout"

I am running my cluster on bare-metal Linux VMs. I have deleted all the network policies after reading the above comments. I also tried different versions, but still the same issue.

If someone has resolved the above issue, could you please share?

Tylermarques commented 1 year ago

I'd like to add that I am in the same position as @saddique164. I've been trying for the last 3 days to get ArgoCD running on a K3S cluster made up of bare metal + VMs, without any luck. The message I regularly run into is

time="2022-11-22T17:16:39Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp: lookup argocd-redis: i/o timeout"

in argocd-server and

time="2022-11-22T17:17:35Z" level=warning msg="Failed to save clusters info: dial tcp: lookup argocd-redis: i/o timeout"

in argocd-application-controller. These errors originate in the controller/clusterinfoupdater.go file, on either line 70 or 80. I cannot seem to trace where this comes from, but from the look of the log message, lookup argocd-redis: i/o timeout almost looks like it expected a port after argocd-redis.
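For what it's worth, dial tcp: lookup argocd-redis is Go's way of reporting that the DNS lookup itself failed, before any port comes into play, so this usually points at cluster DNS rather than a missing port. A quick in-cluster check (a sketch; assumes a pullable busybox image and the default argocd namespace, and the throwaway pod name is arbitrary):

```shell
# Resolve the argocd-redis Service name from inside the cluster to check
# whether in-cluster DNS is working at all.
check_redis_dns() {
  kubectl run -n argocd dns-test --rm -i --restart=Never --image=busybox -- \
    nslookup argocd-redis.argocd.svc.cluster.local
}
```

If this resolution fails or times out, the problem is CoreDNS reachability (as @jrhoward found above), not Argo CD itself.
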

saddique164 commented 1 year ago

@Tylermarques I resolved the issue. You have to perform the following steps.

  1. Add the cluster properly. Download the Argocd CLI; I used the following command: argocd cluster add kubernetes-admin@kubernetes --in-cluster # The status will be unknown until an app is deployed
  2. Remove the following network policies and restart the pods by deleting them: I. argocd-repo-server-network-policy II. argocd-server-network-policy
  3. Don't use the argo-example repo for the deployment; it won't work. Instead, create a public project in your own repo and push the changes. Then create SSH public and private keys using this command:

    ssh-keygen -t ed25519 -f argocd

Copy the public key into the "SSH and GPG keys" section of GitHub as an SSH key. Then go to the argocd settings in the GUI and add the GitHub repo using SSH; there you will need your private key. It will be added successfully. Then create the application. Once the application is created, the unknown cluster status will change to successful automatically.