argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

i/o timeout errors in redis and argocd-repo-server PODs #4174

Open Cloud-Mak opened 4 years ago

Cloud-Mak commented 4 years ago

Hi All,

I'm exploring Argo CD; it's quite a neat project. I have deployed Argo CD on a K8s 1.17 cluster (1 master, 2 workers) running over 3 LXD containers. Other things like MetalLB, ingress, and Rancher work fine with this cluster.

For some reason, my Argo CD isn't working the expected way. I was able to get the Argo CD UI login working by using the bypass method in bug #4148 that I reported earlier.

Here are the services in the argocd namespace:

NAME                            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                      AGE
service/argocd-dex-server       ClusterIP   10.102.198.189   <none>        5556/TCP,5557/TCP,5558/TCP   25h
service/argocd-metrics          ClusterIP   10.104.80.68     <none>        8082/TCP                     25h
service/argocd-redis            ClusterIP   10.105.201.92    <none>        6379/TCP                     25h
service/argocd-repo-server      ClusterIP   10.98.76.94      <none>        8081/TCP,8084/TCP            25h
service/argocd-server           NodePort    10.101.169.46    <none>        80:32046/TCP,443:31275/TCP   25h
service/argocd-server-metrics   ClusterIP   10.107.61.179    <none>        8083/TCP                     25h

After I got my UI, I tried creating a new sample project from the GUI, and it failed. Below are the argocd-server logs from that time:

time="2020-08-27T09:22:21Z" level=info msg="received unary call /repository.RepositoryService/List" grpc.method=List grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content= grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:22:21Z" span.kind=server system=grpc
time="2020-08-27T09:22:21Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:22:21Z" grpc.time_ms=0.318 span.kind=server system=grpc
time="2020-08-27T09:22:21Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=List grpc.service=project.ProjectService grpc.start_time="2020-08-27T09:22:21Z" grpc.time_ms=3.441 span.kind=server system=grpc
time="2020-08-27T09:23:52Z" level=info msg="received unary call /repository.RepositoryService/ListApps" grpc.method=ListApps grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="repo:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" revision:\"HEAD\" " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:23:52Z" span.kind=server system=grpc
time="2020-08-27T09:26:39Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout\"" grpc.code=Unavailable grpc.method=ListApps grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:23:52Z" grpc.time_ms=167124.3 span.kind=server system=grpc
time="2020-08-27T09:28:16Z" level=info msg="Alloc=10005 TotalAlloc=1978587 Sys=71760 NumGC=257 Goroutines=158"
time="2020-08-27T09:28:31Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"y\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:28:31Z" span.kind=server system=grpc
2020/08/27 09:28:48 proto: tag has too few fields: "-"
time="2020-08-27T09:28:48Z" level=info msg="received unary call /application.ApplicationService/Create" grpc.method=Create grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="application:<TypeMeta:<kind:\"\" apiVersion:\"\" > metadata:<name:\"app1\" generateName:\"\" namespace:\"\" selfLink:\"\" uid:\"\" resourceVersion:\"\" generation:0 creationTimestamp:<0001-01-01T00:00:00Z> clusterName:\"\" > spec:<source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"yamls\" targetRevision:\"HEAD\" chart:\"\" > destination:<server:\"https://kubernetes.default.svc\" namespace:\"default\" > project:\"default\" > status:<sync:<status:\"\" comparedTo:<source:<repoURL:\"\" path:\"\" targetRevision:\"\" chart:\"\" > destination:<server:\"\" namespace:\"\" > > revision:\"\" > health:<status:\"\" message:\"\" > sourceType:\"\" summary:<> > > " grpc.service=application.ApplicationService grpc.start_time="2020-08-27T09:28:48Z" span.kind=server system=grpc
time="2020-08-27T09:31:11Z" level=info msg="received unary call /repository.RepositoryService/ListApps" grpc.method=ListApps grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="repo:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" revision:\"HEAD\" " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" span.kind=server system=grpc
time="2020-08-27T09:31:11Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"yamls\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" span.kind=server system=grpc
time="2020-08-27T09:31:29Z" level=info msg="finished unary call with code InvalidArgument" error="rpc error: code = InvalidArgument desc = application spec is invalid: InvalidSpecError: Unable to get app details: rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=InvalidArgument grpc.method=Create grpc.service=application.ApplicationService grpc.start_time="2020-08-27T09:28:48Z" grpc.time_ms=161011.11 span.kind=server system=grpc
time="2020-08-27T09:31:33Z" level=warning msg="finished unary call with code DeadlineExceeded" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" grpc.code=DeadlineExceeded grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:28:31Z" grpc.time_ms=182001.84 span.kind=server system=grpc
time="2020-08-27T09:31:33Z" level=info msg="received unary call /repository.RepositoryService/GetAppDetails" grpc.method=GetAppDetails grpc.request.claims="{\"iat\":1598519577,\"iss\":\"argocd\",\"nbf\":1598519577,\"sub\":\"admin\"}" grpc.request.content="source:<repoURL:\"https://github.com/Cloud-Mak/Demo_ArgoCD.git\" path:\"y\" targetRevision:\"HEAD\" chart:\"\" > " grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:33Z" span.kind=server system=grpc
time="2020-08-27T09:33:31Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout\"" grpc.code=Unavailable grpc.method=ListApps grpc.service=repository.RepositoryService grpc.start_time="2020-08-27T09:31:11Z" grpc.time_ms=140004.14 span.kind=server system=grpc

I even tried creating the app the declarative way. I created a YAML and applied the manifest using kubectl apply -f. This created an app visible in the GUI, but it was never deployed. The health status eventually became Healthy, but the sync status remained Unknown.
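
The YAML itself isn't pasted here, but a minimal sketch can be reconstructed from the Create request in the logs above (every field below matches the logged request; metadata.namespace for the Application resource is an assumption, since the log shows it empty):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: app1
  namespace: argocd   # assumption: the Application lives in the argocd namespace
spec:
  project: default
  source:
    repoURL: https://github.com/Cloud-Mak/Demo_ArgoCD.git
    path: yamls
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: default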

From the GUI, I can see the errors below under application conditions, one after another:

ComparisonError
rpc error: code = DeadlineExceeded desc = context deadline exceeded

ComparisonError
rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.98.76.94:8081: i/o timeout"

When I tried deleting the app from the GUI, it was stuck deleting, with the errors below visible under events in the GUI:

DeletionError
dial tcp 10.105.201.92:6379: i/o timeout
Unable to load data: dial tcp 10.105.201.92:6379: i/o timeout
Unable to delete application resources: dial tcp 10.105.201.92:6379: i/o timeout
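
For what it's worth, a quick way to check whether anything can reach Redis at all is to open a raw TCP connection from a throwaway debug pod; a sketch, using the argocd-redis ClusterIP from the service list above (nicolaka/netshoot is just one convenient debug image that ships nc):

kubectl run -n argocd --rm -it nettest --image=nicolaka/netshoot --restart=Never -- nc -v -w 5 10.105.201.92 6379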

As of now, nothing is working for me in Argo CD. I am clueless as to what to do next.

Tylermarques commented 1 year ago

Thanks for replying @saddique164. I've added the cluster, deleted the network policies, and haven't used the demo. I'm trying to use a known-good Helm repo, one I wrote, which works as expected when I run helm install ___ .

For anyone else seeing errors similar to the one below, I would recommend going through the debugging process found here.

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup argocd-repo-server: i/o timeout"

My issue was DNS/CNI related. Flannel had failed to install for some reason, which meant that pods on different nodes could not reach each other.
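
If you suspect the same thing, two quick checks narrow it down: service-name resolution, and a raw TCP connection to the repo-server port. A sketch (nicolaka/netshoot is an arbitrary debug image that includes nc):

# 1. Can pods resolve the repo-server service?
kubectl run -it --rm --restart=Never dnstest --image=busybox:1.28 -- nslookup argocd-repo-server.argocd.svc.cluster.local

# 2. Can pods open a TCP connection to it? Repeat from different nodes if you suspect cross-node traffic.
kubectl run -it --rm --restart=Never -n argocd tcptest --image=nicolaka/netshoot -- nc -v -w 5 argocd-repo-server 8081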

irizzant commented 1 year ago

I'm hitting the same issue: Argo CD 2.5.3 deployed with the latest version of k3d.

When Argo CD starts, applications are in Unknown status and they give rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.43.41.166:8081: i/o timeout"

All Argo CD pods are running fine and DNS is working. Having Argo CD apps with auto-sync and self-heal enabled makes them reach a Healthy state, but they stay in Unknown status for a while.

Tylermarques commented 1 year ago

@irizzant Are your Argo pods all on the same node? My issue only showed its head once a pod was scheduled on a separate node from the others.

irizzant commented 1 year ago

@Tylermarques my pods are all on the same node; it looks like a problem with the NetworkPolicy resources to me. I don't have any other error or problem in the cluster besides this.

saddique164 commented 1 year ago

@irizzant also check the coredns pods. If they have been restarted, check the logs for errors. If you find any issue, just restart the coredns pods.
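
For reference, something like this shows the coredns pods and their recent logs (assuming the standard k8s-app=kube-dns label):

kubectl -n kube-system get pods -l k8s-app=kube-dns
kubectl -n kube-system logs -l k8s-app=kube-dns --tail=50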

irizzant commented 1 year ago

@saddique164 I checked the coredns pods and found no errors.

saddique164 commented 1 year ago

@irizzant after deleting the policy, restart the argocd-redis, argocd-server, and argocd-repo-server pods: kill them and let them be recreated. If you are not working on a production cluster, also restart the coredns pods. I believe that it will resolve your issue.
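
With the default manifests, that amounts to something like the following sketch (note that argocd-application-controller is a StatefulSet in recent versions, so restart it separately if needed):

kubectl -n argocd rollout restart deployment argocd-redis argocd-server argocd-repo-server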

irizzant commented 1 year ago

@saddique164 the workaround you described seems to work for me (like I said, no problem was detected at the DNS level), but it also sounds like confirmation that there's something wrong with the NetworkPolicy resources.

pc-tzimmerman commented 1 year ago

In my case, I was not specifying a nodeSelector in my Helm values. This caused some of the pods to sometimes land on nodes in a worker group that (similar to what others have described above) did not have the proper security group rules in place.

Using a nodeSelector/affinity to force the pods not to land on these worker node groups solved the issue.
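
A sketch of what that can look like in Helm values, assuming the community argo-helm chart (which exposes a global nodeSelector; the label key/value below are placeholders for your own node labels):

# values.yaml
global:
  nodeSelector:
    node-group: argocd-workers   # placeholder: a label carried by the allowed nodes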

bytesWhisperer commented 1 year ago

My problem was fixed by updating the flannel deployment to the newest version.

mojitaleghani commented 1 year ago

Thanks for the hint @jrhoward, but unfortunately this did not solve the issue for me.

It seems that none of the Argo services can talk to Redis.

argocd-server has logs like:

time="2021-06-09T09:42:15Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.248.24:6379: i/o timeout"

These are the startup logs of argocd-application-controller:

time="2021-06-09T09:40:06Z" level=info msg="Processing all cluster shards"
time="2021-06-09T09:40:06Z" level=info msg="appResyncPeriod=3m0s"
time="2021-06-09T09:40:06Z" level=info msg="Application Controller (version: v2.0.3+8d2b13d, built: 2021-05-27T17:38:37Z) starting (namespace: argocd)"
time="2021-06-09T09:40:06Z" level=info msg="Starting configmap/secret informers"
time="2021-06-09T09:40:06Z" level=info msg="Configmap/secret informer synced"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="0xc00097b1a0 subscribed to settings updates"
time="2021-06-09T09:40:06Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Starting clusterSecretInformer informers"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: drone)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
W0609 09:40:06.719086       1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0609 09:40:06.815954       1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-06-09T09:40:26Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:40:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:36Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:37Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
...

It also seems that the services cannot talk to each other. I found this in the argocd-server logs:

time="2021-06-09T08:58:15Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.43.55.98:8081: connect: connection refused\"" grpc.code=Unavailable grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2021-06-09T08:58:13Z" grpc.time_ms=2007.533 span.kind=server system=grpc

where 10.43.55.98 is the ClusterIP of the argocd-repo-server service.

I'm very puzzled.

Hi there, I had the same issue and was stuck on it for several days. Logs from the argocd-server showed that its connection to argocd-redis was not established or timed out on every action it took, so it might be a problem with the argocd-server or its network. Check if the argocd-redis pod is ready; if so, check if the CoreDNS of the k8s cluster is ready and healthy. In my case it was my calico-node that was not in a Running state!

kubectl -n kube-system delete pods calico-node-876rj

and tadaaaa :) it was helpful to me. Hope it works for you.

patrickacioli commented 1 year ago

For me, removing the network policies worked.
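
For anyone wanting to see what's in place first, a sketch (note that deleting the policies removes whatever protection they provide, so treat this as a debugging step):

kubectl -n argocd get networkpolicies
kubectl -n argocd delete networkpolicies --all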

patrickacioli commented 1 year ago

Just to complete the picture, I think this issue may be related to an incompatibility between the k8s version and the Argo CD version. I'm using k8s 1.24 and tried to install the stable version of Argo, which produced this issue. But installing version 2.4 of Argo CD works after removing the Redis network policy (I didn't try without removing the network policy).

BallsyWalnuts commented 1 year ago

Ran into this issue with bare-metal K8s with flannel installed as the CNI. Like others in the thread, migrating from flannel to calico resolved the issue for me.

rbalukas2 commented 1 year ago

If you are running Weave and having connectivity issues with the network policies in place, check the IPALLOC_RANGE setting in Weave, referenced here: https://www.weave.works/docs/net/latest/kubernetes/kube-addon/ If you do not have this set correctly, you will see your Argo traffic showing up as BLOCKED in the weave pod logs (grep the weave logs for 'BLOCKED').
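
A sketch of that grep, assuming the standard Weave DaemonSet label and container name:

kubectl -n kube-system logs -l name=weave-net -c weave | grep BLOCKED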

tanrobotix commented 1 year ago

If your cluster is in AWS, open all traffic for internal connections between the node subnets. All problems will be sorted.
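
With the AWS CLI, that can look like the following sketch (the security group ID is a placeholder; this adds a self-referencing rule allowing all traffic between nodes that share the group):

aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --ip-permissions 'IpProtocol=-1,UserIdGroupPairs=[{GroupId=sg-0123456789abcdef0}]'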

v-porytskyi commented 1 year ago

Hi all :) I had the same problem. In my case, I use Ubuntu with microk8s. So, in the configmap I found the following line: https://github.com/argoproj/argo-cd/blob/ef7f32eb844739d8ae5b5feb987f32fa63024226/docs/operator-manual/argocd-cmd-params-cm.yaml#L13

I enabled the CoreDNS addon and the problem was solved. After that, the GitHub repository was added successfully. Maybe it helps somebody.
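
For anyone else on MicroK8s, that is just:

microk8s enable dns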

sebbbastien commented 10 months ago

On my side, changing the K3s install from --flannel-backend=host-gw to --flannel-backend=vxlan solved this issue.
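
On K3s this can be set either as an install flag or in the server config file; a sketch (the change only takes effect after restarting the k3s service on each server node):

# /etc/rancher/k3s/config.yaml
flannel-backend: vxlan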

Before:

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.143.0.10

nslookup: can't resolve 'kubernetes.default'

After:

kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Server:    10.143.0.10
Address 1: 10.143.0.10 kube-dns.kube-system.svc.cluster.local

Name:      kubernetes.default
Address 1: 10.143.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted

This issue only appeared after I decided to move from a single K3s node to a cluster of 3.

medXPS commented 9 months ago

I had been struggling with similar problems this week getting ArgoCD working in a small Vagrant/VirtualBox environment. I switched from Flannel to Calico and everything just magically started working.

I'm using Calico but I have the same issue!

steled commented 9 months ago

Potentially only a small number of you are affected, but I would still like to point this out. We use Cilium in the vSphere environment and I run into the same problem if I do not implement the solution described in #21801.

Balti006 commented 8 months ago

I'm facing the same issue. I tried everything but failed. Then I deleted all the NetworkPolicies in the argocd namespace and restarted all the pods. It worked fine.

SavaMihai commented 7 months ago

I'm having the same issue to the point that I can't even use ArgoCD properly and I ran out of ideas on how to fix this ...

timothepoznanski commented 7 months ago

I just deleted the argocd-redis-network-policy in the argocd namespace and it worked immediately:

kubectl delete networkpolicies -n argocd argocd-redis-network-policy
networkpolicy.networking.k8s.io "argocd-redis-network-policy" deleted

condaatje commented 4 months ago

Thought I'd chime in with what the solution ended up being for me here: I had installed Calico just after building the cluster (critically, before joining nodes), then installed Argo CD.

However, a restart of the kube-system coredns deployment was required before Argo CD could actually make use of the Calico CNI setup. Hope this helps somebody with the same problem, though I'm sure there are many other ways this might be happening in various clusters.
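
For reference, that restart is just (assuming the default deployment name, coredns):

kubectl -n kube-system rollout restart deployment coredns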

Drezir commented 2 months ago

I had everything done correctly. I had to reconnect the same repo again to make it work.

leonardo-zorzi commented 1 week ago

On our new cluster this happened because we had not yet set the pod and service CIDRs in nonMasqueradeCIDRs of the ip-masq-agent ConfigMap.

Running nc -v -w 5 <repo-server-cluster-ip> 8081 from a debug container also returned a timeout.
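
For reference, a sketch of what the ip-masq-agent ConfigMap can look like once the ranges are added (the CIDRs below are placeholders for your actual pod and service ranges):

apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.244.0.0/16   # placeholder: pod CIDR
      - 10.96.0.0/12    # placeholder: service CIDR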