Cloud-Mak opened 4 years ago
Thanks for replying @saddique164. I've added the cluster, deleted the network policies, and haven't used the demo. I'm trying to use a known good helm repo, one I wrote which works as expected when I run helm install ___ .
For anyone else having similar errors as below, I would recommend you go through the debugging process found here.
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp: lookup argocd-repo-server: i/o timeout"
My issue was DNS/CNI related. Flannel had failed to install for some reason, which meant that pods on different nodes could not reach each other.
I'm hitting the same issue with ArgoCD 2.5.3 deployed on the latest k3d.
When ArgoCD starts, applications are in Unknown status and they give:
rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.43.41.166:8081: i/o timeout"
All ArgoCD pods are running fine and DNS is working. Having auto-sync and self-heal enabled makes the apps reach a Healthy state, but they stay in Unknown sync status for a while.
@irizzant Are your argo pods all on the same node? My issue only reared its head once a pod was scheduled on a different node from the others.
@Tylermarques my pods are all on the same node; it looks like a problem in the NetworkPolicy resources to me. I don't have any other error or problem in the cluster, just this.
@irizzant also check the coredns pods. If they have been restarted, check the logs for any errors; if you find any issue, just restart the coredns pods.
@saddique164 I checked coredns pods and found no errors
@irizzant after deleting the policy, restart the argocd-redis, argocd-server, and argocd-repo-server pods: kill them and let them come back up. If you are not working on a production server, also restart the coredns pods. I believe that will resolve your issue.
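The restart step described above can be sketched as follows. This is a minimal sketch, not an official procedure: it assumes the default argocd namespace and the component names from the stock manifests (adjust if your install differs), and it only prints the commands so you can review them before running.

```shell
# Print the restart commands for review; pipe the output to `sh` to execute.
# Assumes the default "argocd" namespace and stock deployment names.
ns=argocd
for d in argocd-redis argocd-server argocd-repo-server; do
  printf 'kubectl -n %s rollout restart deployment/%s\n' "$ns" "$d"
done
# coredns lives in kube-system; restart it only if this is not production:
printf 'kubectl -n kube-system rollout restart deployment/coredns\n'
```

Using `rollout restart` instead of deleting pods by hand gets the same "kill them and let them come back up" effect while letting the controller manage the replacement.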
@saddique164 the workaround you described seems to work for me (like I said, no problem was detected at the DNS level), but it also sounds like confirmation that there's something wrong with the NetworkPolicy resources.
In my case, I was not specifying a nodeSelector
in my Helm values. This caused some of the pods to sometimes land on nodes in a worker group that (similar to what others have described above) did not have the proper security group rules in place.
Using a nodeSelector/affinity to force the pods not to land on these worker node groups solved the issue.
My problem was fixed by updating flannel deployment to the newest version
Thanks for the hint @jrhoward, but unfortunately this did not solve the issue for me.
It seems that none of the Argo services can talk to Redis.
argocd-server has logs like:

time="2021-06-09T09:42:15Z" level=warning msg="Failed to resync revoked tokens. retrying again in 1 minute: dial tcp 10.43.248.24:6379: i/o timeout"
These are the startup logs of argocd-application-controller:

time="2021-06-09T09:40:06Z" level=info msg="Processing all cluster shards"
time="2021-06-09T09:40:06Z" level=info msg="appResyncPeriod=3m0s"
time="2021-06-09T09:40:06Z" level=info msg="Application Controller (version: v2.0.3+8d2b13d, built: 2021-05-27T17:38:37Z) starting (namespace: argocd)"
time="2021-06-09T09:40:06Z" level=info msg="Starting configmap/secret informers"
time="2021-06-09T09:40:06Z" level=info msg="Configmap/secret informer synced"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="0xc00097b1a0 subscribed to settings updates"
time="2021-06-09T09:40:06Z" level=info msg="Refreshing app status (normal refresh requested), level (2)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Starting clusterSecretInformer informers"
time="2021-06-09T09:40:06Z" level=info msg="Ignore status for CustomResourceDefinitions"
time="2021-06-09T09:40:06Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: drone)" application=drone
time="2021-06-09T09:40:06Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
W0609 09:40:06.719086 1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
W0609 09:40:06.815954 1 warnings.go:70] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
time="2021-06-09T09:40:26Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:40:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:36Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:41:56Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:16Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
time="2021-06-09T09:42:37Z" level=warning msg="Failed to save clusters info: dial tcp 10.43.248.24:6379: i/o timeout"
...
It also seems that the services cannot talk to each other. I found this in the argocd-server logs:

time="2021-06-09T08:58:15Z" level=warning msg="finished unary call with code Unavailable" error="rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.43.55.98:8081: connect: connection refused\"" grpc.code=Unavailable grpc.method=GetAppDetails grpc.service=repository.RepositoryService grpc.start_time="2021-06-09T08:58:13Z" grpc.time_ms=2007.533 span.kind=server system=grpc

where 10.43.55.98 is the ClusterIP of the argocd-repo-server service. I'm very puzzled.
Hi there, I had the same issue and was stuck on it for several days. Logs from the argocd-server showed that its connection to argocd-redis was not established or timed out on every action it took, so it might be a problem with the argocd-server or its network. Check if the argocd-redis pod is ready; if so, check if the CoreDNS of the k8s cluster is ready and healthy. In my case it was my calico-node pod that was not in Running state!
kubectl -n kube-system delete pods calico-node-876rj
and tadaaaa :) It was helpful to me; hope it works for you.
For me, removing network policies works.
Just for completeness: I think this issue may be related to an incompatibility between the k8s version and the ArgoCD version. I'm using k8s 1.24 and tried to install the stable version of Argo, which produced this issue. But installing version 2.4 of Argo CD works after removing the Redis network policy (I didn't try without removing the network policy).
Ran into this issue with bare-metal K8s with flannel installed as the CNI. Like others in the thread, migrating from flannel to calico resolved the issue for me.
If you are running Weave and having issues with connectivity with the network policies in place, check the IPALLOC_RANGE setting in weave referenced here: https://www.weave.works/docs/net/latest/kubernetes/kube-addon/ If you do not have this set correctly, you will see your argo traffic BLOCKED showing up in the weave pod logs ( grep the weave logs for 'BLOCKED').
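To illustrate the grep suggested above: the snippet below scans a fabricated Weave log excerpt for blocked connections. The log line format is an assumption made for illustration; on a live cluster you would pipe real logs instead (the label and container names shown in the comment are those of the standard Weave addon, so adjust if yours differ).

```shell
# Fabricated sample of the kind of Weave NPC log line to look for.
# On a real cluster you would instead run something like:
#   kubectl -n kube-system logs -l name=weave-net -c weave-npc | grep -i blocked
cat > /tmp/weave-sample.log <<'EOF'
INFO: 2021/06/09 09:40:06 EVENT UpdatePod ...
WARN: 2021/06/09 09:40:26 TCP connection from 10.32.0.5:48312 to 10.43.55.98:8081 blocked by Weave NPC.
EOF

# Any hit here means a NetworkPolicy (or IPALLOC_RANGE mismatch) is dropping traffic.
grep -i 'blocked' /tmp/weave-sample.log
```

If the source IP in the blocked line is outside the range Weave thinks it owns, that points at the IPALLOC_RANGE misconfiguration described above.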
If your cluster is in AWS, open all traffic for the internal connections between the node subnets. All problems will be sorted.
Hi all :) I had the same problem. In my case, I use Ubuntu with microk8s. In the configmap I found this line: https://github.com/argoproj/argo-cd/blob/ef7f32eb844739d8ae5b5feb987f32fa63024226/docs/operator-manual/argocd-cmd-params-cm.yaml#L13
I enabled the CoreDNS addon and the problem was solved. After that the GitHub repository was added successfully. Maybe it helps somebody.
On my side, changing the K3s install from --flannel-backend=host-gw to --flannel-backend=vxlan solved this issue.
Before:
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Address 1: 10.143.0.10
nslookup: can't resolve 'kubernetes.default'
After:
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
If you don't see a command prompt, try pressing enter.
Server: 10.143.0.10
Address 1: 10.143.0.10 kube-dns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.143.0.1 kubernetes.default.svc.cluster.local
pod "busybox" deleted
This issue was not present until I decided to move from a single K3s node to a cluster of 3.
I had been struggling with similar problems this week getting ArgoCD working in a small Vagrant/VirtualBox environment. I switched from Flannel to Calico and everything just magically started working.
I'm using Calico but I have the same issue!
Potentially only a small number of you are affected, but I would still like to point this out. We use Cilium in the vSphere environment and I run into the same problem if I do not implement the solution described in #21801.
I'm facing the same issue. I tried everything but failed. Then I deleted all the NetworkPolicies in the argocd namespace and restarted all the pods. It worked fine.
I'm having the same issue to the point that I can't even use ArgoCD properly and I ran out of ideas on how to fix this ...
I just deleted the argocd-redis-network-policy in the argocd namespace and it worked immediately:
kubectl delete networkpolicies -n argocd argocd-redis-network-policy
networkpolicy.networking.k8s.io "argocd-redis-network-policy" deleted
thought I'd chime in with what the solution ended up being for me here - I had installed calico just after building the cluster (before joining nodes, critically). Then installed argocd.
However, a restart of the kube-system.coredns deployment was required before argocd could actually make use of the calico CNI setup. Hope this helps somebody with the same problem, though I'm sure there are many other ways this might be happening in various clusters.
I had everything configured correctly; I had to reconnect the same repo again to make it work.
On our new cluster this happened because we had not yet added the pod and service CIDRs to nonMasqueradeCIDRs in the ip-masq-agent ConfigMap.
nc -v -w 5 <repo-server-cluster-ip> 8081
from a debug container also returned a timeout
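On the ip-masq-agent point above, the fix is to list the cluster's pod and service CIDRs under nonMasqueradeCIDRs. The fragment below is a sketch only: it assumes the standard ip-masq-agent deployment in kube-system, and the CIDRs are placeholder examples that must be replaced with your cluster's actual ranges.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
      - 10.244.0.0/16   # example pod CIDR (placeholder; use your cluster's)
      - 10.96.0.0/12    # example service CIDR (placeholder)
    masqLinkLocal: false
```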
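If the debug image has no nc, the same port check can be sketched with bash's built-in /dev/tcp redirection. This is a hedged sketch, not part of any ArgoCD tooling; replace the host and port with the repo-server ClusterIP and 8081.

```shell
# Probe a TCP port without nc, using bash's /dev/tcp pseudo-device.
probe() {
  if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
    echo "$1:$2 reachable"
  else
    echo "$1:$2 unreachable"
  fi
}

# Example invocation; port 1 on localhost is normally closed,
# so this typically reports "unreachable".
probe 127.0.0.1 1
```

A timeout here (as opposed to an immediate "connection refused") usually points at a NetworkPolicy or CNI routing problem rather than the service being down.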
Hi All,
I'm exploring ArgoCD. It's quite a neat project. I have deployed ArgoCD on a k8s 1.17 cluster (1 master, 2 workers) running on 3 LXD containers. I can use other stuff like MetalLB, ingress, Rancher etc. fine with this cluster.
For some reason, my ArgoCD isn't working the expected way. I was able to get the ArgoCD UI login working by using the bypass method in bug 4148 that I reported earlier.
Here are the services in the argocd namespace:
After I got to the UI, I tried creating a new sample project from the GUI; it failed. Below are the argocd-server logs from that time.
I even tried creating the app the declarative way: I created this YAML and applied the manifest using the kubectl apply -f method. This created an app visible in the GUI, but it was never deployed. The health status eventually became Healthy, but the sync status remained Unknown.
From the GUI, I can see the below errors under the application's conditions, one after another.
When I tried deleting the app from the GUI, it got stuck deleting, with the below error visible under events in the GUI.
As of now, nothing is working for me in ArgoCD. I am clueless as to what to do next.