Akinorev opened this issue 1 year ago
In my tests, a vanilla HA installation of v2.5.2, or an upgrade to it (e.g. v2.5.1 -> v2.5.2), both fail at the `redis-ha-server` StatefulSet component.
```
# kubectl get pods
NAME                                                READY   STATUS     RESTARTS      AGE
argocd-redis-ha-haproxy-755db98494-pnkbq            1/1     Running    0             14m
argocd-redis-ha-haproxy-755db98494-q5tmw            1/1     Running    0             14m
argocd-redis-ha-haproxy-755db98494-hjj29            1/1     Running    0             14m
argocd-redis-ha-server-0                            3/3     Running    0             14m
argocd-redis-ha-server-1                            3/3     Running    0             13m
argocd-redis-ha-haproxy-5b8f6b7fdd-7q7gh            0/1     Pending    0             3m7s
argocd-applicationset-controller-57bfc6fdb8-phstq   1/1     Running    0             3m7s
argocd-server-6f4c7b9859-dlln8                      1/1     Running    0             3m6s
argocd-notifications-controller-954b6b785-jwwg8     1/1     Running    0             3m2s
argocd-repo-server-569dc6f989-xgnnw                 1/1     Running    0             3m6s
argocd-dex-server-866c9bdd5b-rxb8x                  1/1     Running    0             3m7s
argocd-server-6f4c7b9859-twn6w                      1/1     Running    0             3m1s
argocd-application-controller-0                     1/1     Running    0             3m2s
argocd-repo-server-569dc6f989-h478x                 1/1     Running    0             2m56s
argocd-redis-ha-server-2                            0/3     Init:0/1   1 (32s ago)   2m4s
```
```
# kubectl logs argocd-redis-ha-server-2 -c config-init
Tue Nov 22 04:30:41 UTC 2022 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
```
Sounds a little off that the redis-ha-server component is waiting for itself...?
I'm having the same issue in my namespaced HA install; it seems similar to a previous problem with Redis and IPv6. After adding `bind 0.0.0.0` to both the sentinel and redis.conf configs, the database starts fine, but HAProxy still shows 0 masters available, and argocd-server is also complaining of a timeout against the database.
I'm also having a similar issue when using ArgoCD HA v2.5.2: all `argocd-redis-ha-haproxy` pods go into `Init:CrashLoopBackOff`. I'm running on a GKE cluster, version 1.23.11-gke.300. Downgrading to ArgoCD HA v2.4.17 fixed it for me. I can provide more information about my setup if useful.
If everyone could please provide a few additional details about your particular cluster setup in your comments:

- Cluster type? E.g. GKE, AWS, Azure, Digital Ocean
- CNI you are using?
- Kubernetes version?
- IP family? IPv4, IPv6, dual stack, or IPv6 disabled
- Are you using a service mesh?
Same issue here.
Happening with v2.5.1 and v2.5.2
I had the issue in versions v2.5.1 and v2.5.2 and had to roll back to 2.4.6, where it is working fine.

- Cluster type: TKG-based cluster
- CNI: Antrea
- Kubernetes: 1.19.9
- IP family: IPv6 disabled
- Are you using a service mesh: no
I created PR #11418 if you could please test the HA manifest in a dev environment and provide feedback. This will be based on the master branch and is not suitable for production. IPv6 only environments will not be compatible.
I will also conduct testing on my side over the next few days.
My results:
```
argocd-redis-ha-haproxy-59b5d8568b-kcvz6   0/1   Init:Error              2 (2m25s ago)   6m41s
argocd-redis-ha-haproxy-59b5d8568b-pbpjf   0/1   Init:CrashLoopBackOff   2 (17s ago)     6m41s
argocd-redis-ha-haproxy-59b5d8568b-ssnmq   0/1   Init:CrashLoopBackOff   2 (20s ago)     6m41s
argocd-redis-ha-server-0                   0/3   Init:Error              3 (2m2s ago)    6m41s
```

redis-ha-server `config-init` log:

```
Sat Nov 26 14:20:03 UTC 2022 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
```

haproxy `config-init` log:

```
Waiting for service argocd-redis-ha-announce-0 to be ready (1) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (2) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (3) ...
...
```
Most of the time the status of the failing pods is `Init:0/1`.

- @34fathombelow's PR: all pods are starting
- v2.5.1: all pods are starting
I can confirm that this is solved with 2.5.3.
Thank you!
Can also confirm this is fixed for me with 2.5.3 Thanks :)
I tried @34fathombelow's solution. Now the pods are starting, but I still have an issue with Redis.

From the redis pods:

```
1:C 01 Dec 2022 11:07:19.788 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 01 Dec 2022 11:07:19.788 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 01 Dec 2022 11:07:19.788 # Configuration loaded
1:M 01 Dec 2022 11:07:19.789 monotonic clock: POSIX clock_gettime
1:M 01 Dec 2022 11:07:19.792 # Warning: Could not create server TCP listening socket :::6379: unable to bind socket, errno: 97
1:M 01 Dec 2022 11:07:19.793 Running mode=standalone, port=6379.
1:M 01 Dec 2022 11:07:19.793 # Server initialized
1:M 01 Dec 2022 11:07:19.794 Ready to accept connections
```

The haproxy pods start failing but eventually come up:

```
[WARNING] (7) : Server bk_redis_master/R0 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (7) : Server bk_redis_master/R1 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (7) : Server bk_redis_master/R2 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] (7) : backend 'bk_redis_master' has no server available!
[WARNING] (7) : Server bk_redis_master/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 7ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
```

argocd-server has the following errors all the time:

```
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
```
I just found this issue. Trying to upgrade from 2.4.17 to 2.5.5 and I'm running into the original error. Should I just follow this issue and try back when I see it closed, or do you guys need some help testing/validating the fix?
Thanks!
https://github.com/argoproj/argo-cd/issues/5957 feels related. We also see the same issue with an IPv4 cluster on a TKG cluster.
EDIT: Confirmed, adding `bind 0.0.0.0` to redis and sentinel fixed the issue.
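For concreteness, the fix described above amounts to adding a `bind` directive to both generated config files. A minimal sketch, assuming the file paths shown in the init logs earlier in this thread:

```
# /data/conf/redis.conf
bind 0.0.0.0

# /data/conf/sentinel.conf
bind 0.0.0.0
```

Note this forces IPv4-only listening; on an IPv6-only cluster, `bind ::` would be needed instead.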
Hi @crenshaw-dev,
I just wanted to report that we're still facing the issue with version 2.5.6 and the HA setup. We just upgraded our Argo dev instance from v2.4.8 to 2.5.6 via `kubectl apply -n argocd-dev -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.5.6/manifests/ha/install.yaml`, and now our argocd-redis-ha-server-0 pod is no longer coming up due to:
```
Tue Jan 17 09:05:44 UTC 2023 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Tue Jan 17 09:06:59 UTC 2023 Did not find redis master ()
Identify announce ip for this pod..
  using (argocd-redis-ha-announce-0) or (argocd-redis-ha-server-0)
  identified announce ()
/readonly-config/init.sh: line 239: Error: Could not resolve the announce ip for this pod.: not found
Stream closed EOF for argocd-dev/argocd-redis-ha-server-0 (config-init)
```
I am also experiencing the same issue @FrittenToni describes above. `argocd-redis-ha-server` starts up fine in 2.4.19, but fails on 2.5.5, 2.5.6, and 2.5.7.
Same problem with 2.5.10 on OKD 4.12. The argocd-redis-ha-server starts up fine in 2.4.19 but fails on 2.5.10.
Same here. The only 2.5.x version that's working is v2.5.3+0c7de21.

Same here, failing on 2.5.6, 2.5.10, and 2.6.1 deployments.
Did someone try 2.6.2?
> Did someone try 2.6.2?
Just did, same result.
```
pod/argocd-redis-ha-haproxy-c85b7ffd6-kh56p   0/1   Init:CrashLoopBackOff   18 (4m59s ago)   110m
pod/argocd-redis-ha-haproxy-c85b7ffd6-lsbmj   0/1   Init:0/1                19 (5m21s ago)   110m
pod/argocd-redis-ha-haproxy-c85b7ffd6-qktcv   0/1   Init:0/1                19 (5m9s ago)    110m
pod/argocd-redis-ha-server-0                  0/3   Init:CrashLoopBackOff   20 (3m39s ago)   110m
```
Not sure if this was anyone else's problem, but in my specific case I was scaling argocd-redis-ha from 3 to 5 replicas, while the chart only deploys 3 argocd-redis-ha-announce Services, so I had to deploy two additional ones.
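For anyone scaling the same way, the missing Services look roughly like the ones the chart already generates. A hypothetical sketch for the fourth replica — the selector labels and ports here are assumptions modeled on a typical redis-ha announce Service, so copy them from an existing `argocd-redis-ha-announce-N` in your cluster:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-redis-ha-announce-3   # one extra Service per added replica
spec:
  publishNotReadyAddresses: true     # pods must be resolvable before they are Ready
  selector:
    app: redis-ha                    # assumed label; verify against an existing announce Service
    statefulset.kubernetes.io/pod-name: argocd-redis-ha-server-3
  ports:
  - name: tcp-server
    port: 6379
    targetPort: 6379
  - name: tcp-sentinel
    port: 26379
    targetPort: 26379
```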
I noticed that this issue appeared when we upgraded our cluster to k8s version v1.23: `getent hosts` cannot resolve anything in the cluster.local domain.
```
$ time oc exec argocd-redis-ha-server-0 -c config-init -- getent hosts argocd-redis-ha
command terminated with exit code 2

real    0m10.273s
user    0m0.121s
sys     0m0.036s

$ time oc exec argocd-application-controller-0 -- getent hosts argocd-redis-ha
172.30.122.223  argocd-redis-ha.argocd.svc.cluster.local

real    0m0.273s
user    0m0.120s
sys     0m0.040s
```
It seems the network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy have to be reviewed. After deleting both policies, everything started to work.

I checked: no other network policy defines ports for DNS, and only the above two have port 53 defined, which is incorrect for OpenShift. Changing the UDP/TCP ports to 5353 brought everything back to life.
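In NetworkPolicy terms, the change described above is just swapping the DNS egress ports. A sketch of the relevant egress rule after the fix — the surrounding policy fields are omitted and the rule's structure is an assumption, so check it against your installed policy:

```yaml
egress:
- ports:
  - port: 5353   # was 53; OpenShift's DNS listens on 5353
    protocol: UDP
  - port: 5353   # was 53
    protocol: TCP
```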
> Seems that network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy has to be reviewed. [...] Changed UDP/TCP ports to 5353 and everything came back to life.
Nice find @rimasgo! I verified this works for our deployment as well via kustomize changes against v2.6.2.
```yaml
- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-proxy-network-policy
- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-server-network-policy
```
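If it helps, patches like these would typically sit in a `kustomization.yaml` alongside the pinned HA manifest. A hypothetical wrapper — the resource URL and namespace are assumptions for illustration, and only the first patch is repeated here:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
- https://raw.githubusercontent.com/argoproj/argo-cd/v2.6.2/manifests/ha/install.yaml
patches:
- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-proxy-network-policy
```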
2.6.7 with OKD 4.12.0 (k8s 1.25.0) doesn't seem to work for me either (using this manifest). Similar to @kilian-hu-freiheit, the redis-ha statefulset and deployment pods never spin up. Appears to be a securityContext issue to me but having tried changing a lot of the variables around the securityContext (and granting 'anyuid' to the project) it still doesn't seem to want to boot the redis servers/proxy up.
Luckily, using 2.4.x works.
This fixed the problem for us for upgrading 2.4 -> 2.6
> Seems that network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy has to be reviewed. [...] Changed UDP/TCP ports to 5353 and everything came back to life.
Stopping by to add where my issue with this symptom came from.

It had to do with the Kubernetes networking setup and the HA redis setup's assumption of IPv4 networking. My cluster was configured in dual-stack mode for IPv4 and IPv6. The IPv6 address range was first in the cluster specification, so it is the IP listed in places that don't show all IPs. Effectively, if a Service definition does not specify the IP family, it will be single-family and IPv6. This is a problem for the HA setup because it defaults to all-IPv4 bind addresses in the templated configuration files. Switching them all to IPv6, e.g. `bind ::` for redis and `bind [::]:8888`, `bind [::]:6379` in HAProxy, resolved the issue.

I suspect changing the `ipFamily` in the Service definitions to `IPv4` would also work.
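A sketch of what those HAProxy bind lines might look like after the IPv6 switch — section names are taken from the haproxy alert/warning messages quoted elsewhere in this thread, and the rest of each section is omitted, so treat this as an outline rather than a complete config:

```
listen health_check_http_url
    bind [::]:8888      # was an IPv4-only bind

frontend ft_redis_master
    bind [::]:6379

frontend stats
    bind [::]:9101
```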
Both argocd-redis-ha-server and argocd-redis-ha-haproxy were unable to start in ArgoCD 2.7.10. We were updating from 2.3.12 -> 2.7.10.
Services started after removing the NetworkPolicies `argocd-redis-ha-server-network-policy` and `argocd-redis-ha-proxy-network-policy`. I have not yet inspected further why the NetworkPolicy causes the failure, but there's something wrong with it.

redis-ha-server `config-init` container:
```
Thu Aug 3 14:51:42 UTC 2023 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Thu Aug 3 14:52:57 UTC 2023 Did not find redis master ()
Identify announce ip for this pod..
  using (argocd-redis-ha-announce-0) or (argocd-redis-ha-server-0)
  identified announce ()
/readonly-config/init.sh: line 239: Error: Could not resolve the announce ip for this pod.: not found
```
haproxy `config-init` container:
```
Waiting for service argocd-redis-ha-announce-0 to be ready (1) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (2) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (3) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (4) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (5) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (6) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (7) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (8) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (9) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (10) ...
Could not resolve the announce ip for argocd-redis-ha-announce-0
```
There are indeed 2 issues. The following workaround (this is potentially insecure, but works...) gets the HA redis pods running "fine":
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: redis
    app.kubernetes.io/name: argocd-role-ha-haproxy
    app.kubernetes.io/part-of: argocd
  name: argocd-role-ha-haproxy
  namespace: argocd
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-role-crb
  namespace: argocd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-role-ha-haproxy
subjects:
- kind: ServiceAccount
  name: argocd-redis-ha-haproxy
  namespace: argocd
- kind: ServiceAccount
  name: argocd-redis-ha
  namespace: argocd
```
This is certainly a big issue. I am running Argo CD on EKS 1.24. In my argocd module, network policies do not exist, so I have nothing to delete, and my cluster is purely IPv4, so there is no solution there either. I am running v2.7.6, and the only thing that changed is Kubernetes 1.23 to 1.24; previously it was working fine.
Here is how I solved my version of this issue. Edit: maybe this comment is only relevant for the Helm chart version of Argo CD; however, I leave it here in the hope that it might be useful to somebody.

When using the `argo-cd` Helm chart version `5.51.6` (= Argo CD `2.9.3`) from https://argoproj.github.io/argo-helm with high availability enabled through `values.yaml`:

```yaml
redis-ha:
  enabled: true
```

the `argocd-redis-ha-haproxy-...` pods crash and throw the following errors:

```
[ALERT] (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:9] for proxy health_check_http_url: cannot create receiving socket (Address family not supported by protocol) for [:::8888]
[ALERT] (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:56] for frontend ft_redis_master: cannot create receiving socket (Address family not supported by protocol) for [:::6379]
[ALERT] (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:77] for frontend stats: cannot create receiving socket (Address family not supported by protocol) for [:::9101]
[ALERT] (1) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.
```

I am running a Rancher RKE2 on-premise cluster which has IPv4/IPv6 dual-stack networking enabled. However, it looks like IPv6 was not correctly enabled or is not correctly configured for the cluster. The `argo-cd` Helm chart uses the `redis-ha` subchart (see https://github.com/argoproj/argo-helm/blob/c3c588038daa7c550bbd977c1298a1fd3f42d7c8/charts/argo-cd/Chart.yaml#L20-L23), which itself uses HAProxy configured to bind and consume IPv6 addresses by default, see https://github.com/DandyDeveloper/charts/blob/e12198606457c7281cd60bd1ed41bd8b0a34cd53/charts/redis-ha/values.yaml#L201C13-L203

In my case it worked to disable this setting by supplying the following `values.yaml` to the `argo-cd` Helm chart:

```diff
 redis-ha:
   enabled: true
+  haproxy:
+    IPv6:
+      enabled: false
```
> This is certainly a big issue, I am running argocd on EKS 1.24. [...] Previously it was working fine

Did you find a solution? I'm having the same issues.
We see this as well with 2.7.7
> Here is how I solved my version of this issue. [...] In my case it worked to disable `haproxy.IPv6.enabled` in the `redis-ha` values of the `argo-cd` Helm chart.
It works for me. Thank you.
We are still having issues with the HA setup. We are using v2.10.12+cb6f5ac. If we take down one zone and try to sync in ArgoCD, it is stuck in "waiting to start". No errors are reported in any logs. This is a major issue, because we cannot do anything in our production environment without ArgoCD: we are running on hosted Kubernetes, and our only "admin" access is ArgoCD.
In our case, we had to restart CoreDNS and Cilium agents; after that, the HA worked properly. I hope this helps someone
Possibly related: without `maxconn 4096`, haproxy eats up all available memory and gets OOM-killed; the pod remains in a crash loop.
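For reference, a sketch of where such a limit would live in haproxy.cfg — placing `maxconn` in the `global` section is the usual HAProxy convention, though the chart may template it elsewhere:

```
global
    maxconn 4096    # cap concurrent connections so per-connection memory stays bounded
```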
Checklist:

- [x] I've pasted the output of `argocd version`.

Describe the bug

ArgoCD is unable to deploy correctly with HA. This happens in the namespace of the argocd installation.
To Reproduce
Upgrade from 2.4.6 to 2.5.1 or 2.5.2
Expected behavior
ArgoCD is upgraded/deployed successfully
Version
2.5.2 and 2.5.1 (same issue on both versions)
Logs
ha proxy:
redis ha:
repository server: