argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Unable to deploy ArgoCD with HA #11388

Open Akinorev opened 1 year ago

Akinorev commented 1 year ago

Describe the bug

ArgoCD is unable to deploy correctly with HA. This happens in the namespace of the ArgoCD installation.

To Reproduce

Upgrade from 2.4.6 to 2.5.1 or 2.5.2

Expected behavior

ArgoCD is upgraded/deployed successfully

Version

2.5.2 and 2.5.1 (same issue on both versions)

Logs

ha proxy:

[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:9] for proxy health_check_http_url: cannot create receiving socket (Address family not supported by protocol) for [:::8888]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:56] for frontend ft_redis_master: cannot create receiving socket (Address family not supported by protocol) for [:::6379]
[ALERT]    (1) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

redis ha:

21 Nov 2022 16:22:36.369 # Configuration loaded
21 Nov 2022 16:22:36.370 * monotonic clock: POSIX clock_gettime
21 Nov 2022 16:22:36.377 # Warning: Could not create server TCP listening socket ::*:6379: unable to bind socket, errno: 97
21 Nov 2022 16:22:36.378 * Running mode=standalone, port=6379.
21 Nov 2022 16:22:36.378 # Server initialized
21 Nov 2022 16:22:36.379 * Ready to accept connections

repository server:

time="2022-11-21T16:25:46Z" level=info msg="ArgoCD Repository Server is starting" built="2022-11-07T16:42:47Z" commit=148d8da7a996f6c9f4d102fdd8e688c2ff3fd8c7 port=8081 version=v2.5.2+148d8da
time="2022-11-21T16:25:46Z" level=info msg="Generating self-signed TLS certificate for this session"
time="2022-11-21T16:25:46Z" level=info msg="Initializing GnuPG keyring at /app/config/gpg/keys"
time="2022-11-21T16:25:46Z" level=info msg="gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569" dir= execID=9e8d3
time="2022-11-21T16:25:52Z" level=error msg="`gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569` failed exit status 2" execID=9e8d3
time="2022-11-21T16:25:52Z" level=info msg=Trace args="[gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569]" dir= operation_name="exec gpg" time_ms=6031.865355
time="2022-11-21T16:25:52Z" level=fatal msg="`gpg --no-permission-warning --logger-fd 1 --batch --gen-key /tmp/gpg-key-recipe238040569` failed exit status 2"
makeittotop commented 1 year ago

In my tests, both a vanilla HA installation of v2.5.2 and an upgrade to it (v2.5.1 -> v2.5.2, for example) fail at the redis-ha-server StatefulSet component.

# kubectl get pods
NAME                                                READY   STATUS     RESTARTS      AGE
argocd-redis-ha-haproxy-755db98494-pnkbq            1/1     Running    0             14m
argocd-redis-ha-haproxy-755db98494-q5tmw            1/1     Running    0             14m
argocd-redis-ha-haproxy-755db98494-hjj29            1/1     Running    0             14m
argocd-redis-ha-server-0                            3/3     Running    0             14m
argocd-redis-ha-server-1                            3/3     Running    0             13m
argocd-redis-ha-haproxy-5b8f6b7fdd-7q7gh            0/1     Pending    0             3m7s
argocd-applicationset-controller-57bfc6fdb8-phstq   1/1     Running    0             3m7s
argocd-server-6f4c7b9859-dlln8                      1/1     Running    0             3m6s
argocd-notifications-controller-954b6b785-jwwg8     1/1     Running    0             3m2s
argocd-repo-server-569dc6f989-xgnnw                 1/1     Running    0             3m6s
argocd-dex-server-866c9bdd5b-rxb8x                  1/1     Running    0             3m7s
argocd-server-6f4c7b9859-twn6w                      1/1     Running    0             3m1s
argocd-application-controller-0                     1/1     Running    0             3m2s
argocd-repo-server-569dc6f989-h478x                 1/1     Running    0             2m56s
argocd-redis-ha-server-2                            0/3     Init:0/1   1 (32s ago)   2m4s

# kubectl logs argocd-redis-ha-server-2 -c config-init
Tue Nov 22 04:30:41 UTC 2022 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again

Sounds a little off that the redis-ha-server component is waiting for itself...?

acartag7 commented 1 year ago

I'm having the same issue in my namespaced HA install; it seems similar to a previous problem with Redis and IPv6. After adding `bind 0.0.0.0` to the sentinel and redis.conf configs, the database starts fine, but HAProxy still shows 0 masters available, and argocd-server is also complaining about timeouts against the database.
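For anyone looking for the concrete edit, here is a minimal sketch of what that bind change can look like, assuming the templates live in the argocd-redis-ha-configmap ConfigMap under redis.conf and sentinel.conf keys (names taken from the stock HA manifest; verify against your install, and edit the existing ConfigMap in place so the other keys such as init.sh and haproxy.cfg stay intact):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-ha-configmap   # assumed name from the HA install manifest
  namespace: argocd
data:
  redis.conf: |
    bind 0.0.0.0            # force an explicit IPv4 listen address
    # ... remainder of the shipped redis.conf template unchanged ...
  sentinel.conf: |
    bind 0.0.0.0            # same for sentinel
    # ... remainder of the shipped sentinel.conf template unchanged ...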

ghost commented 1 year ago

I'm also having a similar issue: when using ArgoCD HA v2.5.2, all argocd-redis-ha-haproxy pods go into Init:CrashLoopBackOff. I'm running on a GKE cluster, version 1.23.11-gke.300. Downgrading to ArgoCD HA v2.4.17 fixed it for me. I can provide more information about my setup if useful.

34fathombelow commented 1 year ago

If everyone could please provide a few additional details about your particular cluster setup in your comments:

- Cluster type? E.g. GKE, AWS, Azure, Digital Ocean?
- Which CNI are you using?
- Kubernetes version?
- IP family? IPv4, IPv6, dual stack, or IPv6 disabled?
- Are you using a service mesh?

otherguy commented 1 year ago

Same issue here.

Happening with v2.5.1 and v2.5.2

acartag7 commented 1 year ago

I had the issue in versions v2.5.1 and v2.5.2 and had to roll back to 2.4.6, where it is working fine.

- Cluster type: TKG-based cluster
- CNI: Antrea
- Kubernetes: 1.19.9
- IP family: IPv6 disabled
- Are you using a service mesh: no

34fathombelow commented 1 year ago

I created PR #11418; if you could, please test the HA manifest in a dev environment and provide feedback. It is based on the master branch and is not suitable for production. IPv6-only environments will not be compatible.

I will also conduct testing on my side over the next few days.

Glutamat42 commented 1 year ago

My results:

logs argocd-redis-ha-server-0 -n argocd -c config-init

Sat Nov 26 14:20:03 UTC 2022 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again

logs argocd-redis-ha-haproxy-59b5d8568b-kcvz6 -n argocd -c config-init

Waiting for service argocd-redis-ha-announce-0 to be ready (1) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (2) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (3) ...
...


Most of the time the status of the failing pods is `Init:0/1`.
- With @34fathombelow's manifest (PR #11418): all pods are starting
- v2.5.1: all pods are starting
otherguy commented 1 year ago

I can confirm that this is solved with 2.5.3.

Thank you!

Glutamat42 commented 1 year ago

Can also confirm this is fixed for me with 2.5.3 Thanks :)

acartag7 commented 1 year ago

I tried @34fathombelow's solution. Now the pods are starting, but I still have an issue with Redis:

From redis pods:

1:C 01 Dec 2022 11:07:19.788 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
1:C 01 Dec 2022 11:07:19.788 # Redis version=7.0.5, bits=64, commit=00000000, modified=0, pid=1, just started
1:C 01 Dec 2022 11:07:19.788 # Configuration loaded
1:M 01 Dec 2022 11:07:19.789 * monotonic clock: POSIX clock_gettime
1:M 01 Dec 2022 11:07:19.792 # Warning: Could not create server TCP listening socket :::6379: unable to bind socket, errno: 97
1:M 01 Dec 2022 11:07:19.793 * Running mode=standalone, port=6379.
1:M 01 Dec 2022 11:07:19.793 # Server initialized
1:M 01 Dec 2022 11:07:19.794 * Ready to accept connections

The HAProxy pods start failing but eventually come up:

[WARNING] (7) : Server bk_redis_master/R0 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (7) : Server bk_redis_master/R1 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[WARNING] (7) : Server bk_redis_master/R2 is DOWN, reason: Layer4 timeout, info: " at step 1 of tcp-check (connect)", check duration: 3001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
[ALERT] (7) : backend 'bk_redis_master' has no server available!
[WARNING] (7) : Server bk_redis_master/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 7ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R0 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 1 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R1 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.
[WARNING] (7) : Server check_if_redis_is_master_0/R2 is UP, reason: Layer7 check passed, code: 0, info: "(tcp-check)", check duration: 3ms. 3 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

argocd-server has the following errors all the time:

redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF
redis: 2022/12/01 11:07:35 pubsub.go:159: redis: discarding bad PubSub connection: EOF

kelly-brown commented 1 year ago

I just found this issue. I'm trying to upgrade from 2.4.17 to 2.5.5 and running into the original error. Should I just follow this issue and check back when I see it closed, or do you need help testing/validating the fix?

Thanks!

rumstead commented 1 year ago

https://github.com/argoproj/argo-cd/issues/5957 feels related. We also see the same issue with an IPv4 cluster on a TKG cluster.

EDIT: Confirmed, adding bind 0.0.0.0 to redis and sentinel fixed the issue.

FrittenToni commented 1 year ago

Hi @crenshaw-dev,

I just wanted to report that we're still facing the issue with version 2.5.6 and ha setup. We just upgraded our argo dev instance from v2.4.8 to 2.5.6 via kubectl apply -n argocd-dev -f https://raw.githubusercontent.com/argoproj/argo-cd/v2.5.6/manifests/ha/install.yaml and now our argocd-redis-ha-server-0 pod is no longer coming up due to:

Tue Jan 17 09:05:44 UTC 2023 Start...
Initializing config..
Copying default redis config..
to '/data/conf/redis.conf'
Copying default sentinel config..
to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Tue Jan 17 09:06:59 UTC 2023 Did not find redis master ()
Identify announce ip for this pod..
using (argocd-redis-ha-announce-0) or (argocd-redis-ha-server-0)
identified announce ()
/readonly-config/init.sh: line 239: Error: Could not resolve the announce ip for this pod.: not found
Stream closed EOF for argocd-dev/argocd-redis-ha-server-0 (config-init)

seanmmills commented 1 year ago

I am also experiencing the same issue @FrittenToni describes above. argocd-redis-ha-server starts up fine in 2.4.19, but fails on 2.5.5, 2.5.6, and 2.5.7.

jas01 commented 1 year ago

Same problem with 2.5.10 on OKD 4.12. The argocd-redis-ha-server starts up fine in 2.4.19 but fails on 2.5.10.

otherguy commented 1 year ago

Same here. The only 2.5.x version that's working is v2.5.3+0c7de21.

johnoct-au commented 1 year ago

Same here. Failing on 2.5.6, 2.5.10, and 2.6.1 deployments.

otherguy commented 1 year ago

Did someone try 2.6.2?

jas01 commented 1 year ago

Did someone try 2.6.2?

Just did, same result.

pod/argocd-redis-ha-haproxy-c85b7ffd6-kh56p             0/1     Init:CrashLoopBackOff   18 (4m59s ago)   110m
pod/argocd-redis-ha-haproxy-c85b7ffd6-lsbmj             0/1     Init:0/1                19 (5m21s ago)   110m
pod/argocd-redis-ha-haproxy-c85b7ffd6-qktcv             0/1     Init:0/1                19 (5m9s ago)    110m
pod/argocd-redis-ha-server-0                            0/3     Init:CrashLoopBackOff   20 (3m39s ago)   110m
johnoct-au commented 1 year ago

Not sure if this was anyone else's problem, but for my specific issue I was scaling argocd-redis-ha from 3 to 5 replicas, while the chart only deploys 3 argocd-redis-ha-announce services, so I had to deploy two additional ones (see the sketch below).
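For reference, a sketch of what one of the extra announce Services could look like (here for replica index 3). The labels, selector, and ports are assumptions modeled on the existing argocd-redis-ha-announce-0..2 Services in the HA manifest, so copy one of those and bump the index rather than applying this verbatim:

apiVersion: v1
kind: Service
metadata:
  name: argocd-redis-ha-announce-3        # one extra announce Service per added replica
  namespace: argocd
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  type: ClusterIP
  publishNotReadyAddresses: true           # sentinels must reach pods before they are Ready
  selector:
    app.kubernetes.io/name: argocd-redis-ha               # assumed label; match your install
    statefulset.kubernetes.io/pod-name: argocd-redis-ha-server-3
  ports:
  - name: tcp-server
    port: 6379
    protocol: TCP
    targetPort: redis
  - name: tcp-sentinel
    port: 26379
    protocol: TCP
    targetPort: sentinel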

rimasgo commented 1 year ago

I noticed that this issue appeared when we upgraded our cluster to k8s version v1.23

getent hosts cannot resolve anything in the cluster.local domain:

$ time oc exec argocd-redis-ha-server-0 -c config-init -- getent hosts argocd-redis-ha
command terminated with exit code 2

real    0m10.273s
user    0m0.121s
sys     0m0.036s

$ time oc exec argocd-application-controller-0 -- getent hosts argocd-redis-ha
172.30.122.223  argocd-redis-ha.argocd.svc.cluster.local

real    0m0.273s
user    0m0.120s
sys     0m0.040s
rimasgo commented 1 year ago

It seems that the network policies argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy have to be reviewed. After deleting both policies, everything started to work.

I have checked that no other network policy defines ports for DNS; only the above two have port 53 defined, which is incorrect (for OpenShift). Changing the UDP/TCP ports to 5353 brought everything back to life.

seanmmills commented 1 year ago

Nice find @rimasgo! I verified this works for our deployment as well via kustomize changes against v2.6.2.

- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-proxy-network-policy

- patch: |-
    - op: replace
      path: /spec/egress/1/ports/0/port
      value: 5353
    - op: replace
      path: /spec/egress/1/ports/1/port
      value: 5353
  target:
    kind: NetworkPolicy
    name: argocd-redis-ha-server-network-policy
sc0ttes commented 1 year ago

2.6.7 with OKD 4.12.0 (k8s 1.25.0) doesn't seem to work for me either (using this manifest). Similar to @kilian-hu-freiheit, the redis-ha StatefulSet and Deployment pods never spin up. It appears to be a securityContext issue to me, but having tried changing a lot of the securityContext variables (and granting 'anyuid' to the project), it still doesn't want to boot the redis servers/proxy.

Luckily, using 2.4.x works.

yasargil commented 1 year ago

This fixed the problem for us when upgrading 2.4 -> 2.6: changing the DNS ports to 5353 in the argocd-redis-ha-proxy-network-policy and argocd-redis-ha-server-network-policy NetworkPolicies, as @rimasgo and @seanmmills describe above.

cehoffman commented 1 year ago

Stopping by to add where my issue with this symptom came from.

It had to do with the Kubernetes networking setup and the HA Redis setup's assumption of IPv4 networking. My cluster was configured in dual-stack mode for IPv4 and IPv6. The IPv6 address range was first in the cluster specification, so it is the IP listed in places that don't show all IPs. Effectively, if a Service definition does not specify the IP family, it will be single-family and IPv6. This is a problem for the HA setup because it defaults to all-IPv4 bind addresses in the templated configuration files. Switching them all to IPv6, e.g. `bind ::` for redis and `bind [::]:8888`, `bind [::]:6379` in HAProxy, resolved the issue.

I suspect changing the ipFamily in the Service definitions to IPv4 would also work; a sketch of that follows below.
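A minimal sketch of that alternative as a kustomize patch, in the same style as the NetworkPolicy patches earlier in this thread (untested; assumes a dual-stack cluster and the stock argocd-redis-ha Service name, and changing the IP family may require recreating the Service):

- patch: |-
    # pin the redis-ha Service to IPv4 so clients and binds agree on the address family
    - op: add
      path: /spec/ipFamilyPolicy
      value: SingleStack
    - op: add
      path: /spec/ipFamilies
      value:
        - IPv4
  target:
    kind: Service
    name: argocd-redis-ha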

pre commented 1 year ago

Both argocd-redis-ha-server and argocd-redis-ha-haproxy were unable to start in ArgoCD 2.7.10. We were updating from 2.3.12 -> 2.7.10.

Services started after removing the NetworkPolicies argocd-redis-ha-server-network-policy and argocd-redis-ha-proxy-network-policy. I have not yet inspected further why the NetworkPolicies cause the failure, but there is something wrong with them.

redis-ha-server config-init container:

Thu Aug  3 14:51:42 UTC 2023 Start...
Initializing config..
Copying default redis config..
  to '/data/conf/redis.conf'
Copying default sentinel config..
  to '/data/conf/sentinel.conf'
Identifying redis master (get-master-addr-by-name)..
  using sentinel (argocd-redis-ha), sentinel group name (argocd)
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
Could not connect to Redis at argocd-redis-ha:26379: Try again
  Thu Aug  3 14:52:57 UTC 2023 Did not find redis master ()
Identify announce ip for this pod..
  using (argocd-redis-ha-announce-0) or (argocd-redis-ha-server-0)
  identified announce ()
/readonly-config/init.sh: line 239: Error: Could not resolve the announce ip for this pod.: not found

haproxy config-init container:

Waiting for service argocd-redis-ha-announce-0 to be ready (1) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (2) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (3) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (4) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (5) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (6) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (7) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (8) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (9) ...
Waiting for service argocd-redis-ha-announce-0 to be ready (10) ...
Could not resolve the announce ip for argocd-redis-ha-announce-0
dmpe commented 1 year ago

There are indeed 2 issues. As a workaround (potentially insecure, but it works...), granting the privileged SecurityContextConstraint to the redis-ha service accounts on OpenShift gets the HA redis pods running "fine":

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  labels:
    app.kubernetes.io/component: redis
    app.kubernetes.io/name: argocd-role-ha-haproxy
    app.kubernetes.io/part-of: argocd
  name: argocd-role-ha-haproxy
  namespace: argocd
rules:
- apiGroups:
  - security.openshift.io
  resourceNames:
  - privileged
  resources:
  - securitycontextconstraints
  verbs:
  - use
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: argocd-role-crb
  namespace: argocd
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: argocd-role-ha-haproxy
subjects:
- kind: ServiceAccount
  name: argocd-redis-ha-haproxy
  namespace: argocd
- kind: ServiceAccount
  name: argocd-redis-ha
  namespace: argocd
adjain131995 commented 1 year ago

This is certainly a big issue. I am running Argo CD v2.7.6 on EKS 1.24. In my Argo CD module the network policies do not exist, so I have nothing to delete, and my cluster is purely IPv4, so there is no solution there either. The only thing that changed was the Kubernetes upgrade from 1.23 to 1.24; previously it was working fine.

julian-waibel commented 11 months ago

Here is how I solved my version of this issue. Edit: maybe this comment is only relevant for the Helm chart version of Argo CD; however, I leave it here in the hope that it might be useful to somebody.

Issue

When using the argo-cd Helm chart version 5.51.6 (= Argo CD 2.9.3) from https://argoproj.github.io/argo-helm with high availability enabled through values.yaml:

redis-ha:
  enabled: true

the argocd-redis-ha-haproxy-... pods crash and throw the following errors:

[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:9] for proxy health_check_http_url: cannot create receiving socket (Address family not supported by protocol) for [:::8888]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:56] for frontend ft_redis_master: cannot create receiving socket (Address family not supported by protocol) for [:::6379]
[ALERT]    (1) : Binding [/usr/local/etc/haproxy/haproxy.cfg:77] for frontend stats: cannot create receiving socket (Address family not supported by protocol) for [:::9101]
[ALERT]    (1) : [haproxy.main()] Some protocols failed to start their listeners! Exiting.

Cause and solution

I am running a Rancher RKE2 on-premise cluster which has IPv4/IPv6 dual-stack networking enabled. However, it looks like IPv6 was not correctly enabled or configured for the cluster. The argo-cd Helm chart uses the redis-ha subchart (see https://github.com/argoproj/argo-helm/blob/c3c588038daa7c550bbd977c1298a1fd3f42d7c8/charts/argo-cd/Chart.yaml#L20-L23), which itself configures HAProxy to bind and consume IPv6 addresses by default; see https://github.com/DandyDeveloper/charts/blob/e12198606457c7281cd60bd1ed41bd8b0a34cd53/charts/redis-ha/values.yaml#L201C13-L203

In my case it worked to disable this setting by supplying the following values.yaml to the argo-cd Helm chart:

redis-ha:
  enabled: true
+ haproxy:
+   IPv6:
+     enabled: false
saintmalik commented 7 months ago

@adjain131995 did you find a solution? I'm having the same issue.

mjnovice commented 7 months ago

We see this as well with 2.7.7

1ocate commented 5 months ago

The fix described by @julian-waibel above (disabling haproxy IPv6 in the redis-ha Helm values) works for me. Thank you.

Casper-dss commented 3 months ago

We are still having issues with the HA setup. We are using v2.10.12+cb6f5ac. If we close one zone and try to sync in ArgoCD, it is stuck in "waiting to start". No errors are reported in any logs. This is a major issue: we cannot do anything in our production environment without ArgoCD, because we are running on hosted Kubernetes and our only "admin" access is ArgoCD.

ML-std commented 3 months ago

In our case, we had to restart CoreDNS and Cilium agents; after that, the HA worked properly. I hope this helps someone

pre commented 1 month ago

Possibly related: without `maxconn 4096`, HAProxy eats up all available memory and gets OOM-killed, and the pod remains in a crash loop.
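For anyone hitting this, a minimal sketch of where such a limit would live, assuming the HAProxy configuration is kept in the argocd-redis-ha-configmap ConfigMap under a haproxy.cfg key (an assumption based on the redis-ha chart layout; edit the existing ConfigMap rather than applying this fragment, so the other keys are preserved):

apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-redis-ha-configmap   # assumed name; check your install
  namespace: argocd
data:
  haproxy.cfg: |
    global
      maxconn 4096        # cap concurrent connections so memory stays bounded
    # ... remainder of the shipped haproxy.cfg unchanged ...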