humblebundledore opened this issue 1 year ago
Cool, do you want to contribute this?
Sure, I will take care of it; I just need a bit of free time to write the code. You can assign it to me.
I started to look for possible changes and I have some doubts (mostly due to my lack of experience with Cortex).
In order to expose port 9094 correctly, we need to adjust the ports section of the statefulset:
https://github.com/cortexproject/cortex-helm-chart/blob/master/templates/alertmanager/alertmanager-statefulset.yaml#L174
ports:
  - name: http-metrics
    containerPort: {{ .Values.config.server.http_listen_port }}
    protocol: TCP
  - name: gossip
    containerPort: {{ .Values.config.memberlist.bind_port }}
    protocol: TCP
If we compare to the cortex-jsonnet generated YAML, I have the following questions:
ports:
  - containerPort: 80
    name: http-metrics
  - containerPort: 9095
    name: grpc
  - containerPort: 9094
    name: gossip-udp
    protocol: UDP
  - containerPort: 9094
    name: gossip-tcp
#1 - Why, by default, do we gossip with other Cortex microservices using {{ .Values.config.memberlist.bind_port }}?
#2 - Why is there no gRPC port exposed? For example, ingesters do have gRPC exposed:
https://github.com/cortexproject/cortex-helm-chart/blob/master/templates/ingester/ingester-statefulset.yaml#L124
I will obviously try / break / tweak things locally :), but if there is some explanation somewhere of how things are supposed to communicate, I am very interested to know about it.
#1 - Why, by default, do we gossip with other Cortex microservices using {{ .Values.config.memberlist.bind_port }}?
Memberlist is enabled by default.
#2 - Why is there no gRPC port exposed? For example, ingesters do have gRPC exposed
Just missing. No real reason I guess. See explanation below.
I will obviously try / break / tweak things locally :), but if there is some explanation somewhere of how things are supposed to communicate, I am very interested to know about it.
Yeah, about that, it's confusing. Technically it's not really relevant what you edit in the ports section of a Deployment/STS/RS. From the docs:
Exposing a port here gives the system additional information about the network connections a container uses, but is primarily informational. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Cannot be updated.
To implement this, all you really have to do is: add the 2 flags for alertmanager cluster, and add the relevant config in Cortex.
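To illustrate the quoted Kubernetes behavior, here is a hypothetical sketch (the names are made up, not from this chart): a Service can route traffic to a port even when the Pod template declares no matching containerPort, because the ports list is primarily informational.

```yaml
# Hypothetical demo manifest (not part of cortex-helm-chart).
# This Service routes traffic to 9094 even if the selected Pods
# declare no matching containerPort: declaring ports is informational,
# and anything listening on 0.0.0.0 in the container is reachable.
apiVersion: v1
kind: Service
metadata:
  name: gossip-demo
spec:
  selector:
    app: gossip-demo
  ports:
    - name: gossip
      port: 9094
      targetPort: 9094
```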
The information provided is very useful and interesting, thanks as always @nschad.
To implement this, all you really have to do is: add the 2 flags for alertmanager cluster, and add the relevant config in Cortex.
It does not seem that alertmanager needs more flags in order to enable cluster mode; in fact, cluster mode seems to be enabled by default.
First config without cluster key + no statefulset
# helm values
config:
  alertmanager_storage:
    backend: local
    local:
      path: /data
  alertmanager:
    enable_api: true
    data_dir: /data
# k logs pods/cortex-alertmanager-5f9c44778b-745t4 -n cortex-base -c alertmanager | grep "server.go\|cluster.go"
level=info caller=server.go:260 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=debug caller=cluster.go:173 component=cluster msg="resolved peers to following addresses" peers=
level=info caller=cluster.go:185 component=cluster msg="setting advertise address explicitly" addr=172.17.0.14 port=9094
level=debug caller=cluster.go:265 component=cluster msg="joined cluster" peers=0
level=info caller=cluster.go:680 component=cluster msg="Waiting for gossip to settle..." interval=200ms
level=info caller=cluster.go:705 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=200.369737ms
level=debug caller=cluster.go:702 component=cluster msg="gossip looks settled" elapsed=401.026841ms
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    advertise_address: ""
    peers: ""
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
Second config with cluster key + statefulset
# helm values
config:
  alertmanager_storage:
    backend: local
    local:
      path: /data
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      listen_address: '0.0.0.0:9094'
      peers: 'cortex-alertmanager-headless.cortex-base.svc.cluster.local.:9094'
# k logs pods/cortex-alertmanager-0 -n cortex-base -c alertmanager | grep "server.go\|cluster.go"
level=info caller=server.go:260 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=debug caller=cluster.go:173 component=cluster msg="resolved peers to following addresses" peers=172.17.0.14:9094
level=info caller=cluster.go:185 component=cluster msg="setting advertise address explicitly" addr=172.17.0.14 port=9094
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:41:40 [DEBUG] memberlist: Initiating push/pull sync with: 172.17.0.14:9094\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:41:40 [DEBUG] memberlist: Stream connection from=172.17.0.14:37102\n"
level=debug caller=cluster.go:265 component=cluster msg="joined cluster" peers=1
level=info caller=cluster.go:680 component=cluster msg="Waiting for gossip to settle..." interval=200ms
. . .
level=info caller=cluster.go:697 component=cluster msg="gossip settled; proceeding" elapsed=1.001801491s
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:42:58 [DEBUG] memberlist: Stream connection from=172.17.0.1:48172\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:43:05 [DEBUG] memberlist: Initiating push/pull sync with: 01GNEYGESEPJYDCXVF9WPNTXYX 172.17.0.3:9094\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:43:09 [DEBUG] memberlist: Stream connection from=172.17.0.1:42788\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:44:05 [DEBUG] memberlist: Initiating push/pull sync with: 01GNEYGS0MGJMB1403VDKVJQ70 172.17.0.23:9094\n"
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    advertise_address: ""
    peers: cortex-alertmanager-headless.cortex-base.svc.cluster.local.:9094
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/multitenant_alertmanager/status
Name | Addr
-- | --
01GNEYE2MM174KCJ09VEM39238 | 172.17.0.14
01GNEYGESEPJYDCXVF9WPNTXYX | 172.17.0.3
01GNEYGS0MGJMB1403VDKVJQ70 | 172.17.0.23
Am I missing something here?
@AlexandreRoux
Uhm, I don't know :shrug:. Sorry.
Also reading this
Important: Do not load balance traffic between Prometheus and its Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.
I'm not really sure how to proceed...
However, I think the missing piece might be the peers: directive? Are you sure that cluster HA is enabled in your first example? The output at least looks different.
@nschad - I am the same, not too sure how to proceed... I am thinking of following up on the #cortex or #prometheus Slack channels to dive further with the help of the community.
In an attempt to answer the questions, let's compare the HA documentation with some observations:
Alertmanager's high availability is in production use at many companies and is enabled by default
--cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; empty string disables HA mode)
Let's see what happens when we don't use the config.alertmanager.cluster key:
# - - - - - - - - - - #
# - configuration #
# - - - - - - - - - - #
config:
  alertmanager:
    enable_api: true
    data_dir: /data
alertmanager:
  statefulSet:
    enabled: true
  replicas: 1
# - - - - - - - - - - #
# - result #
# - - - - - - - - - - #
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    advertise_address: ""
    peers: ""
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
# http://localhost:8080/multitenant_alertmanager/status
# increasing replicas will increase the peers here
01GNHDW3RMM64BXV76SPDG561K | 172.17.0.14
level=debug caller=cluster.go:173 component=cluster msg="resolved peers to following addresses" peers=172.17.0.14:9094
level=info caller=cluster.go:185 component=cluster msg="setting advertise address explicitly" addr=172.17.0.14 port=9094
L214 and L255 of Alertmanager's main.go imply that the HA cluster will be enabled by default unless an empty string is passed to cluster.listen-address. This is in line with the documentation and with the values from /config.
The log lines then show cluster.go as the caller, implying that alertmanager is running in cluster mode.
Important: Do not load balance traffic between Prometheus and its Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.
This is very correct, and I believe that passing the list of alertmanager pods is the right way, BUT it does not matter with regard to which mode alertmanager is running in.
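As an illustration, here is a hedged sketch of pointing the Cortex ruler at every alertmanager replica rather than at a load-balanced Service. The alertmanager_url field is from my reading of the Cortex ruler config, and the pod DNS names are assumptions based on this thread's naming, so please verify both before relying on this:

```yaml
# helm values sketch (assumption: the ruler config accepts a
# comma-separated list of alertmanager URLs in alertmanager_url)
config:
  ruler:
    # list every replica explicitly so each alert reaches all peers
    alertmanager_url: 'http://cortex-alertmanager-0.cortex-alertmanager-headless.cortex-base.svc.cluster.local:8080/api/prom/alertmanager,http://cortex-alertmanager-1.cortex-alertmanager-headless.cortex-base.svc.cluster.local:8080/api/prom/alertmanager'
```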
If running Alertmanager in high availability mode is not desired, setting --cluster.listen-address= prevents Alertmanager from listening to incoming peer requests.
Let's see what happens when we use the config.alertmanager.cluster key with an empty string:
# - - - - - - - - - - #
# - configuration #
# - - - - - - - - - - #
config:
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      listen_address: ''
alertmanager:
  statefulSet:
    enabled: true
  replicas: 3
# - - - - - - - - - - #
# - result #
# - - - - - - - - - - #
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: ""
    advertise_address: ""
    peers: ""
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
# http://localhost:8080/multitenant_alertmanager/status
Alertmanager gossip-based clustering is disabled.
level=info ts=2022-12-30T11:27:02.130479807Z caller=server.go:260 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
I was able to notice that cluster.go is completely absent from the pod logs, and /multitenant_alertmanager/status states that gossip-based clustering is disabled. This again seems to match the alertmanager documentation and code.
In conclusion, this leads me to believe the following:
- config.alertmanager.cluster.listen_address="" needs to be set to disable the HA cluster alertmanager
- config.alertmanager.cluster.peers should be a list of pods

@AlexandreRoux Thank you for your work.
Okay this is my take.
So basically alertmanager was, and is, already running in "cluster" mode. However, it does not know what its peers are, hence the occasional EOF.
When I configured the alertmanager peers via a headless service, the members do show up in the /multitenant_alertmanager/status dashboard when you port-forward one alertmanager.
This does not happen when you do not configure it. It's likely that you now have 4 separate "alertmanager" clusters when using replicas: 4.
So to me it seems that all we have to do is configure the peers?
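If so, a minimal sketch of what that could look like in the helm values, assuming the headless service name used earlier in this thread (verify the service name and namespace against your deployment):

```yaml
# helm values sketch: join all replicas into one gossip cluster
config:
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      # every replica resolves the headless service and joins its peers
      peers: 'cortex-alertmanager-headless.cortex-base.svc.cluster.local:9094'
alertmanager:
  statefulSet:
    enabled: true
  replicas: 3
```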
I am sorry for the delay in coming back to this one...
2# - why there is no grpc port exposed ? For example ingesters do have grpc exposed
Just missing. No real reason I guess. See explanation below.
I have pushed https://github.com/cortexproject/cortex-helm-chart/pull/435 in order to correct the headless service :)
So basically alertmanager was, and is, already running in "cluster" mode. However, it does not know what its peers are, hence the occasional EOF.
It's probably likely that you now have 4 "alertmanager" clusters when using a replicas: 4
I am 90% sure I was able to catch EOF while having all my alertmanagers listed under /multitenant_alertmanager/status, but as time has passed and improvements were made overall, I am going to run more tests to see if the EOF still exists for me.
I hope to post here again quickly and close the thread.
I am (unfortunately) able to reproduce EOF while having a correct cluster setup:
# - - - - - - - - - - #
# - configuration #
# - - - - - - - - - - #
config:
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      peers: 'cortex-base-alertmanager-headless.cortex-base.svc.cluster.local.:9094'
  alertmanager_storage:
    backend: "local"
    local:
      path: /data
alertmanager:
  statefulSet:
    enabled: true
  replicas: 2
  extraArgs:
    experimental.alertmanager.enable-api: true
    log.level: error
# - - - - - - - - - - #
# - result #
# - - - - - - - - - - #
# k port-forward service/cortex-base-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/multitenant_alertmanager/status
Members
Name | Addr
-- | --
01GSF8DYT997R63Y5SYXEB2MT0 | 10.233.70.132
01GSF8DHW853BHJP2QSFDJ5N9C | 10.233.68.121
# k logs -f -n cortex-base -l app.kubernetes.io/component=ruler -c rules
level=error ts=2023-02-17T08:30:02.490627604Z caller=notifier.go:534 user=tenant-production-1 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-17T14:17:12.402036643Z caller=notifier.go:534 user=tenant-production-2 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-17T17:44:02.474755591Z caller=notifier.go:534 user=tenant-production-1 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-19T03:38:37.988803309Z caller=notifier.go:534 user=tenant-production-2 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
Here I am running with PR https://github.com/cortexproject/cortex-helm-chart/pull/435 to get my headless service corrected:
# k get services -n cortex-base | grep headless
cortex-base-alertmanager-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-distributor-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-ingester-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-query-frontend-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-store-gateway-headless ClusterIP None <none> 9095/TCP 47d
Yeah about that, it's confusing. It actually not really relevant technically what you edit in the Port section of a deployment/STS/RS. From the docs:
Exposing a port here gives the system additional information about the network connections a container uses, but is primarily informational. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Cannot be updated.
Even though I really agree here, I am going to do more tests by tweaking the ports section, as friedrichg got his issue fixed this way:
I discovered my pod wasn't exposing 9094 tcp port correctly. There is a long standing open kubernetes bug that occurs when there is a port using udp and tcp in the same pod. https://github.com/kubernetes/kubernetes/issues/39188
I solved the problem deleting the statefulset and recreating it for alertmanager
I am also going to try replacing config.alertmanager.cluster.peers with a list of pod addresses OR IPs instead of the k8s headless service for alertmanager.
I will keep you posted.
I have run further tests and I am (unfortunately, again) able to reproduce EOF. Here are my changes / results:
#1 - Fix the alertmanager ports section by correctly exposing 9094 for cluster gossip over TCP + UDP
# cortex/templates/alertmanager/alertmanager-statefulset.yaml
ports:
  - name: http-metrics
    containerPort: {{ .Values.config.server.http_listen_port }}
    protocol: TCP
  - name: gossip
    containerPort: {{ .Values.config.memberlist.bind_port }}
    protocol: TCP
  - name: grpc
    containerPort: {{ .Values.config.server.grpc_listen_port }}
    protocol: TCP
  - containerPort: 9094
    name: gossip-clu-tcp
    protocol: TCP
  - containerPort: 9094
    name: gossip-clu-udp
    protocol: UDP
# k describe pods/cortex-base-alertmanager-0 -n cortex-base
alertmanager:
Ports: 8080/TCP, 7946/TCP, 9095/TCP, 9094/TCP, 9094/UDP
#2 - Change the value of config.alertmanager.cluster.peers from the headless service to a list of pod addresses
# values.yaml
alertmanager:
  enable_api: true
  data_dir: /data
  cluster:
    peers: 'cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094'
I came back 24 hours after the changes, and by checking my ruler container logs (level: error) I can still find EOF:
# k logs -f -n cortex-base -l app.kubernetes.io/component=ruler -c rules
level=error ts=2023-02-21T16:31:12.207574825Z caller=notifier.go:534 user=tenant-production-1 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=2 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-22T03:12:38.725425758Z caller=notifier.go:534 user=tenant-production-2 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=2 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
# k logs -f -n cortex-base -l app.kubernetes.io/component=alertmanager -c alertmanager
It is worth mentioning that the alertmanager logs are empty, most likely because the EOF is a result of the ruler closing the connection before the alertmanager idle connection timeout (5 min). This has been discussed in: https://github.com/cortexproject/cortex/issues/4958#issuecomment-1309398482
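For completeness, one mitigation sometimes suggested for idle-connection EOFs is tuning the HTTP server idle timeout so that it outlives the client's connection reuse. A hedged sketch, assuming the Cortex server block exposes http_server_idle_timeout (as in the weaveworks/common server config); whether this field is available in this chart's Cortex version, and whether it helps here, is untested:

```yaml
# helm values sketch (assumption: http_server_idle_timeout is
# honored by the server block of this Cortex version)
config:
  server:
    # keep idle ruler -> alertmanager connections open longer
    http_server_idle_timeout: 10m
```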
I will edit my PR to include the correction of the ports section with 9094 if we are interested in having it in the helm chart, but in terms of troubleshooting / fixing the EOF, I am afraid that I do not have any further ideas...
My understanding is that alertmanager uses port 9094 to communicate between peers in cluster mode, but master/templates/alertmanager has no reference to exposing this port.
Configuring alertmanager in cluster mode can be done via:
The above configuration seems to be working:
However, I believe we should correctly expose 9094, the same as is done via cortex-jsonnet (https://github.com/cortexproject/cortex-jsonnet):
In addition, I raised an issue in the Cortex project for a known EOF error in the ruler when sending alerts to alertmanager, and the outcome seems to be correctly exposing 9094 via the statefulset: https://github.com/cortexproject/cortex/issues/4958