humblebundledore opened this issue 1 year ago
Cool, do you want to contribute this?
Sure, I will take care of it; I just need a bit of free time to write the code. You can assign it to me.
I started to look for possible changes and I have some doubts (mostly due to my lack of experience with Cortex).
In order to expose port 9094 correctly, we need to adjust the ports section of the statefulset:
https://github.com/cortexproject/cortex-helm-chart/blob/master/templates/alertmanager/alertmanager-statefulset.yaml#L174
ports:
  - name: http-metrics
    containerPort: {{ .Values.config.server.http_listen_port }}
    protocol: TCP
  - name: gossip
    containerPort: {{ .Values.config.memberlist.bind_port }}
    protocol: TCP
If we compare to the cortex-jsonnet generated YAML, I have the following questions:
ports:
  - containerPort: 80
    name: http-metrics
  - containerPort: 9095
    name: grpc
  - containerPort: 9094
    name: gossip-udp
    protocol: UDP
  - containerPort: 9094
    name: gossip-tcp
#1 - Why, by default, do we gossip with other Cortex microservices using {{ .Values.config.memberlist.bind_port }}?
#2 - Why is there no gRPC port exposed? For example, ingesters do have gRPC exposed:
https://github.com/cortexproject/cortex-helm-chart/blob/master/templates/ingester/ingester-statefulset.yaml#L124
I will obviously try / break / tweak things locally :), but if there is some explanation somewhere of how things are supposed to communicate, I am very interested to know about it.
#1 - Why, by default, do we gossip with other Cortex microservices using {{ .Values.config.memberlist.bind_port }}?
Memberlist is enabled by default.
#2 - Why is there no gRPC port exposed? For example, ingesters do have gRPC exposed
Just missing. No real reason I guess. See explanation below.
I will obviously try / break / tweak things locally :), but if there is some explanation somewhere of how things are supposed to communicate, I am very interested to know about it.
Yeah, about that, it's confusing. Technically it's not really relevant what you edit in the ports section of a Deployment/STS/RS. From the docs:
Exposing a port here gives the system additional information about the network connections a container uses, but is primarily informational. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Cannot be updated.
To implement this, all you really have to do is: add the 2 flags for alertmanager cluster, and add the relevant config in Cortex.
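To illustrate the quoted Kubernetes behavior, here is a hypothetical sketch (the names are made up, not from this chart): a Service can route traffic to a port even when the Pod template declares no matching containerPort, because the ports list is primarily informational.

```yaml
# Hypothetical demo manifest (not part of cortex-helm-chart).
# This Service routes traffic to 9094 even if the selected Pods
# declare no matching containerPort: declaring ports is informational,
# and anything listening on 0.0.0.0 in the container is reachable.
apiVersion: v1
kind: Service
metadata:
  name: gossip-demo
spec:
  selector:
    app: gossip-demo
  ports:
    - name: gossip
      port: 9094
      targetPort: 9094
```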
The information provided is very useful and interesting, thanks as always @nschad.
To implement this, all you really have to do is: add the 2 flags for alertmanager cluster, and add the relevant config in Cortex.
It does not seem that alertmanager needs more flags in order to enable cluster mode; in fact, cluster mode seems to be enabled by default.
First config without cluster key + no statefulset
# helm values
config:
  alertmanager_storage:
    backend: local
    local:
      path: /data
  alertmanager:
    enable_api: true
    data_dir: /data
# k logs pods/cortex-alertmanager-5f9c44778b-745t4 -n cortex-base -c alertmanager | grep "server.go\|cluster.go"
level=info caller=server.go:260 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=debug caller=cluster.go:173 component=cluster msg="resolved peers to following addresses" peers=
level=info caller=cluster.go:185 component=cluster msg="setting advertise address explicitly" addr=172.17.0.14 port=9094
level=debug caller=cluster.go:265 component=cluster msg="joined cluster" peers=0
level=info caller=cluster.go:680 component=cluster msg="Waiting for gossip to settle..." interval=200ms
level=info caller=cluster.go:705 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=200.369737ms
level=debug caller=cluster.go:702 component=cluster msg="gossip looks settled" elapsed=401.026841ms
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    advertise_address: ""
    peers: ""
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
Second config with cluster key + statefulset
# helm values
config:
  alertmanager_storage:
    backend: local
    local:
      path: /data
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      listen_address: '0.0.0.0:9094'
      peers: 'cortex-alertmanager-headless.cortex-base.svc.cluster.local.:9094'
# k logs pods/cortex-alertmanager-0 -n cortex-base -c alertmanager | grep "server.go\|cluster.go"
level=info caller=server.go:260 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=debug caller=cluster.go:173 component=cluster msg="resolved peers to following addresses" peers=172.17.0.14:9094
level=info caller=cluster.go:185 component=cluster msg="setting advertise address explicitly" addr=172.17.0.14 port=9094
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:41:40 [DEBUG] memberlist: Initiating push/pull sync with: 172.17.0.14:9094\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:41:40 [DEBUG] memberlist: Stream connection from=172.17.0.14:37102\n"
level=debug caller=cluster.go:265 component=cluster msg="joined cluster" peers=1
level=info caller=cluster.go:680 component=cluster msg="Waiting for gossip to settle..." interval=200ms
. . .
level=info caller=cluster.go:697 component=cluster msg="gossip settled; proceeding" elapsed=1.001801491s
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:42:58 [DEBUG] memberlist: Stream connection from=172.17.0.1:48172\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:43:05 [DEBUG] memberlist: Initiating push/pull sync with: 01GNEYGESEPJYDCXVF9WPNTXYX 172.17.0.3:9094\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:43:09 [DEBUG] memberlist: Stream connection from=172.17.0.1:42788\n"
level=debug caller=cluster.go:338 component=cluster memberlist="2022/12/29 12:44:05 [DEBUG] memberlist: Initiating push/pull sync with: 01GNEYGS0MGJMB1403VDKVJQ70 172.17.0.23:9094\n"
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    advertise_address: ""
    peers: cortex-alertmanager-headless.cortex-base.svc.cluster.local.:9094
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/multitenant_alertmanager/status
Name | Addr
-- | --
01GNEYE2MM174KCJ09VEM39238 | 172.17.0.14
01GNEYGESEPJYDCXVF9WPNTXYX | 172.17.0.3
01GNEYGS0MGJMB1403VDKVJQ70 | 172.17.0.23
Am I missing something here?
@AlexandreRoux
Uhm, I don't know :shrug:. Sorry.
Also reading this
Important: Do not load balance traffic between Prometheus and its Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.
I'm not really sure how to proceed...
However, I think the missing piece might be the peers: directive? Are you sure that cluster HA is enabled in your first example? The output at least looks different.
@nschad - I am the same, not too sure how to proceed... I am thinking of following up on the #cortex or #prometheus Slack channels to dive further with the help of the community.
In an attempt to answer the questions, let's compare the HA documentation with some observations:
Alertmanager's high availability is in production use at many companies and is enabled by default
--cluster.listen-address string: cluster listen address (default "0.0.0.0:9094"; empty string disables HA mode)
Let's see what happens when we don't use the config.alertmanager.cluster key:
# - - - - - - - - - - #
# - configuration #
# - - - - - - - - - - #
config:
  alertmanager:
    enable_api: true
    data_dir: /data
alertmanager:
  statefulSet:
    enabled: true
  replicas: 1
# - - - - - - - - - - #
# - result #
# - - - - - - - - - - #
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: 0.0.0.0:9094
    advertise_address: ""
    peers: ""
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
# http://localhost:8080/multitenant_alertmanager/status
# increasing replicas will increase the peers here
01GNHDW3RMM64BXV76SPDG561K | 172.17.0.14
level=debug caller=cluster.go:173 component=cluster msg="resolved peers to following addresses" peers=172.17.0.14:9094
level=info caller=cluster.go:185 component=cluster msg="setting advertise address explicitly" addr=172.17.0.14 port=9094
L214 and L255 of Alertmanager's main.go imply that the HA cluster will be enabled by default unless an empty string is passed to cluster.listen-address. This is in line with the documentation and with the values from /config.
The log lines then show cluster.go as the caller, implying that alertmanager is running in cluster mode.
Important: Do not load balance traffic between Prometheus and its Alertmanagers, but instead point Prometheus to a list of all Alertmanagers. The Alertmanager implementation expects all alerts to be sent to all Alertmanagers to ensure high availability.
This is very correct, and I believe that passing the list of alertmanager pods is the right way, BUT it does not matter with regard to which mode alertmanager is running in.
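As an illustration, here is a hedged sketch of pointing the Cortex ruler at every alertmanager replica rather than at a load-balanced Service. The alertmanager_url field is from my reading of the Cortex ruler config, and the pod DNS names are assumptions based on this thread's naming, so please verify both before relying on this:

```yaml
# helm values sketch (assumption: the ruler config accepts a
# comma-separated list of alertmanager URLs in alertmanager_url)
config:
  ruler:
    # list every replica explicitly so each alert reaches all peers
    alertmanager_url: 'http://cortex-alertmanager-0.cortex-alertmanager-headless.cortex-base.svc.cluster.local:8080/api/prom/alertmanager,http://cortex-alertmanager-1.cortex-alertmanager-headless.cortex-base.svc.cluster.local:8080/api/prom/alertmanager'
```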
If running Alertmanager in high availability mode is not desired, setting --cluster.listen-address= prevents Alertmanager from listening to incoming peer requests.
Let's see what happens when we use the config.alertmanager.cluster key with an empty string:
# - - - - - - - - - - #
# - configuration #
# - - - - - - - - - - #
config:
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      listen_address: ''
alertmanager:
  statefulSet:
    enabled: true
  replicas: 3
# - - - - - - - - - - #
# - result #
# - - - - - - - - - - #
# k port-forward service/cortex-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/config
alertmanager:
  cluster:
    listen_address: ""
    advertise_address: ""
    peers: ""
    peer_timeout: 15s
    gossip_interval: 200ms
    push_pull_interval: 1m0s
# http://localhost:8080/multitenant_alertmanager/status
Alertmanager gossip-based clustering is disabled.
level=info ts=2022-12-30T11:27:02.130479807Z caller=server.go:260 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
I was able to notice that cluster.go is completely absent from the pod logs, and /multitenant_alertmanager/status states that gossip-based clustering is disabled. This again seems to match the alertmanager documentation and code.
In conclusion, this leads me to believe the following:
- config.alertmanager.cluster.listen_address="" needs to be set to disable the HA cluster alertmanager
- config.alertmanager.cluster.peers should be a list of pods

@AlexandreRoux Thank you for your work.
Okay this is my take.
So basically alertmanager was, and is, already running in "cluster" mode. However, it does not know what its peers are, hence the occasional EOF.
When I configured the alertmanager peers via a headless service, the members do show up in the /multitenant_alertmanager/status dashboard when you port-forward one alertmanager.
This does not happen when you do not configure it. It's likely that you now have 4 separate "alertmanager" clusters when using replicas: 4.
So to me it seems that all we have to do is configure the peers?
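If so, a minimal sketch of what that could look like in the helm values, assuming the headless service name used earlier in this thread (verify the service name and namespace against your deployment):

```yaml
# helm values sketch: join all replicas into one gossip cluster
config:
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      # every replica resolves the headless service and joins its peers
      peers: 'cortex-alertmanager-headless.cortex-base.svc.cluster.local:9094'
alertmanager:
  statefulSet:
    enabled: true
  replicas: 3
```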
I am sorry for the delay in coming back to this one...
2# - why there is no grpc port exposed ? For example ingesters do have grpc exposed
Just missing. No real reason I guess. See explanation below.
I have pushed https://github.com/cortexproject/cortex-helm-chart/pull/435 in order to correct the headless service :)
So basically alertmanager was, and is, already running in "cluster" mode. However, it does not know what its peers are, hence the occasional EOF.
It's probably likely that you now have 4 "alertmanager" clusters when using a replicas: 4
I am 90% sure I was able to catch EOF while having all my alertmanagers listed under /multitenant_alertmanager/status, but as time has passed and improvements were made overall, I am going to run more tests to see if the EOF still exists for me.
I hope to post here again quickly and close the thread.
I am (unfortunately) able to reproduce EOF while having a correct cluster setup:
# - - - - - - - - - - #
# - configuration #
# - - - - - - - - - - #
config:
  alertmanager:
    enable_api: true
    data_dir: /data
    cluster:
      peers: 'cortex-base-alertmanager-headless.cortex-base.svc.cluster.local.:9094'
  alertmanager_storage:
    backend: "local"
    local:
      path: /data
alertmanager:
  statefulSet:
    enabled: true
  replicas: 2
  extraArgs:
    experimental.alertmanager.enable-api: true
    log.level: error
# - - - - - - - - - - #
# - result #
# - - - - - - - - - - #
# k port-forward service/cortex-base-alertmanager 8080:8080 -n cortex-base > /dev/null &
# http://localhost:8080/multitenant_alertmanager/status
Members
Name | Addr
-- | --
01GSF8DYT997R63Y5SYXEB2MT0 | 10.233.70.132
01GSF8DHW853BHJP2QSFDJ5N9C | 10.233.68.121
# k logs -f -n cortex-base -l app.kubernetes.io/component=ruler -c rules
level=error ts=2023-02-17T08:30:02.490627604Z caller=notifier.go:534 user=tenant-production-1 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-17T14:17:12.402036643Z caller=notifier.go:534 user=tenant-production-2 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-17T17:44:02.474755591Z caller=notifier.go:534 user=tenant-production-1 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-19T03:38:37.988803309Z caller=notifier.go:534 user=tenant-production-2 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
Here I am running with PR https://github.com/cortexproject/cortex-helm-chart/pull/435 to get my headless service corrected:
# k get services -n cortex-base | grep headless
cortex-base-alertmanager-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-distributor-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-ingester-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-query-frontend-headless ClusterIP None <none> 9095/TCP 47d
cortex-base-store-gateway-headless ClusterIP None <none> 9095/TCP 47d
Yeah about that, it's confusing. It actually not really relevant technically what you edit in the Port section of a deployment/STS/RS. From the docs:
Exposing a port here gives the system additional information about the network connections a container uses, but is primarily informational. Not specifying a port here DOES NOT prevent that port from being exposed. Any port which is listening on the default "0.0.0.0" address inside a container will be accessible from the network. Cannot be updated.
Even though I really agree here, I am going to do more tests by tweaking the ports section, as friedrichg got his issue fixed this way:
I discovered my pod wasn't exposing 9094 tcp port correctly. There is a long standing open kubernetes bug that occurs when there is a port using udp and tcp in the same pod. https://github.com/kubernetes/kubernetes/issues/39188
I solved the problem deleting the statefulset and recreating it for alertmanager
I am also going to try replacing config.alertmanager.cluster.peers with a list of pod addresses OR IPs instead of the k8s headless service for alertmanager.
I will keep you posted.
I have run further tests and I am (unfortunately, again) able to reproduce EOF. Here are my changes / results:
#1 - Fix the alertmanager ports section by correctly exposing 9094 for cluster gossip over TCP + UDP
# cortex/templates/alertmanager/alertmanager-statefulset.yaml
ports:
  - name: http-metrics
    containerPort: {{ .Values.config.server.http_listen_port }}
    protocol: TCP
  - name: gossip
    containerPort: {{ .Values.config.memberlist.bind_port }}
    protocol: TCP
  - name: grpc
    containerPort: {{ .Values.config.server.grpc_listen_port }}
    protocol: TCP
  - containerPort: 9094
    name: gossip-clu-tcp
    protocol: TCP
  - containerPort: 9094
    name: gossip-clu-udp
    protocol: UDP
# k describe pods/cortex-base-alertmanager-0 -n cortex-base
alertmanager:
Ports: 8080/TCP, 7946/TCP, 9095/TCP, 9094/TCP, 9094/UDP
#2 - Change the value of config.alertmanager.cluster.peers from the headless service to a list of pod addresses
# values.yaml
alertmanager:
  enable_api: true
  data_dir: /data
  cluster:
    peers: 'cortex-base-alertmanager-0.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094,cortex-base-alertmanager-1.cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094'
I came back 24 hours after the changes, and by checking my ruler container logs (level: error) I can still find EOF:
# k logs -f -n cortex-base -l app.kubernetes.io/component=ruler -c rules
level=error ts=2023-02-21T16:31:12.207574825Z caller=notifier.go:534 user=tenant-production-1 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=2 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
level=error ts=2023-02-22T03:12:38.725425758Z caller=notifier.go:534 user=tenant-production-2 alertmanager=http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=2 msg="Error sending alert" err="Post \"http://cortex-base-alertmanager.cortex-base.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"
# k logs -f -n cortex-base -l app.kubernetes.io/component=alertmanager -c alertmanager
It is worth mentioning that the alertmanager logs are empty, most likely because the EOF is a result of the ruler closing the connection before the alertmanager idle connection timeout (5 min). This has been discussed in: https://github.com/cortexproject/cortex/issues/4958#issuecomment-1309398482
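For completeness, one mitigation sometimes suggested for idle-connection EOFs is tuning the HTTP server idle timeout so that it outlives the client's connection reuse. A hedged sketch, assuming the Cortex server block exposes http_server_idle_timeout (as in the weaveworks/common server config); whether this field is available in this chart's Cortex version, and whether it helps here, is untested:

```yaml
# helm values sketch (assumption: http_server_idle_timeout is
# honored by the server block of this Cortex version)
config:
  server:
    # keep idle ruler -> alertmanager connections open longer
    http_server_idle_timeout: 10m
```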
I will edit my PR to include the correction of the ports section with 9094 if we are interested in having it in the helm chart, but in terms of troubleshooting / fixing the EOF, I am afraid that I do not have any further ideas...
My understanding is that alertmanager uses port 9094 to communicate between peers in cluster mode, but master/templates/alertmanager has no reference to exposing this port.
Configuring alertmanager in cluster mode can be done via:
The above configuration seems to be working:
However, I believe we should correctly expose 9094, the same as is done via cortex-jsonnet (https://github.com/cortexproject/cortex-jsonnet):
In addition, I raised an issue in the Cortex project for a known EOF error in the ruler when sending alerts to alertmanager, and the outcome seems to be correctly exposing 9094 via the statefulset: https://github.com/cortexproject/cortex/issues/4958