humblebundledore opened 1 year ago
I was advised in the Cortex Slack channel to file a bug, so here we go.
Is there a known way to mitigate this issue? I have set up alertmanager in cluster mode now, but I am still looking for a way to verify that all alerts have been delivered correctly (despite the EOF).
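For now I verify by hand that alerts actually reached alertmanager, by listing what it holds over its API. A sketch of what I run; the service name, tenant, and the /alertmanager HTTP prefix are assumptions from a default Cortex setup:

$ kubectl port-forward -n <namespace> svc/<alertmanager-service> 8080:8080
$ curl -s -H 'X-Scope-OrgID: <tenant>' \
    http://127.0.0.1:8080/alertmanager/api/v2/alerts | jq '.[].labels.alertname'

If an alert the ruler fired shows up here despite the EOF in the ruler logs, it did get through eventually.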
Thanks for filing this issue. Do you have any gateway in front of alertmanager where you can tune the idle connection timeout to be greater than 5 minutes, as mentioned in this comment: https://github.com/prometheus/prometheus/issues/9057#issuecomment-875429178
n/m @AlexandreRoux, I saw your comment in the Prometheus issue, and it sounds like modifying the alertmanager server-side idle connection timeout is not feasible for you?
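(For reference, in case someone else hits this and can change the server side: the knob should be the common server flag on the alertmanager target, raised above the client-side idle window mentioned in that comment. A sketch; verify the flag name against your Cortex version:

-server.http-idle-timeout=6m )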
We've been talking in Slack. I think I have had this problem for a while but was ignoring it, because the alerts are eventually sent. I discovered yesterday that the issue went away when I enabled alertmanager sharding.
My current solution:
Removed this:
--alertmanager.cluster.listen-address=[$(POD_IP)]:9094
--alertmanager.cluster.peers=alertmanager-0.alertmanager.namespace.svc.cluster.local:9094,alertmanager-1.alertmanager.namespace.svc.cluster.local:9094,alertmanager-2.alertmanager.namespace.svc.cluster.local:9094
Added this:
-alertmanager.sharding-enabled=true
-alertmanager.sharding-ring.replication-factor=3
-alertmanager.sharding-ring.store=memberlist
-memberlist.abort-if-join-fails=false
-memberlist.bind-port=7946
-memberlist.join=gossip-ring.namespace.svc.cluster.local:7946
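For the -memberlist.join address above to resolve, there has to be a headless Service selecting the Cortex pods on the memberlist port. A minimal sketch of mine (the name, namespace, and selector label are assumptions specific to my setup):

apiVersion: v1
kind: Service
metadata:
  name: gossip-ring
  namespace: namespace
spec:
  clusterIP: None                  # headless, so DNS returns the individual pod IPs
  publishNotReadyAddresses: true   # lets members join before pods turn Ready
  selector:
    gossip_ring_member: "true"     # assumed label; use whatever your pods carry
  ports:
  - name: gossip-ring
    port: 7946
    targetPort: 7946
    protocol: TCP                  # Cortex memberlist uses a TCP transport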
Note: unfortunately this is not possible for @AlexandreRoux, because he can't enable alertmanager sharding; he is using the local storage backend for alertmanager:
-alertmanager-storage.backend=local
I rolled back to no sharding and to using alertmanager gossip.
I discovered my pod wasn't exposing the 9094 TCP port correctly. There is a long-standing open Kubernetes bug that occurs when a pod uses the same port over both UDP and TCP: https://github.com/kubernetes/kubernetes/issues/39188
I solved the problem by deleting the alertmanager StatefulSet and recreating it. Let me know if this helps, @AlexandreRoux.
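To check whether the recreated object actually carries both protocol entries, something like this works (names assumed):

$ kubectl get statefulset alertmanager -n <namespace> \
    -o jsonpath='{.spec.template.spec.containers[0].ports}'

The output should list containerPort 9094 twice, once with "protocol":"TCP" and once with "protocol":"UDP". Because of the kubernetes bug above, the live StatefulSet can silently disagree with the manifest, which is why I deleted and re-applied it instead of patching it.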
@alvinlin123 - I apologize for the delay in replying here.
Indeed, as @friedrichg mentioned, it seems that 9094 is not exposed correctly on my side either. In addition, I also noticed that there might be some configuration missing in the way the Helm chart deploys Cortex: https://github.com/cortexproject/cortex-helm-chart
$ k get services -n cortex-base | grep alertmanager
cortex-base-alertmanager ClusterIP 10.xx.xx.201 <none> 8080/TCP 77d
cortex-base-alertmanager-headless ClusterIP None <none> 8080/TCP 9d
$ k describe pods/cortex-base-alertmanager-0 -n cortex-base
Ports: 8080/TCP, 7946/TCP
$ k describe statefulset/cortex-base-alertmanager -n cortex-base
Ports: 8080/TCP, 7946/TCP
Host Ports: 0/TCP, 0/TCP
$ kubectl exec -ti cortex-base-alertmanager-0 -c alertmanager -n cortex-base -- /bin/sh
/ # nc -zv 127.0.0.1:9094
127.0.0.1:9094 (127.0.0.1:9094) open
/ # nc -zv cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094
cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094 (10.xx.xx.119:9094) open
$ k logs -f -n cortex-base -l app.kubernetes.io/component=alertmanager -c alertmanager
level=debug ts=2022-12-13T07:26:33.456676073Z caller=cluster.go:337 component=cluster memberlist="2022/12/13 07:26:33 [DEBUG] memberlist: Initiating push/pull sync with: 01GKWRxxxxxxxxxQSDT73 10.xx.xx.223:9094\n"
In https://github.com/cortexproject/cortex-helm-chart it seems we are missing references to port 9094. In https://github.com/cortexproject/cortex-jsonnet I am able to generate the appropriate YAML, for example:
cortex-jsonnet/manifests ∙ grep -r "9094" ./
.//apps-v1.StatefulSet-alertmanager.yaml: - --alertmanager.cluster.listen-address=[$(POD_IP)]:9094
.//apps-v1.StatefulSet-alertmanager.yaml: - --alertmanager.cluster.peers=alertmanager-0.alertmanager.default.svc.cluster.local:9094,alertmanager-1.alertmanager.default.svc.cluster.local:9094,alertmanager-2.alertmanager.default.svc.cluster.local:9094
.//apps-v1.StatefulSet-alertmanager.yaml: - containerPort: 9094
.//apps-v1.StatefulSet-alertmanager.yaml: - containerPort: 9094
.//v1.Service-alertmanager.yaml: port: 9094
.//v1.Service-alertmanager.yaml: targetPort: 9094
.//v1.Service-alertmanager.yaml: port: 9094
.//v1.Service-alertmanager.yaml: targetPort: 9094
I will bring this forward to https://github.com/cortexproject/cortex-helm-chart and help improve the charts. I think we are good to close here :)
@nschad - FYI, I will probably open a bug and start working on getting 9094 added in https://github.com/cortexproject/cortex-helm-chart
@alvinlin123 / @friedrichg - I was able to find some time again to troubleshoot my EOF between the ruler and alertmanager, and unfortunately for me the issue is still present after fixing the port 9094 (TCP + UDP) exposure.
Here are all the details if you are interested: https://github.com/cortexproject/cortex-helm-chart/issues/420#issuecomment-1436743702 https://github.com/cortexproject/cortex-helm-chart/issues/420#issuecomment-1439619438
Describe the bug
Cortex ruler logs show an EOF error when posting alerts to Cortex alertmanager.
notifier.go is Prometheus code and could be missing a
req.Close = true
as pointed out here. Setting req.Close = true makes net/http close the connection after the request instead of returning it to the keep-alive pool, so the client never reuses an idle connection the server has already torn down, which is what otherwise surfaces as an EOF. A bug report exists in the Prometheus repo: https://github.com/prometheus/prometheus/issues/9057
The number of file descriptors used by the alertmanager process is below the default alertmanager file descriptor limit in my case. This bug does not seem to be tied to a specific alert, and it happens randomly.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
alertmanager should receive all POSTed alerts correctly.
Environment: