cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/

Ruler reports an error EOF when sending alert to alertmanager #4958

Open · humblebundledore opened this issue 1 year ago

humblebundledore commented 1 year ago

Describe the bug: Cortex ruler logs show an EOF error when posting alerts to the Cortex alertmanager.

level=error caller=notifier.go:527 user=tenant-one alertmanager=http://cortex-alertmanager.cortex.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts count=1 msg="Error sending alert" err="Post \"http://cortex-alertmanager.cortex.svc.cluster.local:8080/api/prom/alertmanager/api/v1/alerts\": EOF"

notifier.go is Prometheus code and may be missing a req.Close = true, as pointed out here.
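
For illustration, here is a minimal Go sketch (not the actual notifier.go code) of what setting req.Close = true does: the client closes the TCP connection after the request instead of returning it to the idle pool, so a connection the server has silently dropped is never reused, which is a common cause of this kind of EOF.

    package notifysketch

    import (
        "bytes"
        "net/http"
    )

    // postAlerts posts a JSON payload and disables connection reuse for this
    // request only. With req.Close = true the transport does not return the
    // connection to its idle pool after the response is read.
    func postAlerts(client *http.Client, url string, payload []byte) error {
        req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
        if err != nil {
            return err
        }
        req.Header.Set("Content-Type", "application/json")
        req.Close = true // close instead of keeping the connection alive
        resp, err := client.Do(req)
        if err != nil {
            return err
        }
        defer resp.Body.Close()
        return nil
    }

Whether notifier.go itself should do this is exactly what the linked Prometheus discussion is about.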

A bug report exists in the Prometheus repo:

In my case, the number of file descriptors used by the alertmanager process is below the default alertmanager file descriptor limit. This bug does not seem to be tied to a specific alert and happens randomly.

To Reproduce: Steps to reproduce the behavior:

  1. Set up the ruler and alertmanager.
  2. Send alerts from the ruler to the alertmanager until an EOF occurs.

Expected behavior: The alertmanager should receive all posted alerts correctly.

Environment:

humblebundledore commented 1 year ago

I was advised on the Cortex Slack channel to file a bug, so here we go.

Is there a known way to mitigate this issue? I have set up alertmanager in cluster mode now, but I am still looking for a way to verify that all alerts have been posted correctly (despite the EOF).
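
For illustration, one way to spot-check delivery would be to poll the per-tenant alerts endpoint and look for the expected alerts. A rough Go sketch of that idea follows; the GET on the same /api/v1/alerts path the ruler posts to, and the X-Scope-OrgID tenant header, are assumptions based on my setup rather than verified API documentation.

    package verifysketch

    import (
        "fmt"
        "io"
        "net/http"
    )

    // listAlerts fetches the alerts currently held by the Cortex alertmanager
    // for one tenant, so they can be compared against what the ruler fired.
    // Endpoint path and header are assumptions, see the note above.
    func listAlerts(baseURL, tenant string) (string, error) {
        req, err := http.NewRequest(http.MethodGet,
            baseURL+"/api/prom/alertmanager/api/v1/alerts", nil)
        if err != nil {
            return "", err
        }
        req.Header.Set("X-Scope-OrgID", tenant) // Cortex tenant header
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            return "", err
        }
        defer resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return "", fmt.Errorf("unexpected status %s", resp.Status)
        }
        body, err := io.ReadAll(resp.Body)
        return string(body), err
    }

If those assumptions hold, calling listAlerts("http://cortex-alertmanager.cortex.svc.cluster.local:8080", "tenant-one") should return whatever alerts the alertmanager currently holds for that tenant.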

alvinlin123 commented 1 year ago

Thanks for filing this issue. Do you have any gateway in front of the alertmanager where you can tune the idle connection timeout to be larger than 5 minutes, as mentioned in this comment: https://github.com/prometheus/prometheus/issues/9057#issuecomment-875429178
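
For context, the usual mechanism behind these EOFs is a server or gateway closing an idle keep-alive connection that the client still considers usable, so the next POST hits a dead connection. The gateway-side fix is the longer idle timeout mentioned above; purely as an illustration of the same mechanism, the complementary client-side knob in Go's net/http is the transport's IdleConnTimeout (this is not a Cortex or Prometheus flag):

    package timeoutsketch

    import (
        "net/http"
        "time"
    )

    // newNotifierClient builds an HTTP client whose idle keep-alive connections
    // are dropped after 30s. Keeping this shorter than the server's (or the
    // gateway's) idle timeout means the client closes idle connections first
    // and never reuses one the other side has already torn down.
    func newNotifierClient() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                IdleConnTimeout: 30 * time.Second,
            },
        }
    }

Either side of the connection can be tuned; the key point is that whoever drops idle connections first must be the client.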

alvinlin123 commented 1 year ago

n/m @AlexandreRoux, I saw your comment in the Prometheus issue, and it sounds like modifying the alertmanager server-side connection idle timeout is not feasible for you?

friedrichg commented 1 year ago

We've been talking in Slack. I think I have had this problem for a while, but I was ignoring it because the alerts are eventually sent. I discovered yesterday that the issue is gone once I activated alertmanager sharding.

My current solution:

Removed this:

        --alertmanager.cluster.listen-address=[$(POD_IP)]:9094
        --alertmanager.cluster.peers=alertmanager-0.alertmanager.namespace.svc.cluster.local:9094,alertmanager-1.alertmanager.namespace.svc.cluster.local:9094,alertmanager-2.alertmanager.namespace.svc.cluster.local:9094

Added this:

        -alertmanager.sharding-enabled=true
        -alertmanager.sharding-ring.replication-factor=3
        -alertmanager.sharding-ring.store=memberlist
        -memberlist.abort-if-join-fails=false
        -memberlist.bind-port=7946
        -memberlist.join=gossip-ring.namespace.svc.cluster.local:7946

Note: Unfortunately this is not possible for @AlexandreRoux; he can't enable alertmanager sharding because he is using the local storage backend for alertmanager:

        -alertmanager-storage.backend=local

friedrichg commented 1 year ago

I rolled back to no sharding and to using alertmanager gossip.

I discovered my pod wasn't exposing the 9094 TCP port correctly. There is a long-standing open Kubernetes bug that occurs when a port uses both UDP and TCP in the same pod: https://github.com/kubernetes/kubernetes/issues/39188

I solved the problem by deleting the alertmanager statefulset and recreating it. Let me know if this helps, @AlexandreRoux.

humblebundledore commented 1 year ago

@alvinlin123 - apologies for the delay in replying here.

Indeed, as @friedrichg mentioned, it seems that 9094 is not exposed correctly on my side either. In addition, I also noticed that there might be some missing configuration in the way the Helm chart deploys Cortex: https://github.com/cortexproject/cortex-helm-chart

$ k get services -n cortex-base | grep alertmanager
cortex-base-alertmanager              ClusterIP   10.xx.xx.201   <none>        8080/TCP   77d
cortex-base-alertmanager-headless     ClusterIP   None            <none>        8080/TCP   9d

$ k describe pods/cortex-base-alertmanager-0 -n cortex-base
    Ports:         8080/TCP, 7946/TCP

$ k describe statefulset/cortex-base-alertmanager -n cortex-base
    Ports:       8080/TCP, 7946/TCP
    Host Ports:  0/TCP, 0/TCP

$ kubectl exec -ti cortex-base-alertmanager-0 -c alertmanager -n cortex-base -- /bin/sh
/ # nc -zv 127.0.0.1:9094
127.0.0.1:9094 (127.0.0.1:9094) open
/ # nc -zv cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094
cortex-base-alertmanager-headless.cortex-base.svc.cluster.local:9094 (10.xx.xx.119:9094) open

$ k logs -f -n cortex-base -l app.kubernetes.io/component=alertmanager -c alertmanager
level=debug ts=2022-12-13T07:26:33.456676073Z caller=cluster.go:337 component=cluster memberlist="2022/12/13 07:26:33 [DEBUG] memberlist: Initiating push/pull sync with: 01GKWRxxxxxxxxxQSDT73 10.xx.xx.223:9094\n"

In https://github.com/cortexproject/cortex-helm-chart it seems we are missing a reference to port 9094. In https://github.com/cortexproject/cortex-jsonnet I am able to generate the appropriate YAML, for example:

cortex-jsonnet/manifests ∙ grep -r "9094" ./ 
.//apps-v1.StatefulSet-alertmanager.yaml:        - --alertmanager.cluster.listen-address=[$(POD_IP)]:9094
.//apps-v1.StatefulSet-alertmanager.yaml:        - --alertmanager.cluster.peers=alertmanager-0.alertmanager.default.svc.cluster.local:9094,alertmanager-1.alertmanager.default.svc.cluster.local:9094,alertmanager-2.alertmanager.default.svc.cluster.local:9094
.//apps-v1.StatefulSet-alertmanager.yaml:        - containerPort: 9094
.//apps-v1.StatefulSet-alertmanager.yaml:        - containerPort: 9094
.//v1.Service-alertmanager.yaml:    port: 9094
.//v1.Service-alertmanager.yaml:    targetPort: 9094
.//v1.Service-alertmanager.yaml:    port: 9094
.//v1.Service-alertmanager.yaml:    targetPort: 9094

I will bring this forward to https://github.com/cortexproject/cortex-helm-chart and help improve the charts. I think we are good to close here :)

humblebundledore commented 1 year ago

@nschad - FYI, I will maybe open a bug and start working on getting port 9094 added in https://github.com/cortexproject/cortex-helm-chart

humblebundledore commented 1 year ago

@alvinlin123 / @friedrichg - I was able to find some time again to troubleshoot my EOF between the ruler and alertmanager, and unfortunately the issue is still present after fixing the port 9094 (TCP + UDP) exposure.

Here are all the details if you are interested:
https://github.com/cortexproject/cortex-helm-chart/issues/420#issuecomment-1436743702
https://github.com/cortexproject/cortex-helm-chart/issues/420#issuecomment-1439619438