parkedwards opened this issue 8 months ago
Hey @parkedwards - thanks for reaching out and apologies for the delayed response.
The rule-evaluator binary uses the same evaluation mechanism, configuration surface, and libraries as Prometheus does; i.e. it uses alertmanager_config verbatim.
So when it comes to BYO alertmanagers, the rule-evaluator is using the same underlying Kubernetes service-discovery as Prometheus does. Specifically, in our stack we take the same approach as prometheus-operator, and use endpoint-based service discovery to find the target for posting alerts to.
Now, if the alertmanager pod is rescheduled, presumably its Endpoints object would be updated with the new IP address. The Prometheus libraries use conventional client-go watch clients to ensure they're getting the most up-to-date state of resources from the K8s API server.
Would you be able to check the state of the corresponding Endpoints object for your alertmanager? Does it match the IP that the rule-evaluator is trying to send to?
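For example (a sketch - adjust the namespace and Service name to your setup), something like this should print the pod IPs currently registered on that Endpoints object, which you can compare against the target IPs in the rule-evaluator's error logs:
kubectl get endpoints alertmanager -n monitoring-alertmanager -o jsonpath='{.subsets[*].addresses[*].ip}'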
Hi @pintohutch,
I just restarted the alert manager, and the current endpoints are:
kubectl get endpoints -n monitoring-alertmanager
NAME ENDPOINTS AGE
alertmanager 100.65.113.198:9093,100.65.113.199:9093,100.65.113.200:9093 8d
Here are the logs from the rule evaluator:
2024-05-10 11:03:47.572
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:03:47.572593388Z"}
2024-05-10 11:03:52.572
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:03:52.572576038Z"}
2024-05-10 11:03:55.856
evaluator {"alertmanager":"http://100.65.113.196:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":8, "err":"Post "http://100.65.113.196:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:03:55.856348962Z"}
2024-05-10 11:03:55.856
evaluator {"alertmanager":"http://100.65.113.197:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":8, "err":"Post "http://100.65.113.197:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:03:55.856512159Z"}
2024-05-10 11:03:57.573
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:03:57.572745544Z"}
2024-05-10 11:04:02.572
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:04:02.572195445Z"}
2024-05-10 11:04:05.858
evaluator {"alertmanager":"http://100.65.113.196:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":7, "err":"Post "http://100.65.113.196:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:04:05.858036106Z"}
2024-05-10 11:04:05.858
evaluator {"alertmanager":"http://100.65.113.197:9093/api/v2/alerts", "caller":"notifier.go:532", "component":"notifier", "count":7, "err":"Post "http://100.65.113.197:9093/api/v2/alerts": context deadline exceeded", "level":"error", "msg":"Error sending alert", "ts":"2024-05-10T09:04:05.858132095Z"}
It's still pointing at the previous endpoints (from before the restart). The issue persists until we restart the rule-evaluator.
Interesting.
Hey @volkanakcora - could you provide the rule-evaluator config so we can sanity check this on our side?
kubectl get cm -ngmp-system rule-evaluator -oyaml
Hi @pintohutch,
Please find our configs:
kubectl get cm -ngmp-system rule-evaluator -oyaml
apiVersion: v1
data:
  config.yaml: |
    global: {}
    alerting:
      alertmanagers:
      - follow_redirects: true
        enable_http2: true
        scheme: http
        timeout: 10s
        api_version: v2
        static_configs:
        - targets:
          - alertmanager.gmp-system:9093
      - follow_redirects: true
        enable_http2: true
        scheme: http
        timeout: 10s
        api_version: v2
        relabel_configs:
        - source_labels: [__meta_kubernetes_endpoints_name]
          regex: alertmanager
          action: keep
        - source_labels: [__address__]
          regex: (.+):\d+
          target_label: __address__
          replacement: $1:9093
          action: replace
        kubernetes_sd_configs:
        - role: endpoints
          kubeconfig_file: ""
          follow_redirects: true
          enable_http2: true
          namespaces:
            own_namespace: false
            names:
            - monitoring-alertmanager
    rule_files:
    - /etc/rules/*.yaml
kind: ConfigMap
metadata:
  creationTimestamp: "2023-12-19T14:31:49Z"
  name: rule-evaluator
  namespace: gmp-system
  resourceVersion: "191689827"
  uid: 32306108-88c5-4a30-b7b1-b6d45d553686
And here's our alert manager config:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: alertmanager
spec:
  replicas: 3
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app.kubernetes.io/name: alertmanager
  serviceName: alertmanager
  template:
    metadata:
      labels:
        app.kubernetes.io/name: alertmanager
    spec:
      topologySpreadConstraints:
      - labelSelector:
          matchLabels:
            name: alert-manager
        maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000
      containers:
      - name: alertmanager
        image: remote-docker.artifactory.dbgcloud.io/prom/alertmanager
        args:
        - "--config.file=/etc/alertmanager/config.yml"
        - "--storage.path=/alertmanager"
        - "--web.listen-address=0.0.0.0:9093"
        - "--web.external-url=$(webExternalUrl)"
        - "--cluster.listen-address=0.0.0.0:9094"
        - "--cluster.peer=alertmanager-0.alertmanager.monitoring-alertmanager.svc.cluster.local:9094"
        - "--cluster.peer=alertmanager-1.alertmanager.monitoring-alertmanager.svc.cluster.local:9094"
        - "--cluster.peer=alertmanager-2.alertmanager.monitoring-alertmanager.svc.cluster.local:9094"
        - "--cluster.peer-timeout=15s"
        - "--cluster.gossip-interval=200ms"
        - "--cluster.pushpull-interval=1m0s"
        - "--cluster.settle-timeout=5s"
        - "--cluster.tcp-timeout=10s"
        - "--cluster.probe-timeout=500ms"
        - "--cluster.probe-interval=1s"
        - "--cluster.reconnect-interval=10s"
        - "--cluster.reconnect-timeout=6h0m0s"
        - "--cluster.label=alertmanager"
        ports:
        - name: web
          containerPort: 9093
        - name: cluster
          containerPort: 9094
        envFrom:
        - configMapRef:
            name: alert-manager-args
        - configMapRef:
            name: http-proxy-env
        resources:
          requests:
            cpu: 100m
            memory: 100M
          limits:
            cpu: 250m
            memory: 250M
        volumeMounts:
        - name: config-volume
          mountPath: /etc/alertmanager
        - name: templates-volume
          mountPath: /etc/alertmanager-templates
        - name: alertmanager
          mountPath: /alertmanager
        - name: slack-secrets
          mountPath: /secrets/slack
        - name: opsgenie-secrets
          mountPath: /secrets/opsgenie
        readinessProbe:
          httpGet:
            path: /-/ready
            port: web
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 1
          failureThreshold: 5
          successThreshold: 1
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: web
          initialDelaySeconds: 5
          periodSeconds: 10
          timeoutSeconds: 1
          failureThreshold: 5
          successThreshold: 1
        securityContext:
          allowPrivilegeEscalation: false
          seccompProfile:
            type: RuntimeDefault
          capabilities:
            drop:
            - ALL
      volumes:
      - name: config-volume
        configMap:
          name: alertmanager-config
      - name: templates-volume
        configMap:
          name: alertmanager-templates
      - name: alertmanager
        emptyDir: {}
      - name: slack-secrets
        secret:
          secretName: slack-secrets
      - name: opsgenie-secrets
        secret:
          secretName: opsgenie-secrets
Let me know if you need further info.
Hey @volkanakcora - do you have self-monitoring enabled? I'm curious what the value of prometheus_notifications_alertmanagers_discovered{job="rule-evaluator"} is, particularly before, during, and after the alertmanager restart.
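If so, a couple of queries along these lines (assuming the default job label on the self-monitoring scrape) should make the broken case obvious:
# Alertmanager endpoints the notifier currently knows about, per rule-evaluator pod
sum by (pod) (prometheus_notifications_alertmanagers_discovered{job="rule-evaluator"})
# rate of failed alert sends, keyed by the discovered Alertmanager target
rate(prometheus_notifications_errors_total{job="rule-evaluator"}[5m])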
Before Alert Manager Restart:
{
__name__="prometheus_notifications_alertmanagers_discovered",
cluster="cluster-dev",
container="evaluator",
instance="rule-evaluator-799bf54847-pdkkf:r-eval-metrics",
job="rule-evaluator",
location="europe-west3",
namespace="gmp-system",
pod="rule-evaluator-799bf54847-pdkkf",
project_id="dbg-energy-dev-659ba550"
}
Restarting the Alert Manager:
kubectl -n monitoring-alertmanager rollout restart statefulset alertmanager
After Alert Manager Restart (no change during the restart either):
{
__name__="prometheus_notifications_alertmanagers_discovered",
cluster="cluster-dev",
container="evaluator",
instance="rule-evaluator-799bf54847-pdkkf:r-eval-metrics",
job="rule-evaluator",
location="europe-west3",
namespace="gmp-system",
pod="rule-evaluator-799bf54847-pdkkf",
project_id="dbg-energy-dev-659ba550"
}
Restarting the Rule Evaluator:
kubectl rollout restart -n gmp-system deployment rule-evaluator
After Rule Evaluator Restart (New Metrics):
{
__name__="prometheus_notifications_alertmanagers_discovered",
cluster="cluster-dev",
container="evaluator",
instance="rule-evaluator-675998548d-nrkmd:r-eval-metrics",
job="rule-evaluator",
location="europe-west3",
namespace="gmp-system",
pod="rule-evaluator-675998548d-nrkmd",
project_id="dbg-energy-dev-659ba550"
}
I just synced them again via cmd line:
argocd app sync monitoring-applications/alert-manager --local kubernetes-config/envs/development/alert-manager/ --prune
This time it did not reproduce the same issue; I also checked the rule-evaluator discovery, and it seems to work as well.
Hey @volkanakcora - thanks for posting the metric details above. However, I'm more interested in the graph, which you posted in the subsequent comment. I would suspect that, in the broken case, the line stays at 4 (i.e. the discovery isn't toggled for some reason).
I just synced them again via cmd line: argocd app sync monitoring-applications/alert-manager --local kubernetes-config/envs/development/alert-manager/ --prune
I haven't personally used argo before. What does this do and why do you think it may fix the issue?
Hi @pintohutch,
That's actually the expected behavior, right? When the pod count drops to 3 and recovers to 4, it seems like ArgoCD's intervention functioned as intended (similar to a statefulset restart, but through ArgoCD's GitOps workflow).
In my initial comment, I mentioned rescheduling the pods (via the command line) while the rule evaluator itself remained unchanged, so I think we should stick with that problem, if I understood it right.
Oh wait - I actually think your debug logs from your second comment hold the clue. Specifically:
evaluator {"caller":"manager.go:359", "component":"discovery manager notify", "level":"debug", "msg":"Discovery receiver's channel was full so will retry the next cycle", "ts":"2024-05-10T09:04:02.572195445Z"}
The discovery manager is not able to keep up with service discovery events (changes) in this case. Is your rule-evaluator resource-starved in any way?
We would actually be able to track the delay through discovery metrics, but we don't enable those in our rule-evaluator (we should!) - filed https://github.com/GoogleCloudPlatform/prometheus-engine/issues/973 to do that.
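In the meantime, if cAdvisor/kubelet metrics are available in your self-monitoring, queries roughly like these (a sketch, assuming the standard cAdvisor label names) would show whether the evaluator container is CPU-throttled or close to its memory limit:
# fraction of CPU periods in which the evaluator container was throttled
rate(container_cpu_cfs_throttled_periods_total{namespace="gmp-system", container="evaluator"}[5m])
  / rate(container_cpu_cfs_periods_total{namespace="gmp-system", container="evaluator"}[5m])
# current working-set memory of the evaluator container
container_memory_working_set_bytes{namespace="gmp-system", container="evaluator"}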
Hi @pintohutch, these are the known logs/errors we get after every alert manager rescheduling.
Not sure if the rule-evaluator has a resource issue. I had thought about that as well, but did not change anything on the rule-evaluator side.
Please see the last 7 days of metrics for the rule evaluator:
Do you suggest we increase the rule-evaluator resources and test again?
Volkan.
Yes please, but it also depends on what limits you currently have.
Hi @bwplotka , I'm going to configure VPA for the rule evaluator and test it.
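Roughly along these lines - a sketch, assuming the VPA CRDs are already installed in the cluster:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rule-evaluator
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rule-evaluator
  updatePolicy:
    updateMode: Auto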
Hi @bwplotka , @pintohutch ,
I have boosted the resources for the rule evaluator, alert manager, and gmp-operator. However, the result is still the same:
Restarting the entire alert manager application by deleting and recreating the statefulset seems to resolve the issue, but it's not a guaranteed fix. Sometimes, even this approach fails.
Ok thanks for trying and letting us know @volkanakcora. It looks like you're running managed collection on GKE. Are you ok if we take a look from our side to help debug? We can open an internal ticket to track the support work.
I wonder if it's related to https://github.com/prometheus/prometheus/issues/13676...
Hi @pintohutch, it's OK for us, thank you.
It could be related to that; I'm checking it.
hello - we're currently using Managed Prometheus and a self-hosted Alertmanager deployment. This has been functioning properly for over a year. We're currently on this version of rule-evaluator:
Our rule-evaluator sends events to a self-managed Alertmanager statefulset, which lives in a separate namespace. We configure this via the OperatorConfig CRD (a sketch of that stanza is included at the end of this comment):
In the last month or so, we've noticed that the rule-evaluator will be unable to resolve the downstream Alertmanager address after the Alertmanager pod is rescheduled. From there, we'll see the rule-evaluator log this out:
This can go on for an hour - we have pages set up to notify us when the rule-evaluator stops pinging Alertmanager through a custom heartbeat rule. The only way to resolve this is by restarting the rule-evaluator deployment.
This suggests that the rule-evaluator is not reconciling downstream IP addresses after startup, since we provide the k8s DNS components in the OperatorConfig for the Alertmanager receiver.
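For reference, the OperatorConfig stanza has roughly this shape - a sketch only, with placeholder service name, namespace, and port rather than our exact values:
apiVersion: monitoring.googleapis.com/v1
kind: OperatorConfig
metadata:
  namespace: gmp-public
  name: config
rules:
  alerting:
    alertmanagers:
    - name: alertmanager                  # placeholder Service name
      namespace: monitoring-alertmanager  # placeholder namespace
      port: 9093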