HenriWahl / Nagstamon

Nagios status monitor for your desktop.
https://nagstamon.de

More debug output about why some alerts are shown and others are not #709

Closed. varac closed this issue 3 years ago.

varac commented 3 years ago

I'm testing the alertmanager integration. However, Nagstamon shows different alerts than the alertmanager UI does. With debugging enabled, I can see multiple alerts being fetched and processed, but there is no information about why a (silenced) alert is displayed while other active ones are not. Please add more verbose logging in debug mode to make this easier to troubleshoot.
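
For example, one line per processed alert along these lines would already help (just a sketch using plain Python logging, not Nagstamon's actual debug helper; all values are illustrative):

```python
import logging

# Sketch of the kind of per-alert debug line I have in mind -- plain
# logging here, not Nagstamon's actual debug helper.
logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("nagstamon.alertmanager")

# illustrative values for one alert
fingerprint = "0123456789abcdef"
state, silenced_by, severity = "suppressed", ["some-silence-id"], "warning"
decision = "hidden (silenced)"

log.debug("[%s] state=%s silencedBy=%s severity=%s -> %s",
          fingerprint, state, silenced_by, severity, decision)
```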

HenriWahl commented 3 years ago

@stearz do you have any hint?

stearz commented 3 years ago

@varac: Can you please send me the debug log together with information about which alerts are displayed (but should not be) and which are not displayed (but should be)? I will also try to enhance DEBUG logging so it is easier to see why an alert is displayed or hidden.

varac commented 3 years ago

@stearz Here's the log: nagstamon-anon.log

These alerts are shown (but should not be, because they are silenced):

These alerts should be shown because they are firing and not silenced, but are not:

Thanks for investigating!

stearz commented 3 years ago

I took a look at the debug log but have no idea what is wrong. I changed the debug output to hopefully be more helpful. @HenriWahl could you merge my PR so that @varac can test? (I wasn't able to get GitHub Actions to work for me.) https://github.com/HenriWahl/Nagstamon/pull/712

HenriWahl commented 3 years ago

@stearz @varac the latest binaries contain the necessary changes. See https://github.com/HenriWahl/Nagstamon/releases/tag/latest

stearz commented 3 years ago

@varac Please install the latest testing version and send me the debug logs again.

varac commented 3 years ago

Thanks for the updated Nagstamon, its logging is now much easier to understand. Here's what the log shows:

DEBUG: 2021-04-15 16:18:23.945438 oas.varac.net detection config (map_to_status_information): 'message,summary,description'
DEBUG: 2021-04-15 16:18:23.945452 oas.varac.net detection config (map_to_hostname): 'pod_name,namespace,instance'
DEBUG: 2021-04-15 16:18:23.945456 oas.varac.net detection config (map_to_servicename): 'alertname'
DEBUG: 2021-04-15 16:18:23.945462 oas.varac.net FetchURL: https://alertmanager.oas.varac.net/api/v2/alerts CGI Data: None
DEBUG: 2021-04-15 16:18:23.950074 oas.varac.net received status code '200' with this content in result.result:
-----------------------------------------------------------------------------------------------------------------------------
[{"annotations":{"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"},"endsAt":"2021-04-15T14:29:01.069Z","fingerprint":"2e6357288f9e7b4d","receivers":[{"name":"null"}],"startsAt":"2021-04-13T06:29:01.069Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-15T14:17:01.073Z","generatorURL":"http://prometheus.oas.varac.net/graph?g0.expr=vector%281%29\u0026g0.tab=1","labels":{"alertname":"Watchdog","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"none"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-15T14:27:24.381Z","fingerprint":"7b0814d213f92350","receivers":[{"name":"email"}],"startsAt":"2021-04-13T16:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":["0928514d-9555-4fba-80de-e31c421dc1e1"],"state":"suppressed"},"updatedAt":"2021-04-15T14:15:24.383Z","generatorURL":"http://prometheus.oas.varac.net/graph?g0.expr=sum%28apiserver_request%3Aburnrate3d%29+%3E+%281+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate6h%29+%3E+%281+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"3d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"6h"}},{"annotations":{"description":"Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit","summary":"Cluster has overcommitted memory resource requests."},"endsAt":"2021-04-15T14:28:36.809Z","fingerprint":"d98d85c33827c631","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["63fbf3e0-baa5-4a73-b935-381739089357"],"state":"suppressed"},"updatedAt":"2021-04-15T14:16:36.817Z","generatorURL":"http://prometheus.oas.varac.net/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_memory_bytes%3Asum%29+%2F+sum%28kube_node_status_allocatable_memory_bytes%29+%3E+%28count%28kube_node_status_allocatable_memory_bytes%29+-+1%29+%2F+count%28kube_node_status_allocatable_memory_bytes%29\u0026g0.tab=1","labels":{"alertname":"KubeMemoryOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}},{"annotations":{"description":"Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit","summary":"Cluster has overcommitted CPU resource 
requests."},"endsAt":"2021-04-15T14:28:36.809Z","fingerprint":"f54b182e59fd8d9e","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["0ff8c148-49af-49a7-bfeb-f820608822f0"],"state":"suppressed"},"updatedAt":"2021-04-15T14:16:36.814Z","generatorURL":"http://prometheus.oas.varac.net/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_cpu_cores%3Asum%29+%2F+sum%28kube_node_status_allocatable_cpu_cores%29+%3E+%28count%28kube_node_status_allocatable_cpu_cores%29+-+1%29+%2F+count%28kube_node_status_allocatable_cpu_cores%29\u0026g0.tab=1","labels":{"alertname":"KubeCPUOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}}]
-----------------------------------------------------------------------------------------------------------------------------
DEBUG: 2021-04-15 16:18:23.950118 oas.varac.net processing alert with fingerprint '2e6357288f9e7b4d':
DEBUG: 2021-04-15 16:18:23.950123 oas.varac.net [2e6357288f9e7b4d]: detected severity from labels 'NONE' -> skipping alert
DEBUG: 2021-04-15 16:18:23.950125 oas.varac.net processing alert with fingerprint '7b0814d213f92350':
DEBUG: 2021-04-15 16:18:23.950128 oas.varac.net [7b0814d213f92350]: detected severity from labels 'WARNING'
DEBUG: 2021-04-15 16:18:23.950132 oas.varac.net [7b0814d213f92350]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-15 16:18:23.950135 oas.varac.net [7b0814d213f92350]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-15 16:18:23.950254 oas.varac.net [7b0814d213f92350]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-15 16:18:23.950330 oas.varac.net processing alert with fingerprint 'd98d85c33827c631':
DEBUG: 2021-04-15 16:18:23.950335 oas.varac.net [d98d85c33827c631]: detected severity from labels 'WARNING'
DEBUG: 2021-04-15 16:18:23.950338 oas.varac.net [d98d85c33827c631]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-15 16:18:23.950341 oas.varac.net [d98d85c33827c631]: detected servicename from labels: 'KubeMemoryOvercommit'
DEBUG: 2021-04-15 16:18:23.950407 oas.varac.net [d98d85c33827c631]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-15 16:18:23.950471 oas.varac.net processing alert with fingerprint 'f54b182e59fd8d9e':
DEBUG: 2021-04-15 16:18:23.950476 oas.varac.net [f54b182e59fd8d9e]: detected severity from labels 'WARNING'
DEBUG: 2021-04-15 16:18:23.950479 oas.varac.net [f54b182e59fd8d9e]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-15 16:18:23.950482 oas.varac.net [f54b182e59fd8d9e]: detected servicename from labels: 'KubeCPUOvercommit'
DEBUG: 2021-04-15 16:18:23.950544 oas.varac.net [f54b182e59fd8d9e]: detected status: 'suppressed' -> interpreting as silenced

So the alerts are interpreted as silenced, as they should be. However, they still show up in the status window.

(screenshot: Nagstamon status window still listing the silenced alerts)
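
By the way, from the log I understand the detection mapping roughly like this: each comma-separated label is tried in order and the first one present on the alert wins, otherwise it falls back to 'unknown'. A sketch of that logic as I read it (my own illustration, not the actual Nagstamon code):

```python
# Rough sketch of the detection mapping as I understand it from the log --
# illustrative only, not the actual Nagstamon code.

def map_labels(labels: dict, mapping: str, fallback: str = 'unknown') -> str:
    """Return the value of the first label from the comma-separated mapping
    that is present on the alert, or the fallback."""
    for key in mapping.split(','):
        if key in labels:
            return labels[key]
    return fallback

# labels of the KubeAPIErrorBudgetBurn alert from the log above
labels = {'alertname': 'KubeAPIErrorBudgetBurn',
          'prometheus': 'oas/prometheus-stack-kube-prom-prometheus',
          'severity': 'warning', 'short': '6h', 'long': '3d'}

# none of pod_name/namespace/instance exist as labels -> 'unknown'
hostname = map_labels(labels, 'pod_name,namespace,instance')
servicename = map_labels(labels, 'alertname')  # -> 'KubeAPIErrorBudgetBurn'
print(hostname, servicename)
```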

stearz commented 3 years ago

I think you might be using the 'Acknowledged hosts and services' filter, but the alertmanager integration interprets silenced alerts as services in scheduled downtime.

Try setting the filter 'Hosts and services down for maintenance' to hide those alerts:

(screenshot: Nagstamon filter settings with 'Hosts and services down for maintenance' enabled)
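
Roughly, the current mapping and filtering behave like this (simplified sketch; field and filter names are illustrative, not the actual Nagstamon internals):

```python
# Simplified sketch of the state mapping and the filters -- illustrative
# only, the real Nagstamon code differs in the details.

def flags_from_state(state: str) -> dict:
    """Map an Alertmanager alert state to Nagstamon-style service flags."""
    return {
        # a 'suppressed' (silenced) alert is treated as scheduled downtime
        'scheduled_downtime': state == 'suppressed',
        'acknowledged': False,
    }

def hidden_by_filters(flags: dict, filters: dict) -> bool:
    """A service is hidden if an enabled filter matches one of its flags."""
    if filters.get('hosts_and_services_down_for_maintenance') and flags['scheduled_downtime']:
        return True
    if filters.get('acknowledged_hosts_and_services') and flags['acknowledged']:
        return True
    return False

flags = flags_from_state('suppressed')
print(hidden_by_filters(flags, {'acknowledged_hosts_and_services': True}))          # False -> still shown
print(hidden_by_filters(flags, {'hosts_and_services_down_for_maintenance': True}))  # True  -> hidden
```

With only 'Acknowledged hosts and services' enabled, nothing matches, which is why the silenced alerts stay visible.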

varac commented 3 years ago

@stearz That did the trick, thanks. However, it's counter-intuitive, so I propose fixing this so that silenced alertmanager alerts can also be filtered properly by the 'Acknowledged hosts and services' filter. What do you think?

stearz commented 3 years ago

@varac I'm not sure whether a silenced alert is more like a scheduled downtime or more like an acknowledgement in Nagios/Icinga. Instead of answering this philosophical question, I will check whether it is possible to hide silenced alertmanager alerts with either filter (having one of them set would be sufficient). If that is possible, I will deliver it in the next release.
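
In code terms the change would be roughly this (sketch only, not the final implementation):

```python
# Sketch of the intended change: treat a silenced alert as both acknowledged
# and in scheduled downtime, so that either filter hides it.
def flags_from_state(state: str) -> dict:
    silenced = (state == 'suppressed')
    return {'scheduled_downtime': silenced, 'acknowledged': silenced}

print(flags_from_state('suppressed'))  # both flags True -> either filter hides the alert
```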

varac commented 3 years ago

@stearz Having worked with both Nagios and Prometheus+Alertmanager, I can say that an alertmanager silence is very similar to an acknowledgement in Nagios. There is no dedicated concept of a scheduled downtime in alertmanager, and I guess you could use silences for that as well. But your proposal is fine with me, thanks for integrating!

varac commented 3 years ago

I found another issue. I have one firing alert (6bd1c7fe217b28c2) which is logged as active but not shown in the UI. My active filters are 'Acknowledged hosts and services' and 'Hosts and services down for maintenance'. Even when I disable all filters, it is not shown. Maybe it's because its description (The API server is burning too much error budget.) is the same as that of an already silenced alert, while only a label differs ("short": "2h" instead of "short": "6h").

DEBUG: 2021-04-18 08:57:06.817782 DOMAIN1 detection config (map_to_status_information): 'message,summary,description'
DEBUG: 2021-04-18 08:57:06.817841 DOMAIN1 detection config (map_to_hostname): 'pod_name,namespace,instance'
DEBUG: 2021-04-18 08:57:06.817860 DOMAIN1 detection config (map_to_servicename): 'alertname'
DEBUG: 2021-04-18 08:57:06.817886 DOMAIN1 FetchURL: https://alertmanager.DOMAIN1/api/v2/alerts CGI Data: None
DEBUG: 2021-04-18 08:57:06.830410 DOMAIN1 received status code '200' with this content in result.result:
-----------------------------------------------------------------------------------------------------------------------------
[{"annotations":{"message":"This is an alert meant to ensure that the entire alerting pipeline is functional.\nThis alert is always firing, therefore it should always be firing in Alertmanager\nand always fire against a receiver. There are integrations with various notification\nmechanisms that send a notification when this alert is not firing. For example the\n\"DeadMansSnitch\" integration in PagerDuty.\n"},"endsAt":"2021-04-18T07:08:01.069Z","fingerprint":"2e6357288f9e7b4d","receivers":[{"name":"null"}],"startsAt":"2021-04-13T06:29:01.069Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-18T06:56:01.072Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=vector%281%29\u0026g0.tab=1","labels":{"alertname":"Watchdog","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"none"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-18T07:06:24.381Z","fingerprint":"6bd1c7fe217b28c2","receivers":[{"name":"email"}],"startsAt":"2021-04-18T06:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":[],"state":"active"},"updatedAt":"2021-04-18T06:54:24.383Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28apiserver_request%3Aburnrate1d%29+%3E+%283+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate2h%29+%3E+%283+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"1d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"2h"}},{"annotations":{"description":"The API server is burning too much error budget.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn","summary":"The API server is burning too much error budget."},"endsAt":"2021-04-18T07:06:24.381Z","fingerprint":"7b0814d213f92350","receivers":[{"name":"email"}],"startsAt":"2021-04-13T16:18:24.381Z","status":{"inhibitedBy":[],"silencedBy":["0928514d-9555-4fba-80de-e31c421dc1e1"],"state":"suppressed"},"updatedAt":"2021-04-18T06:54:24.383Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28apiserver_request%3Aburnrate3d%29+%3E+%281+%2A+0.01%29+and+sum%28apiserver_request%3Aburnrate6h%29+%3E+%281+%2A+0.01%29\u0026g0.tab=1","labels":{"alertname":"KubeAPIErrorBudgetBurn","long":"3d","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning","short":"6h"}},{"annotations":{"description":"Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit","summary":"Cluster has overcommitted memory resource 
requests."},"endsAt":"2021-04-18T07:07:36.809Z","fingerprint":"d98d85c33827c631","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["63fbf3e0-baa5-4a73-b935-381739089357"],"state":"suppressed"},"updatedAt":"2021-04-18T06:55:36.812Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_memory_bytes%3Asum%29+%2F+sum%28kube_node_status_allocatable_memory_bytes%29+%3E+%28count%28kube_node_status_allocatable_memory_bytes%29+-+1%29+%2F+count%28kube_node_status_allocatable_memory_bytes%29\u0026g0.tab=1","labels":{"alertname":"KubeMemoryOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}},{"annotations":{"description":"Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure.","runbook_url":"https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit","summary":"Cluster has overcommitted CPU resource requests."},"endsAt":"2021-04-18T07:07:36.809Z","fingerprint":"f54b182e59fd8d9e","receivers":[{"name":"email"}],"startsAt":"2021-04-13T06:37:36.809Z","status":{"inhibitedBy":[],"silencedBy":["0ff8c148-49af-49a7-bfeb-f820608822f0"],"state":"suppressed"},"updatedAt":"2021-04-18T06:55:36.811Z","generatorURL":"http://prometheus.DOMAIN1/graph?g0.expr=sum%28namespace%3Akube_pod_container_resource_requests_cpu_cores%3Asum%29+%2F+sum%28kube_node_status_allocatable_cpu_cores%29+%3E+%28count%28kube_node_status_allocatable_cpu_cores%29+-+1%29+%2F+count%28kube_node_status_allocatable_cpu_cores%29\u0026g0.tab=1","labels":{"alertname":"KubeCPUOvercommit","prometheus":"oas/prometheus-stack-kube-prom-prometheus","severity":"warning"}}]
-----------------------------------------------------------------------------------------------------------------------------
DEBUG: 2021-04-18 08:57:06.830663 DOMAIN1 processing alert with fingerprint '2e6357288f9e7b4d':
DEBUG: 2021-04-18 08:57:06.830688 DOMAIN1 [2e6357288f9e7b4d]: detected severity from labels 'NONE' -> skipping alert
DEBUG: 2021-04-18 08:57:06.830700 DOMAIN1 processing alert with fingerprint '6bd1c7fe217b28c2':
DEBUG: 2021-04-18 08:57:06.830713 DOMAIN1 [6bd1c7fe217b28c2]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.830731 DOMAIN1 [6bd1c7fe217b28c2]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.830745 DOMAIN1 [6bd1c7fe217b28c2]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-18 08:57:06.831293 DOMAIN1 [6bd1c7fe217b28c2]: detected status: 'active'
DEBUG: 2021-04-18 08:57:06.831729 DOMAIN1 processing alert with fingerprint '7b0814d213f92350':
DEBUG: 2021-04-18 08:57:06.831753 DOMAIN1 [7b0814d213f92350]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.831769 DOMAIN1 [7b0814d213f92350]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.831783 DOMAIN1 [7b0814d213f92350]: detected servicename from labels: 'KubeAPIErrorBudgetBurn'
DEBUG: 2021-04-18 08:57:06.832176 DOMAIN1 [7b0814d213f92350]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-18 08:57:06.832558 DOMAIN1 processing alert with fingerprint 'd98d85c33827c631':
DEBUG: 2021-04-18 08:57:06.832583 DOMAIN1 [d98d85c33827c631]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.832599 DOMAIN1 [d98d85c33827c631]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.832613 DOMAIN1 [d98d85c33827c631]: detected servicename from labels: 'KubeMemoryOvercommit'
DEBUG: 2021-04-18 08:57:06.832982 DOMAIN1 [d98d85c33827c631]: detected status: 'suppressed' -> interpreting as silenced
DEBUG: 2021-04-18 08:57:06.833343 DOMAIN1 processing alert with fingerprint 'f54b182e59fd8d9e':
DEBUG: 2021-04-18 08:57:06.833366 DOMAIN1 [f54b182e59fd8d9e]: detected severity from labels 'WARNING'
DEBUG: 2021-04-18 08:57:06.833382 DOMAIN1 [f54b182e59fd8d9e]: detected hostname from labels: 'unknown'
DEBUG: 2021-04-18 08:57:06.833395 DOMAIN1 [f54b182e59fd8d9e]: detected servicename from labels: 'KubeCPUOvercommit'
DEBUG: 2021-04-18 08:57:06.833799 DOMAIN1 [f54b182e59fd8d9e]: detected status: 'suppressed' -> interpreting as silenced

stearz commented 3 years ago

The combination of hostname and servicename can exist only once per monitor in Nagstamon, so the second (silenced) alert overwrites the first (firing) one.
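
For illustration, roughly what happens with your two KubeAPIErrorBudgetBurn alerts (hypothetical structures, not the actual Nagstamon internals, but the key is hostname plus servicename):

```python
# Illustration of the collision -- simplified, hypothetical structures.
fetched_alerts = [
    {'fingerprint': '6bd1c7fe217b28c2', 'hostname': 'unknown',
     'servicename': 'KubeAPIErrorBudgetBurn', 'state': 'active'},
    {'fingerprint': '7b0814d213f92350', 'hostname': 'unknown',
     'servicename': 'KubeAPIErrorBudgetBurn', 'state': 'suppressed'},
]

services = {}
for alert in fetched_alerts:
    key = (alert['hostname'], alert['servicename'])
    services[key] = alert  # same key -> the second alert overwrites the first

print(services[('unknown', 'KubeAPIErrorBudgetBurn')]['state'])  # 'suppressed'
```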

I am not sure how to solve this.

You could try to use another field for the hostname by changing the hostname detection mapping, but there is no guarantee that this will not happen again.

@varac / @HenriWahl Do you have any ideas?

varac commented 3 years ago

I'm not familiar with how Nagstamon indexes alerts, but what if Nagstamon used the alertmanager fingerprint as the primary index? Then it could handle alerts that share the same alertname/description.
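
Roughly like this (just a sketch of the idea, I don't know the actual internals):

```python
# Sketch of the idea: key alerts by the Alertmanager fingerprint instead of
# (hostname, servicename), so alerts that differ only in labels don't collide.
fetched_alerts = [
    {'fingerprint': '6bd1c7fe217b28c2', 'hostname': 'unknown',
     'servicename': 'KubeAPIErrorBudgetBurn', 'state': 'active'},
    {'fingerprint': '7b0814d213f92350', 'hostname': 'unknown',
     'servicename': 'KubeAPIErrorBudgetBurn', 'state': 'suppressed'},
]

services = {alert['fingerprint']: alert for alert in fetched_alerts}
print(len(services))  # 2 -- both the firing and the silenced alert are kept
```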

varac commented 3 years ago

I'm closing this in favor of the better-scoped issue #723.