BCDevOps / developer-experience

This repository is used to track all work for the BCGov Platform Services Team (this includes work for: 1. Platform Experience, 2. Developer Experience, 3. Platform Operations/OCP 3)
Apache License 2.0

PRB0040989 - SILVER - Occasional alerts involving Metric Targets for logging pods being unavailable #4982

Closed wmhutchison closed 2 months ago

wmhutchison commented 3 months ago

Describe the issue
In recent weeks the on-call member of the Platform Operations team has been receiving spurious alerts from AlertManager on SILVER about Prometheus being unable to reach the metrics URL of random collector pods. A vendor case was opened to help troubleshoot this issue. This has no impact on the logging services offered on SILVER.

Additional context
Vendor ticket: https://access.redhat.com/support/cases/#/case/03875730

How does this benefit the users of our platform?
Fewer alerts sent to Platform Ops means more time to work on other things.

Definition of done

wmhutchison commented 3 months ago

After internal discussion, the call was made that silences would be entered as required in AlertManager after hours so that on-call is not woken up for these specific issues. The silence currently in place ignores this issue for the APP nodes, but not for the MASTER or INFRA nodes.

wmhutchison commented 2 months ago

Vendor support asked us to review and follow the diagnostic instructions in https://access.redhat.com/solutions/7056413. The article is not technically applicable to us since we don't have a dual-stack network configuration for our pods, so we ran the IPv4 query and uploaded its output to the case, along with details on the one pod showing issues at the time and a sos report from the node hosting it.
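
For reference, the rough procedure used to collect a sos report from the node in question looks like the following. This is a sketch of the standard OpenShift 4 toolbox approach rather than an exact transcript; the pod and node names are placeholders.

# Identify the node hosting the collector pod that is showing issues (pod name is a placeholder).
oc -n openshift-logging get pod collector-xxxxx -o wide

# Open a debug shell on that node, enter the host, and generate the sos report from toolbox.
oc debug node/&lt;node-name&gt;
chroot /host
toolbox
sosreport -k crio.all=on -k crio.logs=on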

Back over to vendor support to analyze and provide further feedback or instructions.

wmhutchison commented 2 months ago

The current "recipe" for creating a silence for this issue outside of business hours, so that on-call is not paged:

Run the following command and set aside its output as-is.

oc -n openshift-logging get pods -l implementation=fluentd -o wide --no-headers| grep -v mcs-silver-app | awk '{ print $1 }' | paste -s -d\| - | awk '{ print "(" $1 ")" }'

Manually create a new Silence with the following fields/values.

field=alertname, value=CollectorNodeDown
field=pod, value=(paste the output of the previous command as-is)

For the pod field, check RegEx and Negative matcher.

This ensures that collector pods on MASTER or INFRA nodes can still alert after hours, while we no longer get the spam for APP nodes.
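
As an alternative to clicking through the AlertManager UI, roughly the same silence can be scripted with amtool. This is only an illustrative sketch, assuming a recent amtool with support for negative regex matchers; the Alertmanager URL, duration, and comment below are placeholders to be adjusted for the cluster.

# Build the regex of collector pods NOT on APP nodes (same pipeline as above).
PODS=$(oc -n openshift-logging get pods -l implementation=fluentd -o wide --no-headers | grep -v mcs-silver-app | awk '{ print $1 }' | paste -s -d\| - | awk '{ print "(" $1 ")" }')

# Silence CollectorNodeDown only where the pod does NOT match the MASTER/INFRA pod list,
# i.e. the APP node collector pods, leaving MASTER/INFRA free to page.
amtool silence add \
  --alertmanager.url=https://&lt;alertmanager-route&gt; \
  --comment="PRB0040989: silence CollectorNodeDown for APP node collectors overnight" \
  --duration=12h \
  alertname=CollectorNodeDown "pod!~\"${PODS}\""

The RegEx + Negative matcher combination in the UI corresponds to the pod!~ matcher here: the silence applies to any CollectorNodeDown alert whose pod is not in the MASTER/INFRA list.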

wmhutchison commented 2 months ago

Vendor support has confirmed receipt of requested diagnostic info and are currently reviewing.

wmhutchison commented 2 months ago

During today's vendor TAM meeting, this issue was discussed. While review of the uploaded data continues, the current findings suggest that this non-critical notification is the result of the collector pods working through larger amounts of data while pushing it into the ES cluster.

The next step to be explored is a configuration change to AlertManager so that the notification is sent via regular channels instead of to on-call. Will make that change and continue monitoring SILVER.

wmhutchison commented 2 months ago

https://github.com/bcgov-c/platform-ops/pull/499 was created for the formal change to AlertManager on all clusters so that CollectorNodeDown events no longer notify on-call and just send a regular email instead. The change was tested by manually applying it in SILVER and confirming that new occurrences of this event no longer notify on-call.
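
For context, the shape of that change is roughly a dedicated child route in the Alertmanager configuration that matches CollectorNodeDown and hands it to an email-only receiver instead of the on-call paging receiver. The snippet below is an illustrative sketch, not the actual contents of PR 499; the receiver name and address are placeholders.

route:
  routes:
    # CollectorNodeDown goes to email only; everything else keeps its existing routing.
    - match:
        alertname: CollectorNodeDown
      receiver: email-notify
receivers:
  - name: email-notify
    email_configs:
      - to: platform-ops@example.com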

There are plenty of other, more crucial checks still in place regarding the health of the Elasticsearch cluster taking in all of the logs, including notification when log retention for any node falls out of sync (nothing seen in the last 15 minutes).

wmhutchison commented 2 months ago

Will review vendor support's last recommendations and likely close off the ticket, since it seems we've already tried similar recommendations in the past for this issue to no avail. All in all, the current technology is simply struggling to keep up with the amount of log data involved on SILVER, so even a reduction in log retention might not fully resolve this issue. Not paging on-call about the notification in question is likely our best bet for the current situation while we slowly move towards Loki/Vector.

wmhutchison commented 2 months ago

Vendor case closed off; will also update the internal DXC PRB ticket and request closure. The AlertManager notification no longer pages on-call, and the event itself seems to be occurring less often, so closing this ticket in ZenHub as well.