PRB0040484 AlertManager reporting hourly specific KubeAPIErrorsHigh failures

wmhutchison commented 3 years ago

Describe the issue: Once an hour, both the KPROD and SILVER clusters report KubeAPIErrorsHigh alerts for LIST requests on podmonitors, prometheusrules and servicemonitors. This appears to have started shortly after we upgraded to OCP 4.5.20.

Which Sprint Goal is this issue related to?

Additional context: Red Hat case https://access.redhat.com/support/cases/#/case/02829539

Definition of done Checklist (where applicable)

[x] Open Red Hat case and upload must-gather dump
[x] Review feedback from Red Hat on next steps
[x] Test/apply workarounds which may apply before final fix is available or ready to apply
[x] Apply final fix

wmhutchison commented 3 years ago

Preliminary web searches suggest that https://bugzilla.redhat.com/show_bug.cgi?id=18918 might apply here, in which case the final fix is an upgrade to OCP 4.5.22.

Opened a Red Hat case anyway and provided the must-gather for KLAB, both to get a second opinion and to make sure nothing else is amiss.

wmhutchison commented 3 years ago

Killing the Prometheus operator works as a workaround. Moving this to Blocked since we need an OCP upgrade for a permanent fix.
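
For reference, a minimal sketch of how the workaround could be scripted. The thread does not say how the operator was killed, so the namespace `openshift-monitoring`, the label selector `app.kubernetes.io/name=prometheus-operator`, and the helper names `restart_command` and `restart_operator` are all assumptions to verify against the cluster before use. The sketch defaults to a dry run that only prints the `oc` command.

```python
import subprocess

# Assumptions (not stated in the thread): the operator runs in the
# openshift-monitoring namespace and carries this label. Check with
# `oc -n openshift-monitoring get pods --show-labels` first.
NAMESPACE = "openshift-monitoring"
SELECTOR = "app.kubernetes.io/name=prometheus-operator"


def restart_command(namespace=NAMESPACE, selector=SELECTOR):
    """Build the oc argument list that deletes the operator pod(s).

    Deleting the pod lets its Deployment recreate it, which is one way
    to "kill" the operator.
    """
    return ["oc", "-n", namespace, "delete", "pod", "-l", selector]


def restart_operator(dry_run=True):
    """Print the command (dry run) or execute it against the current cluster."""
    cmd = restart_command()
    if dry_run:
        print(" ".join(cmd))
        return cmd
    # Raises CalledProcessError if oc exits with a non-zero status.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    restart_operator(dry_run=True)
```

Since the errors recur hourly, check the alert again after the next cycle to confirm the restart took effect.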

StevenBarre commented 3 years ago

Should be fixed in v4.5.22; the fix will be included in this quarter's patching of Silver.

StevenBarre commented 3 years ago

Silver upgraded to v4.5.31
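
A small sketch of the version check implied here, assuming the fix does land in 4.5.22 as suggested earlier in the thread. The helper names `parse_version` and `has_fix` are hypothetical, and the comparison only handles plain numeric versions such as `4.5.31` (no pre-release suffixes).

```python
# Version the thread expects to carry the fix (an expectation, not a
# confirmed fact).
FIXED_IN = "4.5.22"


def parse_version(version):
    """Turn 'v4.5.31' or '4.5.31' into a comparable tuple like (4, 5, 31)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def has_fix(cluster_version, fixed_in=FIXED_IN):
    """Return True if cluster_version is at or above the fixed release."""
    return parse_version(cluster_version) >= parse_version(fixed_in)


if __name__ == "__main__":
    for cluster, version in [("Silver (after upgrade)", "v4.5.31"),
                             ("version when issue appeared", "4.5.20")]:
        print(cluster, version, "includes fix:", has_fix(version))
```

Comparing tuples of integers avoids the string-comparison trap where `"4.5.9"` would sort above `"4.5.22"`.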

BCDevOps / OpenShift4-RollOut

PRB0040484 AlertManager reporting hourly specific KubeAPIErrorsHigh failures #480