elastisys / compliantkubernetes-apps

Elastisys Compliant Kubernetes is an open source, Certified Kubernetes distribution designed according to the ISO 27001 controls, providing you with security tooling and observability from day one.
https://elastisys.io/compliantkubernetes/
Apache License 2.0

apps sc: increase threshold for thanosobjstore latency alert #2356

Closed · viktor-f closed this 1 day ago

viktor-f commented 4 days ago

[!warning] This is a public repository, ensure not to disclose:

  • [ ] personal data beyond what is necessary for interacting with this pull request, nor
  • [ ] business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

Optional: Mark one or more of the following that are applicable:

[!important] Breaking changes should be marked kind/admin-change or kind/dev-change depending on type. Critical security fixes should be marked with kind/security.

What does this PR do / why do we need this PR?

We have noticed that every second week, some jobs in Thanos cause the 99th-percentile objstore latency for the store gateway to spike for roughly a day. This does not appear to have any significant impact on the performance of the other Thanos components and their queries, but it does trigger the ThanosStoreObjstoreOperationLatencyHigh alert.

The spike fairly consistently reaches 1.5-2.5 s, while the alert fires on latency above 2 s, so this PR raises the alert threshold to 5 s.
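For context, a hedged sketch of the kind of rule this threshold lives in, modeled on the upstream thanos-mixin ThanosStoreObjstoreOperationLatencyHigh alert; the exact expression, label matchers, and file location in compliantkubernetes-apps may differ:

```yaml
# Sketch of a Prometheus alerting rule in the shape of the upstream
# thanos-mixin alert; only the comparison constant changes here.
groups:
  - name: thanos-store
    rules:
      - alert: ThanosStoreObjstoreOperationLatencyHigh
        annotations:
          description: >-
            Thanos Store {{$labels.job}} has a 99th percentile latency of
            {{$value}} seconds for bucket operations.
        expr: |
          (
            histogram_quantile(0.99,
              sum by (job, le) (
                rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
              )
            ) > 5  # raised from the upstream default of 2
          and
            sum by (job) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
            ) > 0
          )
        for: 10m
        labels:
          severity: warning
```

The rate window and the for: duration are left at their upstream values; only the threshold moves.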

Information to reviewers

Checklist

viktor-f commented 1 day ago

I like this :) Do we close this internal issue with this fix? Also, I did observe several of these going up to as much as 9-10 seconds, so I think I would prefer increasing the threshold to 10s.

Snippets from my notes looking at this (3 different envs):

ts=2024-10-24T13:10:56.997387562Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.889450532s duration_ms=7889 cached=618 returned=348 partial=0

ts=2024-10-24T13:57:08.483773237Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=9.036448962s duration_ms=9036 cached=740 returned=486 partial=0

ts=2024-10-24T13:06:53.997585723Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.67385069s duration_ms=7673 cached=662 returned=390 partial=0
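For anyone reproducing these observations, a query along these lines shows the p99 operation latency that the alert evaluates. The metric is the standard Thanos objstore histogram; the job matcher is an assumption about how the store gateway is labeled in these environments:

```promql
# p99 objstore bucket operation latency per job and operation
histogram_quantile(0.99,
  sum by (job, operation, le) (
    rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
  )
)
```

The 7.7-9.0 s block-metadata sync durations in the snippets above are consistent with a 10 s threshold rather than 5 s.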

Yes, I think we can close that issue with this PR. I will make a comment in that issue.

Sounds good to increase it to 10s; I will update that and then merge.