elastisys / compliantkubernetes-apps

Elastisys Compliant Kubernetes is an open source, Certified Kubernetes distribution designed according to the ISO 27001 controls, providing you with security tooling and observability from day one.
https://elastisys.io/compliantkubernetes/
Apache License 2.0

apps sc: increase threshold for thanosobjstore latency alert #2356

Closed · viktor-f closed this 1 day ago

viktor-f commented 4 days ago

[!warning] This is a public repository, ensure not to disclose:

  • [ ] personal data beyond what is necessary for interacting with this pull request, nor
  • [ ] business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

Optional: Mark one or more of the following that are applicable:

[!important] Breaking changes should be marked kind/admin-change or kind/dev-change depending on type. Critical security fixes should be marked with kind/security.

What does this PR do / why do we need this PR?

We have noticed that every second week, some jobs in Thanos cause the 99th-percentile objstore latency for the store gateway to spike for roughly a day. This does not appear to have any significant impact on the performance of the other Thanos components and their queries, but it does trigger the ThanosStoreObjstoreOperationLatencyHigh alert.

The spike fairly consistently reaches 1.5-2.5 s, while the alert fires on latency above 2 s, so this PR raises the alert threshold to 5 s.
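For context, a hedged sketch of the kind of rule this threshold lives in, modeled on the upstream thanos-mixin ThanosStoreObjstoreOperationLatencyHigh alert; the exact expression, label matchers, and file location in compliantkubernetes-apps may differ:

```yaml
# Sketch of a Prometheus alerting rule in the shape of the upstream
# thanos-mixin alert; only the comparison constant changes here.
groups:
  - name: thanos-store
    rules:
      - alert: ThanosStoreObjstoreOperationLatencyHigh
        annotations:
          description: >-
            Thanos Store {{$labels.job}} has a 99th percentile latency of
            {{$value}} seconds for bucket operations.
        expr: |
          (
            histogram_quantile(0.99,
              sum by (job, le) (
                rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
              )
            ) > 5  # raised from the upstream default of 2
          and
            sum by (job) (
              rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
            ) > 0
          )
        for: 10m
        labels:
          severity: warning
```

The rate window and the for: duration are left at their upstream values; only the threshold moves.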

Information to reviewers

Checklist

viktor-f commented 1 day ago

I like this :) Do we close this internal issue with this fix? Also, I did observe several of these going up to as much as 9-10 seconds, so I think I would prefer increasing the threshold to 10s.

Snippets from my notes looking at this (3 different envs):

ts=2024-10-24T13:10:56.997387562Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.889450532s duration_ms=7889 cached=618 returned=348 partial=0

ts=2024-10-24T13:57:08.483773237Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=9.036448962s duration_ms=9036 cached=740 returned=486 partial=0

ts=2024-10-24T13:06:53.997585723Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.67385069s duration_ms=7673 cached=662 returned=390 partial=0
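For anyone reproducing these observations, a query along these lines shows the p99 operation latency that the alert evaluates. The metric is the standard Thanos objstore histogram; the job matcher is an assumption about how the store gateway is labeled in these environments:

```promql
# p99 objstore bucket operation latency per job and operation
histogram_quantile(0.99,
  sum by (job, operation, le) (
    rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m])
  )
)
```

The 7.7-9.0 s block-metadata sync durations in the snippets above are consistent with a 10 s threshold rather than 5 s.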

Yes, I think we can close that issue with this PR. I will make a comment in that issue.

Sounds good to increase it to 10s; I will update that and then merge.