I like this :) Do we close this internal issue with this fix? Also, I did observe several of these going up to as much as 9-10 seconds, so I think I would prefer increasing the threshold to 10s.
Snippets from my notes looking at this (3 different envs):
ts=2024-10-24T13:10:56.997387562Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.889450532s duration_ms=7889 cached=618 returned=348 partial=0
ts=2024-10-24T13:57:08.483773237Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=9.036448962s duration_ms=9036 cached=740 returned=486 partial=0
ts=2024-10-24T13:06:53.997585723Z caller=fetcher.go:557 level=info component=block.BaseFetcher msg="successfully synchronized block metadata" duration=7.67385069s duration_ms=7673 cached=662 returned=390 partial=0
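For reference, the alert itself evaluates the objstore operation duration histogram rather than these sync log lines directly. Assuming the metric is thanos_objstore_bucket_operation_duration_seconds (which I believe is what the alert uses), something like this shows the p99 latency that gets compared against the threshold:

```promql
histogram_quantile(
  0.99,
  sum by (job, le) (
    rate(thanos_objstore_bucket_operation_duration_seconds_bucket[5m])
  )
)
```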
Yes I think we can close that issue with this PR. I will make a comment in that issue.
Sounds good to increase it to 10s. I will update that and then merge.
What kind of PR is this?
What does this PR do / why do we need this PR?
We have noticed that every second week there are some jobs in Thanos that cause the 99th percentile latency for the store gateway to spike for roughly a day. This does not seem to have any significant impact on the performance of other Thanos components or their queries, but it does trigger the ThanosStoreObjstoreOperationLatencyHigh alert. The latency spike fairly consistently reaches 1.5-2.5 s, and the alert fires on latency above 2 s. So this PR changes the alert threshold to latency above 5 s (a sketch of the adjusted rule is included below).
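For context, a rough sketch of what the adjusted rule could look like, modeled on the upstream thanos-mixin ThanosStoreObjstoreOperationLatencyHigh alert as I understand it. The job selector, `for` duration, and annotations are placeholders and may differ from our deployment, and per the review thread above the threshold was later bumped from the 5s proposed here to 10s before merging:

```yaml
- alert: ThanosStoreObjstoreOperationLatencyHigh
  expr: |
    (
      histogram_quantile(0.99,
        sum by (job, le) (rate(thanos_objstore_bucket_operation_duration_seconds_bucket{job=~".*thanos-store.*"}[5m]))
      ) > 10  # was 2s; 5s proposed in this PR, 10s agreed in review
    and
      sum by (job) (rate(thanos_objstore_bucket_operation_duration_seconds_count{job=~".*thanos-store.*"}[5m])) > 0
    )
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Thanos Store 99th percentile objstore operation latency is above the threshold.
```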
Information to reviewers
Checklist
- NetworkPolicy
- Dashboard