Closed pcuq-ads closed 1 year ago
I propose Major as the priority (to be discussed at the next CCB), because without this feature we cannot compute the availability Performance Indicator.
Here is the Thanos compactor pod configuration: kubectl get po -n monitoring thanos-compactor-7b69757dbb-642wx -o yaml
[...]
- args:
  - compact
  - --log.level=info
  - --log.format=logfmt
  - --http-address=0.0.0.0:10902
  - --data-dir=/data
  - --retention.resolution-raw=7d
  - --retention.resolution-5m=30d
  - --retention.resolution-1h=10y
  - --consistency-delay=30m
  - --objstore.config-file=/conf/objstore.yml
  - --wait
  image: docker.io/bitnami/thanos:0.23.1-scratch-r3
[...]
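The three --retention.resolution-* flags above are applied per resolution, so the age of a sample determines which resolutions can still hold it. A minimal Python sketch of that reading follows. It is an illustration only, not Thanos code: the helper name resolutions_available is hypothetical, and "10y" is approximated as 3650 days.

```python
from datetime import timedelta

# Retention per resolution, copied from the compactor args above.
# Assumption: "10y" is approximated as 3650 days for illustration.
RETENTION = {
    "raw": timedelta(days=7),
    "5m": timedelta(days=30),
    "1h": timedelta(days=3650),
}


def resolutions_available(age: timedelta) -> list[str]:
    """Return the resolutions whose retention still covers a sample of this age."""
    return [name for name, keep in RETENTION.items() if age <= keep]


if __name__ == "__main__":
    for days in (1, 8, 45):
        print(days, "days old ->", resolutions_available(timedelta(days=days)))
```

Under this reading, a sample that is 8 days old can no longer be served at raw resolution, which is consistent with the 8-day limit reported in the steps to reproduce.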
IVV_CCB_2023_w01 : Accepted CS, Priority major
The bucket cluster-ops-thanos contained around 200 Gb on 28/12/2022.
So the hypothesis that the S3 bucket was cleaned up does not hold.
Restarting the Thanos pods did not solve the issue.
Another hypothesis: thanos-compactor does not have enough CPU and RAM to run.
Is it normal for thanos/compactor to be configured with storageClass: "ceph-block"?
CS_CCB_2023_w05 : Analysis requested
CS_CCB_2023_w06 : No news, but this issue has high urgency
CS_CCB_2023_w07 : Still waiting for analysis from CS dev team
CS_CCB_2023_w08 : No answer on this issue for now
@pcuq-ads @suberti-ads We partially reproduced the behaviour on the IVV cluster: we cannot see older metric values in Grafana, but we did not observe any data loss after a restart.
However, we were able to retrieve older values with some tuning. We have to enable auto-downsampling on the Thanos querier and restart it. After that, we can see some older points:
Then we have to adjust the Min step option of the Grafana query, for example to 24h:
Do you want to test it on the OPS platform?
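Why the step matters can be sketched as follows. According to the Thanos documentation for --query.auto-downsampling, the querier derives the maximum source resolution as step / 5 when the query does not set max_source_resolution itself. The Python below only illustrates that rule; it is not Thanos code, and the helper names are hypothetical.

```python
# Native resolutions of Thanos blocks, in seconds (raw is treated as 0).
RESOLUTIONS = {"raw": 0, "5m": 300, "1h": 3600}


def max_source_resolution(step_seconds: int) -> float:
    """Auto-downsampling rule as documented: step / 5."""
    return step_seconds / 5


def eligible_resolutions(step_seconds: int) -> list[str]:
    """Resolutions that are no coarser than the derived maximum."""
    limit = max_source_resolution(step_seconds)
    return [name for name, res in RESOLUTIONS.items() if res <= limit]


if __name__ == "__main__":
    for step in (60, 3600, 86400):
        print(f"step={step}s ->", eligible_resolutions(step))
```

With a 24h step the derived limit is 17280 s, so 1h blocks become eligible. With a 1m step only raw data qualifies, which would explain why points older than the raw retention do not appear.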
Was this tested on the OPS platform ? If not, we are ready to test it on your go
I am not sure that we have identified the root cause of the problem. Nevertheless, I am OK to test this setting on the OPS platform. Thank you. Regards
Dear @nleconte-csgroup ,
We have to enable auto-downsampling on the thanos querier and restart the thanos querier. After that, we can see some older points
How do I activate this option? As described in the compactor application, we want:
between now and 7d: use retentionResolutionRaw
between 7d and 30d: use retentionResolution5m
between 30d and 10y: use retentionResolution1h
I see https://thanos.io/v0.14/components/query.md/#auto-downsampling. Should I add the max_source_resolution option?
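For reference, the Thanos query HTTP API also accepts max_source_resolution as a per-request parameter, separately from the querier-wide auto-downsampling flag. A minimal Python sketch that only builds such a URL (no request is sent), assuming the data source address quoted in this issue. The function name range_query_url and the metric "up" are hypothetical.

```python
from urllib.parse import urlencode

# Data source address quoted in this issue.
THANOS_QUERY = "http://thanos-query.monitoring.svc.cluster.local:9090"


def range_query_url(query: str, start: int, end: int, step: str,
                    max_source_resolution: str) -> str:
    """Build a query_range URL that sets max_source_resolution explicitly."""
    params = {
        "query": query,
        "start": start,  # Unix timestamp, seconds
        "end": end,      # Unix timestamp, seconds
        "step": step,
        "max_source_resolution": max_source_resolution,
    }
    return f"{THANOS_QUERY}/api/v1/query_range?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical example: metric "up", 24h step, allow 1h downsampled blocks.
    print(range_query_url("up", 1672185600, 1678147200, "24h", "1h"))
```

Setting the parameter per request makes it possible to compare raw and downsampled results for the same time range without restarting the querier.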
Dear @eroan-marie , Could you please proceed with the change ? Thank you
CS_CCB_2023_w10 : In progress, proposed workaround deployed on OPS on 2023-03-07, waiting for feedback
OPS feedback: we see some points in the past. The fix seems to be OK. I propose keeping the anomaly under observation for one month.
@nleconte-csgroup Unfortunately, after the infrastructure 1.5 deployment, metrics from before 01/02/2023 are no longer available.
Two possibilities:
Regards
@pcuq-ads I just checked and it seems OK :
Please mind the "Min step" parameter in your query. It should not be greater than the resolution configured on Thanos.
CS_CCB_2023_w11 : Moved into "Refused CS" to place it on the OPS board, will be closed at the end of the month
SYS_CCB_2023_w12 : We can close sooner. The issue is fixed.
Environment:
Current Behavior: From Grafana, we cannot reach any metrics older than 28/12/2022 00:00:00.
Expected Behavior: Thanos should keep metrics with 1h resolution for 10 years.
Steps To Reproduce: Open a Grafana Panel with Thanos data source. Request any metrics over time. The values older than 8 days are not accessible.
Test execution artefacts (i.e. logs, screenshots…):
The Thanos data source is configured as http://thanos-query.monitoring.svc.cluster.local:9090
The Thanos compactor also seems to be correctly configured (xx1h : 10y).
No lifecycle rule is set on the Orange Flexible side for the bucket named cluster-ops-thanos.
Whenever possible, first analysis of the root cause
Bug Generic Definition of Ready (DoR)
Bug Generic Definition of Done (DoD)