COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, IVV bugs, epics, ...)

[BUG] [THANOS] No metrics available before 28/12/2022 00:00:00 #772

Closed: pcuq-ads closed this issue 1 year ago

pcuq-ads commented 1 year ago

Environment:

Current Behavior: From Grafana, we cannot reach any metrics older than 28/12/2022 00:00:00.

Expected Behavior: Thanos should keep metrics with 1H resolution for 10 years.

Steps To Reproduce: Open a Grafana Panel with Thanos data source. Request any metrics over time. The values older than 8 days are not accessible.

Test execution artefacts (i.e. logs, screenshots…): The Thanos datasource is configured as http://thanos-query.monitoring.svc.cluster.local:9090. The Thanos compactor also seems to be configured correctly (retention.resolution-1h = 10y). No lifecycle rule is set on the Orange Flexible side for the bucket named cluster-ops-thanos.

image.png
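
For reference, a minimal Grafana datasource provisioning sketch matching the URL reported above (the provisioning file layout is standard Grafana, not taken from the OPS deployment):

# Sketch only: Thanos exposed to Grafana as a Prometheus-compatible datasource.
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query.monitoring.svc.cluster.local:9090
    isDefault: false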

Whenever possible, first analysis of the root cause

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

pcuq-ads commented 1 year ago

I propose major as priority (to be discussed at the next CCB) because without this feature we cannot compute the availability Performance Indicator.

suberti-ads commented 1 year ago

Hereafter the Thanos compactor pod configuration: kubectl get po -n monitoring thanos-compactor-7b69757dbb-642wx -o yaml

[...]
  - args:
    - compact
    - --log.level=info
    - --log.format=logfmt
    - --http-address=0.0.0.0:10902
    - --data-dir=/data
    - --retention.resolution-raw=7d
    - --retention.resolution-5m=30d
    - --retention.resolution-1h=10y
    - --consistency-delay=30m
    - --objstore.config-file=/conf/objstore.yml
    - --wait
    image: docker.io/bitnami/thanos:0.23.1-scratch-r3
[...]
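
For reference, a minimal sketch of the same retention settings expressed as bitnami/thanos Helm values (the value names match the retentionResolution* naming used later in this thread; the exact chart layout is an assumption):

# Sketch only: Helm values that would render the compactor args shown above.
compactor:
  enabled: true
  retentionResolutionRaw: 7d    # raw samples kept for 7 days
  retentionResolution5m: 30d    # 5m-downsampled blocks kept for 30 days
  retentionResolution1h: 10y    # 1h-downsampled blocks kept for 10 years
  consistencyDelay: 30m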
LAQU156 commented 1 year ago

IVV_CCB_2023_w01 : Accepted CS, Priority major

pcuq-ads commented 1 year ago

The bucket cluster-ops-thanos contained around 200 GB on 28/12/2022. image.png

So the hypothesis that the S3 bucket was cleaned is not a good one.

pcuq-ads commented 1 year ago

Restarting the Thanos PODs did not solve the issue.

pcuq-ads commented 1 year ago

Another hypothesis: the Thanos compactor does not have enough CPU and RAM to run.

image.png

image.png

pcuq-ads commented 1 year ago

Is it normal to have this configuration for thanos/compactor: storageClass: "ceph-block"?
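
For context, a sketch of where this setting typically sits in such a values file (the persistence block layout is an assumption; only the storageClass value comes from this thread). The compactor uses this volume as its local working directory (--data-dir=/data above), not as the long-term metric store, which remains the S3 bucket:

compactor:
  persistence:
    enabled: true
    storageClass: "ceph-block"   # local working volume for --data-dir, not the metric store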

LAQU156 commented 1 year ago

CS_CCB_2023_w05 : Analysis requested

LAQU156 commented 1 year ago

CS_CCB_2023_w06 : No news, but this issue has high urgency

LAQU156 commented 1 year ago

CS_CCB_2023_w07 : Still waiting for analysis from CS dev team

LAQU156 commented 1 year ago

CS_CCB_2023_w08 : No answer on this issue for now

nleconte-csgroup commented 1 year ago

@pcuq-ads @suberti-ads We partially reproduced the behaviour on the IVV cluster: we cannot see older metric values in Grafana, but we did not observe data loss after a restart.

However, we were able to retrieve older values with some tuning. We had to enable auto-downsampling on the Thanos querier and restart it. After that, we can see some older points : image

Then we have to play with the Min step option of the Grafana query, e.g. using 24h : image

Do you want to test it on the OPS platform?
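
For reference, a minimal sketch of what the change looks like on the thanos query container args (the flag is the upstream --query.auto-downsampling; the surrounding lines are illustrative, mirroring the compactor excerpt above, not the actual OPS manifest):

[...]
  - args:
    - query
    - --log.level=info
    - --http-address=0.0.0.0:10902
    # added: automatically serve downsampled (5m/1h) blocks when the request
    # does not specify max_source_resolution
    - --query.auto-downsampling
[...]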

eroan-marie commented 1 year ago

Was this tested on the OPS platform? If not, we are ready to test it on your go-ahead.

pcuq-ads commented 1 year ago

I am not sure that we have identified the root cause of the problem. Nevertheless, I am OK to test this setting on the OPS platform. Thank you. Regards

suberti-ads commented 1 year ago

Dear @nleconte-csgroup ,

We have to enable auto-downsampling on the thanos querier and restart the thanos querier. After that, we can see some older points 

How do we activate this option? As described in the compactor configuration, we want:

between now and 7d: use retentionResolutionRaw
between 7d and 30d: use retentionResolution5m
between 30d and 10y: use retentionResolution1h

I see in https://thanos.io/v0.14/components/query.md/#auto-downsampling that auto-downsampling can be enabled on the querier; should I add the max_source_resolution option?
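
One possible way to enable it, assuming the bitnami/thanos chart in use exposes extraFlags for the query component (the flag itself is the upstream one documented at the link above):

query:
  extraFlags:
    - --query.auto-downsampling   # assumption: query.extraFlags is supported by the chart version deployed

max_source_resolution is otherwise a per-request parameter of the querier's HTTP API; with --query.auto-downsampling enabled, the querier picks a resolution automatically when the parameter is absent, so on the Grafana side only the per-panel Min step usually needs adjusting.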

pcuq-ads commented 1 year ago

Dear @eroan-marie, could you please proceed with the change? Thank you

LAQU156 commented 1 year ago

CS_CCB_2023_w10 : In progress, propose workaround deployed on OPS on 2023-03-07, waiting for feedback

pcuq-ads commented 1 year ago

OPS feedback: we see some points in the past. The fix seems to be OK. I propose to keep the anomaly under observation for 1 month.

pcuq-ads commented 1 year ago

@nleconte-csgroup Unfortunately, after the infrastructure 1.5 deployment, metrics are no longer available before 01/02/2023.

image

Two possibilities:

Regards

nleconte-csgroup commented 1 year ago

@pcuq-ads I just checked and it seems OK: image

Please mind the "Min step" parameter in your query. It should not be greater than the resolution configured on Thanos.

LAQU156 commented 1 year ago

CS_CCB_2023_w11 : Moved into "Refused CS" to place it on the OPS board, will be closed at the end of the month

pcuq-ads commented 1 year ago

SYS_CCB_2023_w12 : We can close sooner. The issue is fixed.

pcuq-ads commented 1 year ago

CANCEL comment