COPRS / rs-issues

This repository contains all the issues of the COPRS project (Scrum tickets, IVV bugs, epics, ...)

[BUG] [THANOS] No metrics available before 28/12/2022 00:00:00 #772

Closed: pcuq-ads closed this issue 1 year ago

pcuq-ads commented 1 year ago

Environment:

Current Behavior: From Grafana, we cannot reach any metrics older than 28/12/2022 00:00:00.

Expected Behavior: Thanos should keep metrics with 1H resolution for 10 years.

Steps To Reproduce: Open a Grafana Panel with Thanos data source. Request any metrics over time. The values older than 8 days are not accessible.

Test execution artefacts (i.e. logs, screenshots…): The Thanos datasource is configured as http://thanos-query.monitoring.svc.cluster.local:9090. The Thanos compactor also seems to be configured correctly (retention.resolution-1h = 10y). No lifecycle rule is set on the Orange Flexible side for the bucket named cluster-ops-thanos.

image.png
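
For reference, a minimal Grafana datasource provisioning sketch matching the URL reported above (the provisioning file layout is standard Grafana, not taken from the OPS deployment):

# Sketch only: Thanos exposed to Grafana as a Prometheus-compatible datasource.
apiVersion: 1
datasources:
  - name: Thanos
    type: prometheus
    access: proxy
    url: http://thanos-query.monitoring.svc.cluster.local:9090
    isDefault: false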

Whenever possible, first analysis of the root cause

Bug Generic Definition of Ready (DoR)

Bug Generic Definition of Done (DoD)

pcuq-ads commented 1 year ago

I propose major as priority (to be discussed at the next CCB) because without this feature we cannot compute the availability Performance Indicator.

suberti-ads commented 1 year ago

Hereafter the Thanos compactor pod configuration: kubectl get po -n monitoring thanos-compactor-7b69757dbb-642wx -o yaml

[...]
  - args:
    - compact
    - --log.level=info
    - --log.format=logfmt
    - --http-address=0.0.0.0:10902
    - --data-dir=/data
    - --retention.resolution-raw=7d
    - --retention.resolution-5m=30d
    - --retention.resolution-1h=10y
    - --consistency-delay=30m
    - --objstore.config-file=/conf/objstore.yml
    - --wait
    image: docker.io/bitnami/thanos:0.23.1-scratch-r3
[...]
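
For reference, a minimal sketch of the same retention settings expressed as bitnami/thanos Helm values (the value names match the retentionResolution* naming used later in this thread; the exact chart layout is an assumption):

# Sketch only: Helm values that would render the compactor args shown above.
compactor:
  enabled: true
  retentionResolutionRaw: 7d    # raw samples kept for 7 days
  retentionResolution5m: 30d    # 5m-downsampled blocks kept for 30 days
  retentionResolution1h: 10y    # 1h-downsampled blocks kept for 10 years
  consistencyDelay: 30m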
LAQU156 commented 1 year ago

IVV_CCB_2023_w01 : Accepted CS, Priority major

pcuq-ads commented 1 year ago

The bucket cluster-ops-thanos contained around 200 GB on 28/12/2022. image.png

So the hypothesis that the S3 bucket was cleaned is not a good one.

pcuq-ads commented 1 year ago

Restarting the Thanos PODs did not solve the issue.

pcuq-ads commented 1 year ago

Another hypothesis: the Thanos compactor does not have enough CPU and RAM to run.

image.png

image.png

pcuq-ads commented 1 year ago

Is it normal to have this configuration for thanos/compactor: storageClass: "ceph-block"?
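
For context, a sketch of where this setting typically sits in such a values file (the persistence block layout is an assumption; only the storageClass value comes from this thread). The compactor uses this volume as its local working directory (--data-dir=/data above), not as the long-term metric store, which remains the S3 bucket:

compactor:
  persistence:
    enabled: true
    storageClass: "ceph-block"   # local working volume for --data-dir, not the metric store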

LAQU156 commented 1 year ago

CS_CCB_2023_w05 : Analysis requested

LAQU156 commented 1 year ago

CS_CCB_2023_w06 : No news, but this issue has high urgency

LAQU156 commented 1 year ago

CS_CCB_2023_w07 : Still waiting for analysis from CS dev team

LAQU156 commented 1 year ago

CS_CCB_2023_w08 : No answer on this issue for now

nleconte-csgroup commented 1 year ago

@pcuq-ads @suberti-ads We partially reproduced the behaviour on the IVV cluster: we cannot see older metric values in Grafana, but we did not observe data loss after a restart.

However, we were able to retrieve older values with some tuning. We had to enable auto-downsampling on the Thanos querier and restart it. After that, we can see some older points : image

Then we have to play with the Min step option of the Grafana query, e.g. using 24h : image

Do you want to test it on the OPS platform?
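
For reference, a minimal sketch of what the change looks like on the thanos query container args (the flag is the upstream --query.auto-downsampling; the surrounding lines are illustrative, mirroring the compactor excerpt above, not the actual OPS manifest):

[...]
  - args:
    - query
    - --log.level=info
    - --http-address=0.0.0.0:10902
    # added: automatically serve downsampled (5m/1h) blocks when the request
    # does not specify max_source_resolution
    - --query.auto-downsampling
[...]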

eroan-marie commented 1 year ago

Was this tested on the OPS platform? If not, we are ready to test it on your go-ahead.

pcuq-ads commented 1 year ago

I am not sure that we have identified the root cause of the problem. Nevertheless, I am OK to test this setting on the OPS platform. Thank you. Regards

suberti-ads commented 1 year ago

Dear @nleconte-csgroup ,

We have to enable auto-downsampling on the thanos querier and restart the thanos querier. After that, we can see some older points 

How do we activate this option? As described in the compactor configuration, we want:

between now and 7d: use retentionResolutionRaw
between 7d and 30d: use retentionResolution5m
between 30d and 10y: use retentionResolution1h

I see in https://thanos.io/v0.14/components/query.md/#auto-downsampling that auto-downsampling can be enabled on the querier; should I add the max_source_resolution option?
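
One possible way to enable it, assuming the bitnami/thanos chart in use exposes extraFlags for the query component (the flag itself is the upstream one documented at the link above):

query:
  extraFlags:
    - --query.auto-downsampling   # assumption: query.extraFlags is supported by the chart version deployed

max_source_resolution is otherwise a per-request parameter of the querier's HTTP API; with --query.auto-downsampling enabled, the querier picks a resolution automatically when the parameter is absent, so on the Grafana side only the per-panel Min step usually needs adjusting.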

pcuq-ads commented 1 year ago

Dear @eroan-marie, could you please proceed with the change? Thank you

LAQU156 commented 1 year ago

CS_CCB_2023_w10 : In progress, propose workaround deployed on OPS on 2023-03-07, waiting for feedback

pcuq-ads commented 1 year ago

OPS feedback: we see some points in the past. The fix seems to be OK. I propose to keep the anomaly under observation for 1 month.

pcuq-ads commented 1 year ago

@nleconte-csgroup Unfortunately, after the infrastructure 1.5 deployment, metrics are no longer available before 01/02/2023.

image

Two possibilities:

Regards

nleconte-csgroup commented 1 year ago

@pcuq-ads I just checked and it seems OK: image

Please mind the "Min step" parameter in your query. It should not be greater than the resolution configured on Thanos.

LAQU156 commented 1 year ago

CS_CCB_2023_w11 : Moved into "Refused CS" to place it on the OPS board, will be closed at the end of the month

pcuq-ads commented 1 year ago

SYS_CCB_2023_w12 : We can close sooner. The issue is fixed.

pcuq-ads commented 1 year ago

CANCEL comment