cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Lost data during compaction on Swift #4055

Open ubcharron opened 3 years ago

ubcharron commented 3 years ago

Describe the bug

The compactor somehow failed to upload the block's index file to Swift, but still deleted the source blocks. There are warnings in the logs, but the compactor does not seem to be aware of them. We lost one day of metrics for our main tenant. (I was hoping to be able to re-generate the index file from the chunks, but that doesn't seem possible as the chunk files only have samples, not the labels themselves.)

We opened a bug in Thanos (https://github.com/thanos-io/thanos/issues/3958), but we're wondering if Cortex would be the more relevant place for it?

To Reproduce

We're not sure how it happens, so here's our best attempt at recollection:

Running Cortex 1.7.0, the Compactor compacted a series of blocks. It then uploaded all resulting files to Swift, but the index file never made it there. In Swift's own logs, there is no trace of the index file ever being uploaded. We think an error might have been detected by "CloseWithLogOnErr", but it never made its way back to the Compactor (since it runs deferred) and was thus ignored.
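To illustrate what we suspect is happening, here is a rough Go sketch of the pattern (illustrative only, not the actual Thanos objstore code): the writer is closed by a deferred, log-only helper, so an error that only surfaces at Close() never reaches the caller.

package sketch

import (
    "io"
    "log"
)

// upload sketches the suspected pattern: the writer is closed by a deferred,
// log-only helper, so an error surfacing at Close() is never returned.
func upload(w io.WriteCloser, r io.Reader) error {
    defer func() {
        // If the final flush to the object store fails here, the error is
        // logged ("detected close error") but discarded.
        if cerr := w.Close(); cerr != nil {
            log.Printf("detected close error: %v", cerr)
        }
    }()

    // io.Copy can report success while data is still buffered and only hits
    // the backend at Close(), so a nil return does not guarantee the object
    // was actually stored.
    _, err := io.Copy(w, r)
    return err
}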

See logs below.

Expected behavior

The Compactor would retry sending a file if there is an error.
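For example, something along these lines (purely illustrative; uploadWithRetry and doUpload are made-up names for this sketch, not existing Cortex or Thanos APIs), with the caveat that a retry only helps if the upload actually surfaces its error, including one detected while closing the writer:

package sketch

import (
    "context"
    "time"
)

// uploadWithRetry is a hypothetical wrapper: retry a failed upload a few
// times with a crude linear backoff. It only helps if doUpload really
// returns an error when the upload fails.
func uploadWithRetry(ctx context.Context, attempts int, doUpload func(context.Context) error) error {
    var err error
    for i := 0; i < attempts; i++ {
        if err = doUpload(ctx); err == nil {
            return nil
        }
        select {
        case <-time.After(time.Duration(i+1) * time.Second): // crude backoff
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    return err
}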

Environment:

Storage Engine

Additional Context

Compactor logs:

{  
  "caller": "runutil.go:124",
  "err": "upload object close: Timeout when reading or writing data",
  "level": "warn",
  "msg": "detected close error",
  "ts": "2021-03-20T05:12:44.877771796Z"
}
{
  "bucket": "tracing: cortex-tsdb-prod04",
  "caller": "objstore.go:159",
  "component": "compactor",
  "dst": "01F16ZRT8TYA08VJQR1ZPCC5EP/index",
  "from": "data/compact/0@14583055817248146110/01F16ZRT8TYA08VJQR1ZPCC5EP/index",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "debug",
  "msg": "uploaded file",
  "org_id": "1",
  "ts": "2021-03-20T05:12:44.877834603Z"
}
{
  "caller": "compact.go:810",
  "component": "compactor",
  "duration": "4m41.662527735s",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "info",
  "msg": "uploaded block",
  "org_id": "1",
  "result_block": "01F16ZRT8TYA08VJQR1ZPCC5EP",
  "ts": "2021-03-20T05:12:45.140243007Z"
}
{
  "caller": "compact.go:832",
  "component": "compactor",
  "group": "0@{__org_id__=\"1\"}",
  "groupKey": "0@14583055817248146110",
  "level": "info",
  "msg": "marking compacted block for deletion",
  "old_block": "01F15H6D6CXE1ASE788HQECHM4",
  "org_id": "1",
  "ts": "2021-03-20T05:12:45.627586825Z"
}
$ openstack object list cortex-tsdb-prod04 --prefix 1/01F16ZRT8TYA08VJQR1ZPCC5EP
+--------------------------------------------+
| Name                                       |
+--------------------------------------------+
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000001 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000002 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000003 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000004 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000005 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000006 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000007 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000008 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000009 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000010 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000011 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000012 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000013 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000014 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000015 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000016 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000017 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000018 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000019 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000020 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000021 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000022 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/chunks/000023 |
| 1/01F16ZRT8TYA08VJQR1ZPCC5EP/meta.json     |
+--------------------------------------------+
pstibrany commented 3 years ago

Thank you for the detailed report. As you point out, this needs to be fixed in Thanos first. We can keep this issue open until we import the updated Thanos into Cortex.

pracucci commented 3 years ago

Thanks for your report.

There's one thing I can't understand, though. You mentioned the compactor logged "uploaded file" for 01F16ZRT8TYA08VJQR1ZPCC5EP/index: this log gets logged once the upload successfully completes (from the client perspective), so, for the compactor, the index was successfully uploaded.

ubcharron commented 3 years ago

From my understanding, an error happened in Upload(), but it was only caught by the deferred CloseWithLogOnErr(). Upload() then falsely returned "success" to the compactor, since it did not know that Close() had failed. Based on Upload()'s return value, the compactor thinks the upload succeeded, even though it failed.

pstibrany commented 3 years ago

> this log gets logged once the upload successfully completes (from the client perspective), so, for the compactor, the index was successfully uploaded.

When using Swift, Upload returns success even if it fails when closing the writer :(

pracucci commented 3 years ago

> When using Swift, Upload returns success even if it fails when closing the writer

Oh, it's a Swift client issue! Let's fix it ;)
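The general shape of such a fix (a sketch for illustration, not quoted from any actual change) would be to capture the Close() error into Upload's return value instead of only logging it:

package sketch

import "io"

// upload sketches the fix: a Close() error is captured into the named return
// value, so a failed final flush makes the upload report failure.
func upload(w io.WriteCloser, r io.Reader) (err error) {
    defer func() {
        if cerr := w.Close(); cerr != nil && err == nil {
            err = cerr
        }
    }()
    _, err = io.Copy(w, r)
    return err
}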

bcharron commented 3 years ago

This issue is resolved as of Thanos v0.21.0. Would it be possible to update the Thanos vendor pkg, or otherwise cherry-pick the PR? (https://github.com/thanos-io/thanos/pull/4218)

wilfriedroset commented 2 years ago

I'm facing a somewhat similar situation while assessing Cortex, where I'm unable to query data from Swift due to missing index files.

Environment:

Storage Engine:

Additional information:

In the Swift logs we can see

fake/01FJDY7WZEWAS8JWHDJMYZP4K9/index PUT 499

Using the swift CLI we can see that the index file is missing for three blocks

❯ swift list cortex-lts | grep -E '01FJDSQ6S8KV3VX8VY8H3Z140V|01FJDWPEYHYNN05G8EHP1J2RTN|01FJDY7WZEWAS8JWHDJMYZP4K9'
fake/01FJDSQ6S8KV3VX8VY8H3Z140V/chunks/000001
fake/01FJDSQ6S8KV3VX8VY8H3Z140V/meta.json
fake/01FJDWPEYHYNN05G8EHP1J2RTN/chunks/000001
fake/01FJDWPEYHYNN05G8EHP1J2RTN/meta.json
fake/01FJDY7WZEWAS8JWHDJMYZP4K9/chunks/000001
fake/01FJDY7WZEWAS8JWHDJMYZP4K9/meta.json

From my queriers, I see the following errors in several places

Oct 21 15:05:48 cortex-querier-1 cortex[15027]: ts=2021-10-21T15:05:48.812127974Z caller=spanlogger.go:87 org_id=fake traceID=3bfc710c00db95e0 method=blocksStoreQuerier.selectSorted level=warn msg="unable to get store-gateway clients while retrying to fetch missing blocks" err="no store-gateway instance left after filtering out excluded instances for block 01FJDY7WZEWAS8JWHDJMYZP4K9"
Oct 21 15:05:48 cortex-querier-1 cortex[15027]: ts=2021-10-21T15:05:48.812191759Z caller=spanlogger.go:87 org_id=fake traceID=3bfc710c00db95e0 method=blocksStoreQuerier.selectSorted level=warn org_id=fake traceID=3bfc710c00db95e0 msg="failed consistency check" err=null
--8<--
Oct 22 11:59:40 cortex-querier-2 cortex[15649]: level=warn ts=2021-10-22T11:59:40.998663714Z caller=loader.go:117 msg="bucket index not found" user=fake
Oct 22 11:59:41 cortex-querier-2 cortex[15649]: level=warn ts=2021-10-22T11:59:41.00451822Z caller=loader.go:117 msg="bucket index not found" user=fake
Oct 22 11:59:41 cortex-querier-2 cortex[15649]: level=warn ts=2021-10-22T11:59:41.046772046Z caller=loader.go:117 msg="bucket index not found" user=fake

From my ingesters

Oct 21 13:02:05 cortex-ingester-6 cortex[2308]: level=warn ts=2021-10-21T13:02:05.473040855Z caller=runutil.go:126 msg="detected close error" err="upload object close: Timeout when reading or writing data"
Oct 22 11:14:23 cortex-ingester-4 cortex[7467]: level=warn ts=2021-10-22T11:14:23.812654644Z caller=runutil.go:126 msg="detected close error" err="upload object close: Timeout when reading or writing data"
Oct 20 05:18:33 cortex-ingester-7 cortex[1179]: level=warn ts=2021-10-20T05:18:33.646509588Z caller=runutil.go:126 msg="detected close error" err="upload object close: Timeout when reading or writing data"
Oct 20 05:24:37 cortex-ingester-7 cortex[1179]: level=warn ts=2021-10-20T05:24:37.712088727Z caller=runutil.go:126 msg="detected close error" err="upload object close: Timeout when reading or writing data"
Oct 20 19:02:21 cortex-ingester-7 cortex[2350]: level=warn ts=2021-10-20T19:02:21.16311545Z caller=runutil.go:126 msg="detected close error" err="upload object close: Timeout when reading or writing data"
Oct 22 11:14:15 cortex-ingester-7 cortex[7798]: level=warn ts=2021-10-22T11:14:15.677021073Z caller=runutil.go:126 msg="detected close error" err="upload object close: Timeout when reading or writing data"

From the store-gateways

Oct 21 14:57:30 cortex-store-gateway-1 cortex[14654]: level=warn ts=2021-10-21T14:57:30.795011606Z caller=bucket.go:553 org_id=fake msg="loading block failed" elapsed=87.156225ms id=01FJDY7WZEWAS8JWHDJMYZP4K9 err="create index header reader: write index header: new index reader: get object attributes of 01FJDY7WZEWAS8JWHDJMYZP4K9/index: get object attributes: Object Not Found"
Oct 21 16:48:53 cortex-store-gateway-1 cortex[2069]: level=warn ts=2021-10-21T16:48:53.995736146Z caller=bucket.go:553 org_id=fake msg="loading block failed" elapsed=61.877246ms id=01FJDY7WZEWAS8JWHDJMYZP4K9 err="create index header reader: write index header: new index reader: get object attributes of 01FJDY7WZEWAS8JWHDJMYZP4K9/index: get object attributes: Object Not Found"
Oct 21 16:48:57 cortex-store-gateway-1 cortex[2069]: level=warn ts=2021-10-21T16:48:57.106213354Z caller=bucket.go:553 org_id=fake msg="loading block failed" elapsed=67.927973ms id=01FJDWPEYHYNN05G8EHP1J2RTN err="create index header reader: write index header: new index reader: get object attributes of 01FJDWPEYHYNN05G8EHP1J2RTN/index: get object attributes: Object Not Found"
Oct 21 16:49:05 cortex-store-gateway-1 cortex[2069]: level=warn ts=2021-10-21T16:49:05.626565425Z caller=bucket.go:553 org_id=fake msg="loading block failed" elapsed=97.141065ms id=01FJDSQ6S8KV3VX8VY8H3Z140V err="create index header reader: write index header: new index reader: get object attributes of 01FJDSQ6S8KV3VX8VY8H3Z140V/index: get object attributes: Object Not Found"