cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Compactor getting crashed during compaction #5603

Open cmg1986 opened 1 year ago

cmg1986 commented 1 year ago

Describe the bug One of the compactors crashes continuously while compacting the blocks of a larger tenant.

To Reproduce Steps to reproduce the behavior:

  1. Start Cortex 1.15.3
  2. OS: CentOS 7
  3. Start the compaction process

Expected behavior I expect the compaction process to run smoothly, or at least not to crash when it encounters an error.

Environment:

Additional Context

`{"caller":"compact.go:1291","component":"compactor","level":"info","msg":"start sync of metas","org_id":"GC","ts":"2023-10-17T08:04:48.841909267Z"}
{"caller":"fetcher.go:327","component":"block.BaseFetcher","concurrency":20,"level":"debug","msg":"fetching meta data","org_id":"GC","ts":"2023-10-17T08:04:48.842241761Z"}
{"cached":1124,"caller":"fetcher.go:478","component":"block.BaseFetcher","duration":"1.404187325s","duration_ms":1404,"level":"info","msg":"successfully synchronized block metadata","org_id":"GC","partial":0,"returned":1123,"ts":"2023-10-17T08:04:50.246310867Z"}
{"caller":"compact.go:1296","component":"compactor","level":"info","msg":"start of GC","org_id":"GC","ts":"2023-10-17T08:04:50.246405036Z"}
{"caller":"compact.go:1319","component":"compactor","level":"info","msg":"start of compactions","org_id":"GC","ts":"2023-10-17T08:04:50.344171068Z"}
{"caller":"compact.go:1005","component":"compactor","group":"0@{__org_id__=\"GC\"}","groupKey":"0@7253914978157373696","level":"info","msg":"compaction available and planned; downloading blocks","org_id":"GC","plan":"[01HCW6SYRH47K26KZS2H3K3H18 (min time: 1697414400000, max time: 1697457600000) 01HCW98DQ8Z3A1VF9ZF71SYJZX (min time: 1697450400000, max time: 1697457600000) 01HCW9AQXPMJ3NYDM32BRHA8YH (min time: 1697450400000, max time: 1697457600000) 01HCW9ARN3G0EJMAS4HB3Z8W3F (min time: 1697450400000, max time: 1697457600000) 01HCW9DF4ZGG25BRS09B7F74X9 (min time: 1697450400000, max time: 1697457600000) 01HCW9ANK1TBBD3NDEYW55MEAD (min time: 1697450400000, max time: 1697457600000) 01HCW94VNCB1A1M8S1FVCZWH65 (min time: 1697450400000, max time: 1697457600000) 01HCW9ABKN7SD08GW53CT8B4Q5 (min time: 1697450400000, max time: 1697457600000) 01HCW9A7P4NPB9F615KBD6TGZY (min time: 1697450400000, max time: 1697457600000) 01HCW9ANSN8858929MX4NP5839 (min time: 1697450400000, max time: 1697457600000) 01HCW9AWTXMJC433BS4S2EFMJB (min time: 1697450400000, max time: 1697457600000) 01HCW9AQD3T052P8BWNAMCCQ2D (min time: 1697450400000, max time: 1697457600000) 01HCW97R6N93ERRHQ1XFE06Q5E (min time: 1697450400000, max time: 1697457600000) 01HCW9D4BQRM32S1F6XAZ2C0QZ (min time: 1697450400000, max time: 1697457600000) 01HCW99NCXR8BEWJ04MJB4JRSF (min time: 1697450400000, max time: 1697457600000) 01HCW95E83727PJC00DP7WXX3T (min time: 1697450400000, max time: 1697457600000) 01HCW9AR1S71JZD4B79M1ADYYE (min time: 1697450400000, max time: 1697457600000) 01HCW94Y269VJ81V1GM37V1V3E (min time: 1697450400000, max time: 1697457600000) 01HCW954EP3QMMD0H0QYCAN2X7 (min time: 1697450400000, max time: 1697457600000) 01HCW95G5TXN2T60RWCG3NWXBH (min time: 1697450400000, max time: 1697457600000) 01HCW9AW5JVG0J20YW7037YSCF (min time: 1697450400000, max time: 1697457600000) 01HCW7J5V1Y4AES2VXQZ67JHW7 (min time: 1697450400000, max time: 1697457600000) 01HCW9AGP7AFBXPV0QG1Z9DAJN (min time: 1697450400000, max time: 
1697457600000) 01HCW9APYVBEFJP9JBAM2WA4M0 (min time: 1697450400001, max time: 1697457600000) 01HCW957JWMX807P9W0P7B9YY7 (min time: 1697450400001, max time: 1697457600000) 01HCW9CNHZEVJS43N82XJXEP4J (min time: 1697450400001, max time: 1697457600000) 01HCW9D58PM10G7RGAK70CZ6A1 (min time: 1697450400001, max time: 1697457600000) 01HCW950D8AKTNBZGM9GN2HZM7 (min time: 1697450400001, max time: 1697457600000) 01HCW98QZ5Q9SMNZEZCFNJ4XN8 (min time: 1697450400001, max time: 1697457600000) 01HCW95AZ2R9CX4XYT88B1QHXS (min time: 1697450400001, max time: 1697457600000) 01HCW7XY0ZESYAGQVJ3MKDMQKX (min time: 1697450400001, max time: 1697457600000) 01HCW9AVRCS1CZW1VXBMPX7803 (min time: 1697450400001, max time: 1697457600000) 01HCW9DD3FW6KQ841P27KJK5D8 (min time: 1697450400001, max time: 1697457600000) 01HCW9BWNJ9Z98DCD42CN92VRZ (min time: 1697450400001, max time: 1697457600000) 01HCW99Y71N7DHP38XP2FWZBE4 (min time: 1697450400001, max time: 1697457600000) 01HCW9DMC30C8ZVA16NHK4S2RB (min time: 1697450400001, max time: 1697457600000) 01HCW9AWVA3VJXG8CP6SPS1GT8 (min time: 1697450400001, max time: 1697457600000) 01HCW7P0P6KPJN24KZ6CS69XA2 (min time: 1697450400001, max time: 1697457600000) 01HCW95ASC3N3PQ50SFF999XXS (min time: 1697450400001, max time: 1697457600000) 01HCW9AMJDK4FVFCSYA961S9DW (min time: 1697450400001, max time: 1697457600000) 01HCW953R3T6059ZGWHHXXGMWB (min time: 1697450400001, max time: 1697457600000) 01HCW955ZYPHMM71EXS4CCCNPK (min time: 1697450400001, max time: 1697457600000) 01HCW8AFZW1JHJDT89ZQQ2H9G6 (min time: 1697450400001, max time: 1697457600000) 01HCW94YTXY53B5YCNAMXAHDSQ (min time: 1697450400001, max time: 1697457600000) 01HCW9B5D7BPFY06YJSBN9B5KA (min time: 1697450400001, max time: 1697457600000) 01HCW9DSZXTMS823J18NH66AQM (min time: 1697450400001, max time: 1697457600000) 01HCW99FXNGW1C9WS6WBQDCC3K (min time: 1697450400002, max time: 1697457600000) 01HCW9D43HHR6GA4MARC1Z7J2M (min time: 1697450400002, max time: 1697457600000) 01HCW9AH0HYD58VTWGG4S5EX99 (min 
time: 1697450400002, max time: 1697457600000) 01HCW9AMNZ5C7X5GB1EXG6NJ6R (min time: 1697450400002, max time: 1697457600000) 01HCW98FVQ9RHCBJ9ZBYPSMSSV (min time: 1697450400002, max time: 1697457600000) 01HCW9CAWMRX7YC91BMS59A12E (min time: 1697450400002, max time: 1697457600000) 01HCW9B4BET2CDAKS0A09Q9AJ8 (min time: 1697450400002, max time: 1697457600000) 01HCW96GKHN26HVQHJZ1HQEA3P (min time: 1697450400003, max time: 1697457600000) 01HCW9DE1MZXD7G5ZD3HVNSJ47 (min time: 1697450400003, max time: 1697457600000) 01HCW843AWYT8HQG1VGAW35FQK (min time: 1697450400003, max time: 1697457600000) 01HCW9BREG7X0VT5G1TGYK58QA (min time: 1697450400003, max time: 1697457600000) 01HCW9ARM74A21RQ5BKYXE01YS (min time: 1697450400004, max time: 1697457600000) 01HCW9BRFJV2C94ZJCCJ7HXWT5 (min time: 1697450400004, max time: 1697457600000) 01HCW97ATEXAANENGD472D0EBB (min time: 1697450400004, max time: 1697457600000) 01HCW9547581FW2P1PHTFQ4W9K (min time: 1697450400004, max time: 1697457600000) 01HCW9CVH17X1TTCPJ245XKVQD (min time: 1697450400004, max time: 1697457600000) 01HCW95528PDASMJ5MKVGYHCFQ (min time: 1697450400004, max time: 1697457600000) 01HCW9B5CAD1K9VR0H2EH5RR96 (min time: 1697450400006, max time: 1697457600000) 01HCW9AH4HQBTQM9CMGBMDC8Y8 (min time: 1697450400006, max time: 1697457600000) 01HCW9BWP8AWSFN8MH4P3SYR3C (min time: 1697450400006, max time: 1697457600000) 01HCW95T2VK2CXR65BTQD3XYR2 (min time: 1697450400006, max time: 1697457600000) 01HCW9DVMVAFPXDV4N06GJG4WP (min time: 1697450400006, max time: 1697457600000) 01HCW950VCN1ZBW4TVH30CEPVD (min time: 1697450400006, max time: 1697457600000) 01HCW9B3SAKJEQ1DZRE93MBZCF (min time: 1697450400006, max time: 1697457600000) 01HCW9BB6SX02BF8QE5CT2QBTH (min time: 1697450400006, max time: 1697457600000) 01HCW959YP7JRHPT2ZT3GBYZ3V (min time: 1697450400009, max time: 1697457600000) 01HCW9DRNX544RNFF29C68FT4M (min time: 1697450400009, max time: 1697457600000) 01HCW96AB3CSZ2K77EQMGSAHD6 (min time: 1697450400010, max time: 1697457600000) 
01HCWA6VQ5698J86B2NSQVV9WW (min time: 1697452876670, max time: 1697457600000) 01HCWA1RR752S54GJJZX2ZNRKR (min time: 1697452884390, max time: 1697457600000) 01HCW9JHHTZQPD436C8E5WMJFC (min time: 1697452886569, max time: 1697457600000) 01HCWAZM8G2RRKZTEGQ8WG6T0Y (min time: 1697452886758, max time: 1697457600000) 01HCWAHE4K84ZE3S7NXN1E3SHK (min time: 1697452886872, max time: 1697457600000) 01HCW9Q7CTFV22CCYBY8EPETEG (min time: 1697452886972, max time: 1697457600000) 01HCW9WKR852BHJ0ZYQ5790RFZ (min time: 1697452888899, max time: 1697457600000) 01HCWAB5J0897T9VJ26SA36JEV (min time: 1697452888899, max time: 1697457600000) 01HCWAV8FH3R00RQ744VXKEBG9 (min time: 1697452891793, max time: 1697457600000) 01HCWAPP4P47H6GJTE57BQHZ00 (min time: 1697452891793, max time: 1697457600000) 01HCW8DWXHTH4X6WEGJMAMWK5A (min time: 1697452997619, max time: 1697457600000) 01HCW90AKHZENW3V8TBMP9BNB0 (min time: 1697453001867, max time: 1697457600000) 01HCW7TTM6RV53EZ3SSMQ911JK (min time: 1697453060709, max time: 1697457600000) 01HCW7TTM06F8T792ZNGD076EY (min time: 1697453060709, max time: 1697457600000) 01HCWB59HYXASVYR7MH5PJZKCC (min time: 1697456120833, max time: 1697457600000) 01HCWBBHSAXY9D96NQWZATP0A6 (min time: 1697456463481, max time: 1697457600000) 01HCWBKC1F9GC9CMM2YMCPQC6E (min time: 1697456605750, max time: 1697457600000) 01HCWBSSFF9473CPJRGKCR2P2X (min time: 1697456760770, max time: 1697457600000) 01HCWBZWPT5WGKNBZHZARDA3GT (min time: 1697456929949, max time: 1697457600000) 01HCWC7KWQB4MPXM4VTZ9RA6AG (min time: 1697457021954, max time: 1697457600000) 01HCWCDR19J0HG9NG3QPB63Y7Y (min time: 1697457299399, max time: 1697457600000) 01HCWCKGTA82F9CBRYRFKXPQAR (min time: 1697457470943, max time: 1697457600000)]","ts":"2023-10-17T08:04:50.43689535Z"}
{"caller":"objstore.go:361","component":"compactor","file":"01HCW9AQXPMJ3NYDM32BRHA8YH/meta.json","group":"0@{__org_id__=\"GC\"}","groupKey":"0@7253914978157373696","level":"debug","msg":"not downloading again because a provided path matches this one","org_id":"GC","ts":"2023-10-17T08:04:50.449742708Z"}
{"caller":"objstore.go:361","component":"compactor","file":"01HCW98DQ8Z3A1VF9ZF71SYJZX/meta.json","group":"0@{__org_id__=\"GC\"}","groupKey":"0@7253914978157373696","level":"debug","msg":"not downloading again because a provided path matches this one","org_id":"GC","ts":"2023-10-17T08:04:50.453705345Z"}
{"caller":"objstore.go:361","component":"compactor","file":"01HCW6SYRH47K26KZS2H3K3H18/meta.json","group":"0@{__org_id__=\"GC\"}","groupKey":"0@7253914978157373696","level":"debug","msg":"not downloading again because a provided path matches this one","org_id":"GC","ts":"2023-10-17T08:04:50.453822594Z"}
unexpected fault address 0x7ffef52a17ad
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7ffef52a17ad pc=0x9bca58]

goroutine 1792576 [running]:
runtime.throw({0x2863df0?, 0xc001b08fe0?})
        /usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc001b08f70 sp=0xc001b08f40 pc=0x43907d
runtime.sigpanic()
        /usr/local/go/src/runtime/signal_unix.go:834 +0x125 fp=0xc001b08fd0 sp=0xc001b08f70 pc=0x44fde5
github.com/dennwc/varint.Uvarint({0x7ffef52a17ad?, 0xc001b09010?, 0x455339?})
        /__w/cortex/cortex/vendor/github.com/dennwc/varint/varint.go:75 +0x18 fp=0xc001b08fd8 sp=0xc001b08fd0 pc=0x9bca58
github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).Uvarint64(0xc001b09060)
        /__w/cortex/cortex/vendor/github.com/prometheus/prometheus/tsdb/encoding/encoding.go:242 +0x3e fp=0xc001b09000 sp=0xc001b08fd8 pc=0xce817e
github.com/prometheus/prometheus/tsdb/encoding.(*Decbuf).UvarintBytes(0xc001b09060)
        /__w/cortex/cortex/vendor/github.com/prometheus/prometheus/tsdb/encoding/encoding.go:206 +0x25 fp=0xc001b09020 sp=0xc001b09000 pc=0xce7f25
github.com/prometheus/prometheus/tsdb/index.Symbols.Lookup({{0x2e8a720, 0xc001cda000}, 0x2, 0x5, {0xc005280000, 0x14a50, 0x14a50}, 0x2949f1}, 0x1018)
        /__w/cortex/cortex/vendor/github.com/prometheus/prometheus/tsdb/index/index.go:1316 +0x245 fp=0xc001b09098 sp=0xc001b09020 pc=0xcf29c5
github.com/prometheus/prometheus/tsdb/index.(*Reader).lookupSymbol(0xc00062e360, 0x428f25?)
        /__w/cortex/cortex/vendor/github.com/prometheus/prometheus/tsdb/index/index.go:1447 +0xd8 fp=0xc001b09138 sp=0xc001b09098 pc=0xcf3818
github.com/prometheus/prometheus/tsdb/index.(*Reader).lookupSymbol-fm(0x1b09210?)
        <autogenerated>:1 +0x2b fp=0xc001b09158 sp=0xc001b09138 pc=0xcfdceb
github.com/prometheus/prometheus/tsdb/index.(*Decoder).Series(0xc00102e000, {0x7fff2ef9c7c2?, 0x39d437c0?, 0x7ffff7fbd108?}, 0xc001b09570, 0xc001b09558)
        /__w/cortex/cortex/vendor/github.com/prometheus/prometheus/tsdb/index/index.go:1861 +0x151 fp=0xc001b092c8 sp=0xc001b09158 pc=0xcf63d1
github.com/prometheus/prometheus/tsdb/index.(*Reader).Series(0xc00062e360, 0x39d437c, 0xc001b09570?, 0xc00364b380?)
        /__w/cortex/cortex/vendor/github.com/prometheus/prometheus/tsdb/index/index.go:1611 +0x127 fp=0xc001b09378 sp=0xc001b092c8 pc=0xcf49c7
github.com/thanos-io/thanos/pkg/block.GatherIndexHealthStats({0x2e7a3a0, 0xc00013d770}, {_, _}, _, _)
        /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/block/index.go:255 +0x566 fp=0xc001b09660 sp=0xc001b09378 pc=0x1972306
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func2.1.2({0x2e95968?, 0xc00013c140?})
        /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/compact/compact.go:1037 +0xdd fp=0xc001b09978 sp=0xc001b09660 pc=0x1dd26fd
github.com/thanos-io/thanos/pkg/tracing.DoInSpanWithErr({0x2e95968?, 0xc00013c140?}, {0x28a1f68?, 0x8?}, 0xc001b09f38, {0xc0013d8020?, 0xc003443b00?, 0x416050?})
        /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/tracing/tracing.go:82 +0xd0 fp=0xc001b09a18 sp=0xc001b09978 pc=0x1ccad10
github.com/thanos-io/thanos/pkg/compact.(*Group).compact.func2.1()
        /__w/cortex/cortex/vendor/github.com/thanos-io/thanos/pkg/compact/compact.go:1036 +0x45c fp=0xc001b09f78 sp=0xc001b09a18 pc=0x1dd203c
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /__w/cortex/cortex/vendor/golang.org/x/sync/errgroup/errgroup.go:75 +0x64 fp=0xc001b09fe0 sp=0xc001b09f78 pc=0xd012c4
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc001b09fe8 sp=0xc001b09fe0 pc=0x46fd61
created by golang.org/x/sync/errgroup.(*Group).Go
        /__w/cortex/cortex/vendor/golang.org/x/sync/errgroup/errgroup.go:72 +0xa5

goroutine 1 [select, 810 minutes]:
runtime.gopark(0xc000dbc068?, 0x2?, 0xe8?, 0xa4?, 0xc000dbc064?)
        /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0017fbef0 sp=0xc0017fbed0 pc=0x43bd96
runtime.selectgo(0xc0017fc068, 0xc000dbc060, 0x16374e0?, 0x0, 0x2e959a0?, 0x1)
        /usr/local/go/src/runtime/select.go:327 +0x7be fp=0xc0017fc030 sp=0xc0017fbef0 pc=0x44c03e
github.com/cortexproject/cortex/pkg/util/services.(*Manager).AwaitStopped(0xc001262720, {0x2e959a0, 0xc00007a028})
        /__w/cortex/cortex/pkg/util/services/manager.go:145 +0x6d fp=0xc0017fc098 sp=0xc0017fc030 pc=0x1639dcd
github.com/cortexproject/cortex/pkg/cortex.(*Cortex).Run(0xc000e60000)
        /__w/cortex/cortex/pkg/cortex/cortex.go:459 +0x925 fp=0xc0017fc260 sp=0xc0017fc098 pc=0x2140145
main.main()
        /__w/cortex/cortex/cmd/cortex/main.go:196 +0xdf0 fp=0xc0017fff80 sp=0xc0017fc260 pc=0x214d710
runtime.main()
        /usr/local/go/src/runtime/proc.go:250 +0x207 fp=0xc0017fffe0 sp=0xc0017fff80 pc=0x43b967
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0017fffe8 sp=0xc0017fffe0 pc=0x46fd61

goroutine 2 [force gc (idle), 810 minutes]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000110fb0 sp=0xc000110f90 pc=0x43bd96
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:387
runtime.forcegchelper()
        /usr/local/go/src/runtime/proc.go:305 +0xb0 fp=0xc000110fe0 sp=0xc000110fb0 pc=0x43bbd0
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc000110fe8 sp=0xc000110fe0 pc=0x46fd61
created by runtime.init.6
        /usr/local/go/src/runtime/proc.go:293 +0x25

goroutine 3 [GC sweep wait]:
runtime.gopark(0x42c5f01?, 0x0?, 0x0?, 0x0?, 0x0?)
        /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc000111780 sp=0xc000111760 pc=0x43bd96
runtime.goparkunlock(...)
        /usr/local/go/src/runtime/proc.go:387
runtime.bgsweep(0x0?)
        /usr/local/go/src/runtime/mgcsweep.go:319 +0xde fp=0xc0001117c8 sp=0xc000111780 pc=0x425e3e
runtime.gcenable.func1()
        /usr/local/go/src/runtime/mgc.go:178 +0x26 fp=0xc0001117e0 sp=0xc0001117c8 pc=0x41b0a6
runtime.goexit()
        /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001117e8 sp=0xc0001117e0 pc=0x46fd61
created by runtime.gcenable
        /usr/local/go/src/runtime/mgc.go:178 +0x6b`

The complete stack trace is available here: https://slack-files.com/T08PSQ7BQ-F062AEM0RA4-72eb738ee7

alexqyle commented 1 year ago

@cmg1986 What is the index size of each source block? And what is the compaction level of each source block?
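
For anyone collecting these two numbers, here is a minimal sketch. It assumes the source blocks have already been downloaded into a local directory, one sub-directory per block ULID, each containing the standard TSDB `meta.json` and `index` files. The function name `summarize_blocks` and the directory layout are assumptions made for illustration; this is not a Cortex or Thanos tool.

```python
import json
from pathlib import Path


def summarize_blocks(root):
    """Return (ulid, compaction level, index size in bytes) for each block.

    Assumes `root` holds one sub-directory per block, each with a TSDB
    `meta.json` and an `index` file (hypothetical local layout).
    """
    rows = []
    for block_dir in sorted(Path(root).iterdir()):
        meta_path = block_dir / "meta.json"
        if not meta_path.is_file():
            continue  # skip anything that is not a block directory
        meta = json.loads(meta_path.read_text())
        # The compaction level is stored under "compaction" -> "level".
        level = meta.get("compaction", {}).get("level")
        index_path = block_dir / "index"
        # None means the index file was not found locally.
        index_size = index_path.stat().st_size if index_path.is_file() else None
        rows.append((meta.get("ulid", block_dir.name), level, index_size))
    return rows
```

Calling `summarize_blocks("/path/to/downloaded/blocks")` (the path is a placeholder) returns one tuple per block, which can be sorted by index size to spot unusually large source blocks in the plan.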