celestiaorg / celestia-node

Celestia Data Availability Nodes
Apache License 2.0
935 stars 931 forks source link

Panic: runtime error: index out of range [3] with length 0 [recovered] #3881

Open Validatus opened 1 month ago

Validatus commented 1 month ago

Celestia Node version

Semantic version: v0.16.0 Commit: 6744f648649ebb5fee1b27faf7aca96ecf4519b2

OS

Rocky Linux 9.4

Install tools

We are building binaries from source to ensure complete control over the compilation process. All needed tools are present on the system.

Others

There are no known resource limitations.

Steps to reproduce it

celestia bridge start --metrics.tls=true --metrics --metrics.endpoint otel.celestia.observer --p2p.network celestia --node.store /storage/.celestia-bridge

Expected result

The expected result is to have the Celestia bridge node fully operational and syncing to the latest network head. This ensures that the node is connected to the network, validating and relaying blocks in real time, and staying up to date with the current state of the blockchain. This allows the bridge node to play its role in cross-chain communication or light client verification within the Celestia network.

Actual result

The panic message "index out of range [3] with length 0" suggests that the program tried to access an index in an array that doesn't exist, causing the program to crash.

panic: runtime error: index out of range [3] with length 0 [recovered]
    panic: 
    == Recovering from initIndex crash ==
    File Info: [ID: 9310013, Size: 502439, Zeros: 502439]
    isEnrypted: false checksumLen: 0 checksum:  indexLen: 0 index: [] 
    == Recovered ==

Relevant log output

/home/user/go/bin/celestia bridge start --metrics.tls=true --metrics --metrics.endpoint otel.celestia.observer --p2p.network celestia --node.store /storage-vault/.celestia-bridge
2024-10-23T18:46:40.159Z    INFO    node    nodebuilder/module.go:27    Accessing keyring...
2024-10-23T18:46:40.196Z    INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:171   All 163 tables opened in 4ms

2024-10-23T18:46:40.201Z    INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/discard.go:66   Discard stats nextEmptySlot: 2979

2024-10-23T18:46:40.201Z    INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:368   Set nextTxnTs to 108712042
2024-10-23T18:46:40.562Z    INFO    module/p2p  p2p/tls.go:54   using a default ws transport
2024/10/23 18:46:40 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
2024-10-23T18:46:43.881Z    INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:118   2631 tables out of 20890 opened in 3.002s

2024-10-23T18:46:46.882Z    INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:118   5222 tables out of 20890 opened in 6.004s

2024-10-23T18:46:49.882Z    INFO    badger4 v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:118   7676 tables out of 20890 opened in 9.004s

2024-10-23T18:46:50.570Z    INFO    basichost   basic/natmgr.go:112 DiscoverNAT error:no NAT found
panic: runtime error: index out of range [3] with length 0 [recovered]
    panic: 
    == Recovering from initIndex crash ==
    File Info: [ID: 9310013, Size: 502439, Zeros: 502439]
    isEnrypted: false checksumLen: 0 checksum:  indexLen: 0 index: [] 
    == Recovered ==

goroutine 12328 [running]:
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1.1()
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/table/table.go:354 +0x9c
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest.func1()
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/table/table.go:400 +0xa5
panic({0x2ab2460?, 0xc001a06510?})
    /usr/local/go/src/runtime/panic.go:785 +0x132
github.com/google/flatbuffers/go.GetInt32(...)
    /home/user/go/pkg/mod/github.com/google/flatbuffers@v1.12.1/go/encode.go:85
github.com/google/flatbuffers/go.GetUOffsetT(...)
    /home/user/go/pkg/mod/github.com/google/flatbuffers@v1.12.1/go/encode.go:121
github.com/dgraph-io/badger/v4/fb.GetRootAsTableIndex(...)
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/fb/TableIndex.go:14
github.com/dgraph-io/badger/v4/table.(*Table).readTableIndex(0xc002d92480)
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/table/table.go:708 +0x154
github.com/dgraph-io/badger/v4/table.(*Table).initIndex(0xc002d92480)
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/table/table.go:465 +0x19b
github.com/dgraph-io/badger/v4/table.(*Table).initBiggestAndSmallest(0xc002d92480)
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/table/table.go:404 +0x6a
github.com/dgraph-io/badger/v4/table.OpenTable(0xc0025a9260, {0x0, 0x1, 0x200000, 0x0, 0x0, 0x3f847ae147ae147b, 0x1000, 0x0, 0x0, ...})
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/table/table.go:310 +0x23d
github.com/dgraph-io/badger/v4.newLevelsController.func1({0xc005276940, 0x3a}, {0x50?, 0xc011394780?, 0x6aa240?})
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:150 +0x1d9
created by github.com/dgraph-io/badger/v4.newLevelsController in goroutine 1
    /home/user/go/pkg/mod/github.com/dgraph-io/badger/v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:129 +0x565

Oct 23 18:56:34 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:34.825Z        INFO        badger4        v4@v4.2.1-0.20240106094458-1c417aa3799c/levels.go:171        All 164 tables opened in 4ms
Oct 23 18:56:34 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:34.826Z        INFO        badger4        v4@v4.2.1-0.20240106094458-1c417aa3799c/discard.go:66        Discard stats nextEmptySlot: 3015
Oct 23 18:56:34 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:34.826Z        INFO        badger4        v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:368        Set nextTxnTs to 108727503
Oct 23 18:56:34 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:34.995Z        INFO        module/p2p        p2p/tls.go:54        using a default ws transport
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: 2024/10/23 18:56:35 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 7168 kiB, got: 416 kiB). See https://github.com/quic-go/quic-go/wiki/UDP-Buffer-Sizes for details.
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:35.289Z        ERROR        badger4        v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:263        Received err: Opening table: "/storage-vault/.celestia-bridge/inverted_index/9309990.sst" error: failed to initialize table error: failed to read index. error: failed to verify checksum for table: /storage-vault/.celestia-bridge/inverted_index/9309990.sst error: actual: 1265260451, expected: 48412823 error: checksum mismatch
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: github.com/dgraph-io/badger/v4/y.init
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]:         <autogenerated>:1
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: runtime.doInit1
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/proc.go:7290
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: runtime.doInit
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/proc.go:7257
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: runtime.main
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/proc.go:254
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: runtime.goexit
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/asm_amd64.s:1700. Cleaning up...
Oct 23 18:56:35 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:35.290Z        INFO        badger4        v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:546        Lifetime L0 stalled for: 0s
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:39.786Z        INFO        badger4        v4@v4.2.1-0.20240106094458-1c417aa3799c/db.go:625
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 0 [ ]: NumTables: 00. Size: 0 B of 0 B. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 16 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 1 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 2 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 3 [ ]: NumTables: 00. Size: 0 B of 10 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 2.0 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 4 [ ]: NumTables: 00. Size: 0 B of 11 MiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 4.0 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 5 [B]: NumTables: 18. Size: 74 MiB of 108 MiB. Score: 0.00->0.00 StaleData: 686 KiB Target FileSize: 8.0 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level 6 [ ]: NumTables: 144. Size: 1.0 GiB of 1.0 GiB. Score: 0.00->0.00 StaleData: 0 B Target FileSize: 16 MiB
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Level Done
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:39.789Z        ERROR        peerstore/ds        pstoreds/keybook.go:81        error while updating pubkey in datastore for peer 12D3KooWRH8GqxDAbFgcor9bcs6nAvT7SVk54Yd7fB3M8BsMsDxg: datastore closed
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: Error: could not build arguments for function "github.com/celestiaorg/celestia-node/nodebuilder/share".shrexServerComponents.func1 (/home/cosmos/celestia-node/nodebuilder/share/module.go:156): failed to build *shrexeds.Server: could not build arguments for function "reflect".makeFuncStub (/usr/local/go/src/reflect/asm_amd64.s:28): failed to build *eds.Store: received non-nil error from function "reflect".makeFuncStub (/usr/local/go/src/reflect/asm_amd64.s:28): failed to create index: can't open Badger Datastore: Opening table: "/storage-vault/.celestia-bridge/inverted_index/9309990.sst" error: failed to initialize table error: failed to read index. error: failed to verify checksum for table: /storage-vault/.celestia-bridge/inverted_index/9309990.sst error: actual: 1265260451, expected: 48412823 error: checksum mismatch
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: github.com/dgraph-io/badger/v4/y.init
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]:         <autogenerated>:1
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: runtime.doInit1
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/proc.go:7290
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: runtime.doInit
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/proc.go:7257
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: runtime.main
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/proc.go:254
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: runtime.goexit
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]:         /usr/local/go/src/runtime/asm_amd64.s:1700
Oct 23 18:56:39 m29593.contaboserver.net bash[779971]: 2024-10-23T18:56:39.789Z        ERROR        peerstore/ds        pstoreds/keybook.go:81        error while updating pubkey in datastore for peer 12D3KooWEV8gPcFACEwV94g6ftREXykfvg8Ez97z9CDVvp4Bu8px: datastore closed

Is the node "stuck"? Has it stopped syncing?

The log indicates a runtime panic that occurred while using BadgerDB, specifically during the initialization of BadgerDB tables, causing the binary to crash.

Notes

We can easily provide additional logs if necessary. Should you require more detailed information or further diagnostics to help resolve the issue, just let us know, and we’ll be happy to share any logs that can assist with troubleshooting.

Wondertan commented 1 month ago

Did the panic happen during the restart or during regular operation?

This panic happened somewhere deep in the DB we use. My current hypothesis is that the node was ungracefully terminated causing DB corruption

Validatus commented 1 month ago

Did the panic happen during the restart or during regular operation? During "regular operation".

This panic happened somewhere deep in the DB we use. My current hypothesis is that the node was ungracefully terminated causing DB corruption

We share the same working hypothesis. Although the datacenter has denied any sudden power loss or server shutdowns, we’ve observed that during ongoing construction work at the site, these issues tend to occur intermittently. This suggests that the node might have been ungracefully terminated, potentially leading to the database corruption.

We’re currently working to reproduce the error to confirm this theory and will update you once we have more information.