filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Running out of disk space causes blockchain disk corruption #9292

Closed ArseniiPetrovich closed 2 years ago

ArseniiPetrovich commented 2 years ago

Lotus Version

lotus version 1.17.2-dev+calibnet+git.29fff4f

Describe the Bug

One of our archival nodes on the calibration network (calibnet) recently ran out of disk space. It was running 1.16.0, and when we restarted it, it failed with the following error:

2022-09-12T16:56:48.469Z    WARN    modules modules/chain.go:89 loading chain state from disk: loading tipset: get block bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k: ipld: could not find bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x20fa0f9]

I tried upgrading to 1.17, as suggested in https://github.com/filecoin-project/lotus/issues/8916, but it didn't help. Is there any way to recover from this condition? Thank you!

Logging Information

022-09-12T16:56:46.115Z INFO    badger  v2@v2.2007.3/levels.go:183  All 0 tables opened in 0s

2022-09-12T16:56:46.116Z    INFO    badger  v2@v2.2007.3/value.go:1158  Replaying file id: 0 at offset: 0

2022-09-12T16:56:46.116Z    INFO    badger  v2@v2.2007.3/value.go:1178  Replay took: 3.572µs

2022-09-12T16:56:46.126Z    INFO    badger  v2@v2.2007.3/levels.go:183  All 0 tables opened in 0s

2022-09-12T16:56:46.128Z    INFO    badger  v2@v2.2007.3/value.go:1158  Replaying file id: 0 at offset: 0

2022-09-12T16:56:46.128Z    INFO    badger  v2@v2.2007.3/value.go:1178  Replay took: 3.369µs

ERROR: cannot dial address ws://0.0.0.0:1234/rpc/v0 for dial tcp 0.0.0.0:1234: connect: connection refused: dial tcp 0.0.0.0:1234: connect: connection refused

2022-09-12T16:56:48.022Z    INFO    badgerbs    v2@v2.2007.3/levels.go:183  All 144 tables opened in 1.88s

2022-09-12T16:56:48.239Z    INFO    badgerbs    v2@v2.2007.3/value.go:1158  Replaying file id: 186 at offset: 97039571

2022-09-12T16:56:48.464Z    INFO    badgerbs    v2@v2.2007.3/value.go:1178  Replay took: 225.549956ms

2022-09-12T16:56:48.469Z    WARN    modules modules/chain.go:89 loading chain state from disk: loading tipset: get block bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k: ipld: could not find bafy2bzacea256lxobib67owqvinrkeqd5qic6p4crsyyfblnjg6penm4h4y6k
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x20fa0f9]

goroutine 1 [running]:
github.com/filecoin-project/lotus/chain/types.(*TipSet).ParentState(...)
    /go/lotus/chain/types/tipset.go:223
github.com/filecoin-project/lotus/node/modules.NetworkName({0x7f47042b4800?, 0xc0008ed3b0?}, {0x4b4cca0?, 0xc000011380?}, 0x6?, {0x4b539f0, 0x6874aa0}, 0xc003425080?, {0xc000389180, 0xd, ...}, ...)
    /go/lotus/node/modules/chain.go:131 +0xd9
reflect.Value.call({0x3a9d6e0?, 0x4869160?, 0x2?}, {0x3db45ef, 0x4}, {0xc0008986e0, 0x7, 0x203000?})
    /usr/local/go/src/reflect/value.go:556 +0x845
reflect.Value.Call({0x3a9d6e0?, 0x4869160?, 0x6727a5?}, {0xc0008986e0, 0x7, 0x7})
    /usr/local/go/src/reflect/value.go:339 +0xbf
github.com/filecoin-project/lotus/node.as.func2({0xc0008986e0?, 0x3a052c0?, 0x10?})
    /go/lotus/node/options.go:140 +0xf0
reflect.Value.call({0x3a9d6e0?, 0xc000534930?, 0x6727a5?}, {0x3db45ef, 0x4}, {0xc0008c8370, 0x7, 0x30?})
    /usr/local/go/src/reflect/value.go:556 +0x845
reflect.Value.Call({0x3a9d6e0?, 0xc000534930?, 0x672b07?}, {0xc0008c8370, 0x7, 0x7})
    /usr/local/go/src/reflect/value.go:339 +0xbf
go.uber.org/dig.defaultInvoker({0x3a9d6e0?, 0xc000534930?, 0xc0004e8e70?}, {0xc0008c8370?, 0x7?, 0x4b6fd58?})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:355 +0x28
go.uber.org/dig.(*node).Call(0xc00083a140, {0x4b6fd58?, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:806 +0x259
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x39180c0}}, {0x4b6fd58, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x3a9d7e0}, {0xc0004e8cb0, 0x7, 0x7}}, {0x4b6fd58, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*node).Call(0xc00080db80, {0x4b6fd58?, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:797 +0xff
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x3b2bf20}}, {0x4b6fd58, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x3986300}, {0xc0004c6e40, 0x2, 0x2}}, {0x4b6fd58, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*node).Call(0xc00080cd20, {0x4b6fd58?, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:797 +0xff
go.uber.org/dig.paramSingle.Build({{0x0, 0x0}, 0x0, {0x4b7da88, 0x3c06c40}}, {0x4b6fd58, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:245 +0x242
go.uber.org/dig.paramList.BuildList({{0x4b7da88, 0x394b380}, {0xc00041e7a0, 0x1, 0x1}}, {0x4b6fd58, 0xc003232af0})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/param.go:201 +0xb9
go.uber.org/dig.(*Container).Invoke(0xc003232af0, {0x394b380?, 0xc0012bf600}, {0x189153a?, 0x1?, 0x1?})
    /go/pkg/mod/go.uber.org/dig@v1.12.0/dig.go:503 +0x2b9
go.uber.org/fx.(*App).executeInvoke(0xc00346dad0, {{0x394b380, 0xc0012bf600}, {0xc003457a40, 0x7, 0x8}})
    /go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:964 +0x39f
go.uber.org/fx.(*App).executeInvokes(...)
    /go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:929
go.uber.org/fx.New({0xc000541458, 0x3, 0x1c?})
    /go/pkg/mod/go.uber.org/fx@v1.15.0/app.go:596 +0xa4b
github.com/filecoin-project/lotus/node.New({0x4b60f58, 0xc000c669f0}, {0xc0032325f0, 0x9, 0x9})
    /go/lotus/node/builder.go:361 +0x477
main.glob..func5(0xc000c68700)
    /go/lotus/cmd/lotus/daemon.go:317 +0x1609
github.com/urfave/cli/v2.(*App).RunAsSubcommand(0xc000583ba0, 0xc000c68200)
    /go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:495 +0xaff
github.com/urfave/cli/v2.(*Command).startApp(0x64ee5e0, 0xc000c68200)
    /go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/command.go:287 +0x77b
github.com/urfave/cli/v2.(*Command).Run(0xc0000cc140?, 0xc0000cc140?)
    /go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/command.go:95 +0xba
github.com/urfave/cli/v2.(*App).RunContext(0xc000583860, {0x4b60ee8?, 0xc000128000}, {0xc000126000, 0x2, 0x2})
    /go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:341 +0xbc8
github.com/urfave/cli/v2.(*App).Run(...)
    /go/pkg/mod/github.com/urfave/cli/v2@v2.8.1/app.go:247
github.com/filecoin-project/lotus/cli.RunApp(0x39efa00?)
    /go/lotus/cli/helper.go:35 +0x4e
main.main()
    /go/lotus/cmd/lotus/main.go:111 +0x90c

Repo Steps

  1. Run lotus
  2. Run out of disk space
  3. See error

TippyFlitsUK commented 2 years ago

Can you elaborate on why you see this as an issue, @ArseniiPetrovich? It does not surprise me that running out of chain disk space results in chain corruption; disk space needs to be monitored to avoid this. It can also be easily resolved by importing a new lightweight snapshot.

ArseniiPetrovich commented 2 years ago

@TippyFlitsUK it's not so easy for archival nodes that hold all of the chain state :) Sure, disk space needs to be monitored, and it's purely our fault that we overlooked this alert in our systems. However, chain corruption caused by a lack of disk space should still be considered a bug, at least from my point of view, whether or not it is a "surprise", because it lets even a simple mistake have serious consequences. Can't we verify the available space before writing, or at least provide some kind of recovery tool that lets you roll back several blocks behind the chain head and resync?

TippyFlitsUK commented 2 years ago

Thanks for the clarification, @ArseniiPetrovich! Agreed that this presents a far bigger problem for archival nodes. I don't agree that it represents a bug, though. Can you please file a new ticket using the enhancement request form and provide the additional info requested? Many thanks! :pray: