filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Lotus daemon being killed and causing websocket errors #10050

Closed davidgasquez closed 3 weeks ago

davidgasquez commented 1 year ago

Checklist

Lotus component

Lotus Version

Daemon `v1.18.0`-

Describe the Bug

We are running some jobs on non-bootstrapped nodes using Lily. Some of these nodes regularly log the following message:

{"level":"debug","ts":"2023-01-02T23:55:04.987Z","logger":"rpc","caller":"go-jsonrpc@v0.1.8/websocket.go:624","msg":"websocket error","error":"websocket: close 1000 (normal)"}
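
For anyone triaging similar logs, here is a minimal sketch that flags this kind of entry. It assumes each log line is a complete JSON object with the fields shown in the message above; `logEntry` and `isNormalWebsocketClose` are hypothetical names, not part of Lotus, Lily, or go-jsonrpc. Close code 1000 is the normal-closure code in RFC 6455, so a line like this is plausibly a symptom of the peer going away rather than the cause of the crash.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// logEntry mirrors only the fields visible in the log line from this report.
type logEntry struct {
	Level  string `json:"level"`
	TS     string `json:"ts"`
	Logger string `json:"logger"`
	Caller string `json:"caller"`
	Msg    string `json:"msg"`
	Error  string `json:"error"`
}

// isNormalWebsocketClose reports whether a JSON log line is an rpc
// "websocket error" entry whose error text mentions close code 1000.
// It returns an error if the line is not valid JSON.
func isNormalWebsocketClose(line string) (bool, error) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return false, err
	}
	return e.Logger == "rpc" &&
		e.Msg == "websocket error" &&
		strings.Contains(e.Error, "close 1000"), nil
}

func main() {
	line := `{"level":"debug","ts":"2023-01-02T23:55:04.987Z","logger":"rpc","caller":"go-jsonrpc@v0.1.8/websocket.go:624","msg":"websocket error","error":"websocket: close 1000 (normal)"}`
	ok, err := isNormalWebsocketClose(line)
	fmt.Println(ok, err)
}
```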

When this happens, the Lily daemon is killed. I think @TippyFlitsUK has been able to reproduce it in Lotus.

This also appears in the Lily logs:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xd4cbda]

goroutine 28318710 [running]:
github.com/filecoin-project/go-amt-ipld/v2.(*Node).forEachAt(0xc09bc7a510, {0x56150f0?, 0xc0001a2000?}, {0x7fe57c6d2910?, 0xc0d461e140?}, 0xc696ac6000?, 0x0, 0x0, 0xc5ef75dba8)
        /go/pkg/mod/github.com/filecoin-project/go-amt-ipld/v2@v2.1.1-0.20201006184820-924ee87a1349/amt.go:270 +0x19a
github.com/filecoin-project/go-amt-ipld/v2.(*Root).ForEach(...)
        /go/pkg/mod/github.com/filecoin-project/go-amt-ipld/v2@v2.1.1-0.20201006184820-924ee87a1349/amt.go:257
github.com/filecoin-project/specs-actors/v2/actors/util/adt.(*Array).ForEach(0xc696989848?, {0x55fbfa0?, 0xc696ac6000?}, 0xc5ef75dc08?)
        /go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.6/actors/util/adt/array.go:81 +0xc7
github.com/filecoin-project/lily/chain/actors/builtin/miner.(*deadline2).ForEachPartition(0xc689bcd550, 0xd763f68d80)
        /build/lily/chain/actors/builtin/miner/v2.go:535 +0xc7
github.com/filecoin-project/lily/tasks/actorstate/miner.LoadSectorState.func1(0x4047220?, {0x5619700, 0xc689bcd550})
        /build/lily/tasks/actorstate/miner/sector_events.go:302 +0xd8
github.com/filecoin-project/lily/chain/actors/builtin/miner.(*state2).ForEachDeadline.func1(0xc001566900?, 0xc40cd01860)
        /build/lily/chain/actors/builtin/miner/v2.go:345 +0xe2
github.com/filecoin-project/specs-actors/v2/actors/builtin/miner.(*Deadlines).ForEach(0xc08f360460?, {0x7fe64427ff48, 0xc0d461e140}, 0xc5ef75dd30)
        /go/pkg/mod/github.com/filecoin-project/specs-actors/v2@v2.3.6/actors/builtin/miner/deadline_state.go:89 +0x7c
github.com/filecoin-project/lily/chain/actors/builtin/miner.(*state2).ForEachDeadline(0xc08f360460, 0xd48934ede0)
        /build/lily/chain/actors/builtin/miner/v2.go:344 +0xd6
github.com/filecoin-project/lily/tasks/actorstate/miner.LoadSectorState({0x56150b8, 0xc6079230c0}, {0x5639c30, 0xc08f360460})
        /build/lily/tasks/actorstate/miner/sector_events.go:301 +0x222
github.com/filecoin-project/lily/tasks/actorstate/miner.DiffMinerSectorStates.func2()
        /build/lily/tasks/actorstate/miner/sector_events.go:383 +0x5f
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /go/pkg/mod/golang.org/x/sync@v0.0.0-20220722155255-886fb9371eb4/errgroup/errgroup.go:72 +0xa5

@rvagg left a comment on that in the original issue: https://github.com/filecoin-project/filet/issues/22#issuecomment-1383295745.
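
The trace above shows the panic happening inside a goroutine started through `errgroup`. In Go, an unrecovered panic in any goroutine terminates the whole process, which would be consistent with the daemon dying. The sketch below illustrates that mechanism and one way to contain it. It is not Lily's or Lotus's code: `node`, `walk`, `recoverToError`, and `runGuarded` are hypothetical names, and it uses `sync.WaitGroup` instead of `errgroup` so that it only needs the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// node is a hypothetical stand-in for a tree node that may be nil.
type node struct{ height int }

// walk dereferences n without a nil check, so a nil n panics with
// "invalid memory address or nil pointer dereference".
func walk(n *node) int { return n.height }

// recoverToError runs fn and converts a panic into an ordinary error.
func recoverToError(fn func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered panic: %v", r)
		}
	}()
	return fn()
}

// runGuarded runs fn in its own goroutine, waits for it to finish, and
// returns either fn's error or the error built from a recovered panic.
func runGuarded(fn func() error) error {
	var wg sync.WaitGroup
	var err error
	wg.Add(1)
	go func() {
		defer wg.Done()
		err = recoverToError(fn)
	}()
	wg.Wait()
	return err
}

func main() {
	err := runGuarded(func() error {
		_ = walk(nil) // without the recover, this would crash the process
		return nil
	})
	fmt.Println("task failed without killing the process:", err)
}
```

Recovering like this only turns the crash into a failed task. It does not explain why the node was nil in the first place.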

Logging Information

Shared before.

Repro Steps

  1. Run '...'
  2. Do '...'
  3. See error '...' ...

TippyFlitsUK commented 1 year ago

Many thanks @davidgasquez

I can confirm that I was able to reproduce this error locally. I initially suspected a resource limit, but the issue is still present after testing on a 64-core, 1 TiB RAM system.

davidgasquez commented 1 year ago

Is there anything I can try or do to add more information to this one? We're running Lily to index the chain, and this error causes jobs to fail. Recent jobs are bigger, so they are more likely to fail due to this error.

frrist commented 1 year ago

@TippyFlitsUK could you share the steps you followed to reproduce this?

TippyFlitsUK commented 1 year ago

Hey Forrest 👋 I simply ran the following command on servers with different specs: `lotus chain export --recent-stateroots=2880 test.chain`

davidgasquez commented 1 year ago

Any updates on this one @TippyFlitsUK? Basically, I'm curious whether this is something that will get worked on soon. 😅 No worries if not, as Forrest is patching things up on Lily's side, but I still wanted to check.