filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.85k stars 1.27k forks source link

lotus-miner / markets subsystem crash - keywords: 'panic: must not use a read-write blockstore after finalizing' #6968

Closed benjaminh83 closed 3 years ago

benjaminh83 commented 3 years ago

Checklist

Lotus component

lotus miner - mining and block production

Lotus Tag and Version

lotus version
Daemon:  1.11.1-rc1+mainnet+git.3be268182+api1.3.1
Local: lotus version 1.11.1-rc1+mainnet+git.3be268182

lotus-miner version
Daemon:  1.11.1-rc1+mainnet+git.3be268182+api1.2.0
Local: lotus-miner version 1.11.1-rc1+mainnet+git.3be268182

M1.3.3

Describe the Bug

Seems like I experienced some instability with M1.3.3 Basically the lotus-miner process (I only have one, as currently running monolith) crashed at 4.30AM this morning, only 1.5h from my first deadline. It seems to have something to do with this, but it a pretty big problem that the failure leads to a total crash of the miner. I think this is the key event:

2021-08-02T04:24:53.835+0200    WARN    dt_graphsync    graphsync/graphsync.go:209  graphsync error: Request Failed - Unknown Reason
panic: must not use a read-write blockstore after finalizing

Logging Information

2021-08-02T04:24:38.920+0200    INFO    net/identify    identify/id.go:376  failed negotiate identify protocol with peer    {"peer": "12D3KooWPHrBvfpQrVziZVbWFi1FiZodTD7G5AvnvMD4omot8SV9", "error": "stream reset"}
2021-08-02T04:24:43.758+0200    INFO    net/identify    identify/id.go:376  failed negotiate identify protocol with peer    {"peer": "12D3KooWJE7d5eBmCa6vYUhC7GQSF6gw7F9A1SjNG2SydKN87Wgy", "error": "stream reset"}
2021-08-02T04:24:51.053+0200    INFO    net/identify    identify/id.go:376  failed negotiate identify protocol with peer    {"peer": "12D3KooWEpFvqf6pXcejvBryHJDQSkzWdPFVxz54rHqxiJnNerP9", "error": "stream reset"}
2021-08-02T04:24:53.835+0200    INFO    graphsync   requestmanager/requestmanager.go:420    graphsync request complete  {"request id": 33, "peer": "12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien", "total time": 2080.931689718}
2021-08-02T04:24:53.835+0200    WARN    dt_graphsync    graphsync/graphsync.go:209  graphsync error: Request Failed - Unknown Reason
panic: must not use a read-write blockstore after finalizing
goroutine 53395 [running]:
github.com/ipld/go-car/v2/blockstore.(*ReadWrite).panicIfFinalized(...)
    /home/acemgt/go/pkg/mod/github.com/ipld/go-car/v2@v2.0.0-beta1.0.20210721090610-5a9d1b217d25/blockstore/readwrite.go:289
github.com/ipld/go-car/v2/blockstore.(*ReadWrite).Get(0xc026e362c0, 0xc05132fbf0, 0x24, 0x18, 0xc035d23940, 0x563f66, 0xc0242fa8e0)
    /home/acemgt/go/pkg/mod/github.com/ipld/go-car/v2@v2.0.0-beta1.0.20210721090610-5a9d1b217d25/blockstore/readwrite.go:377 +0x9f
github.com/ipfs/go-graphsync/storeutil.LoaderForBlockstore.func1(0x384fb10, 0xc132fcba20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x384fb10, ...)
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/storeutil/storeutil.go:22 +0x5e
github.com/ipfs/go-graphsync/requestmanager/asyncloader.setupAttemptQueue.func1(0xc000000021, 0x384fb10, 0xc132fcba20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:359 +0x19b
github.com/ipfs/go-graphsync/requestmanager/asyncloader/loadattemptqueue.(*LoadAttemptQueue).AttemptLoad(0xc0b2848c40, 0xc000000021, 0x384fb10, 0xc132fcba20, 0xc07f05e9c0, 0x1)
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/loadattemptqueue/loadattemptqueue.go:49 +0x63
github.com/ipfs/go-graphsync/requestmanager/asyncloader/loadattemptqueue.(*LoadAttemptQueue).RetryLoads(0xc0b2848c40)
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/loadattemptqueue/loadattemptqueue.go:83 +0xa8
github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*newResponsesAvailableMessage).handle(0xc0cc1dfc60, 0xc015386600)
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:336 +0x396
github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*AsyncLoader).run(0xc015386600)
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:201 +0x44
created by github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*AsyncLoader).Startup
    /home/acemgt/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:68 +0x65

Repo Steps

Run M1.3.3 in a monolith setup. Multiple incoming (more or less faulty deals from Estuary)

benjaminh83 commented 3 years ago

I have the full log file for the complete miner run - start to crash (if needed by the devs)

dirkmc commented 3 years ago

@benjaminh83 thanks for bug report. Please do post the full log file here.

benjaminh83 commented 3 years ago

miner-m1.3.3-log5.log

dirkmc commented 3 years ago

I think what's happening is

  1. Client sends cancel message to provider
  2. Provider closes context on graphsync request to cancel it (but doesn't wait for request to finish being cancelled)
  3. Provider calls Finalize() on blockstore
  4. Somehow blockstore.Get() is called

Step 4 could be because eg a block has arrived while the cancel message is being processed, and graphsync does a read on the blockstore to check if the block is a duplicate.

Currently there's no way to know when the event queue has completely drained after cancelling the graphsync context. So at the moment we already have hacks in place to wait 100ms after cancelling the graphsync context when restarting a data transfer.

I suggest we add a Cancel method to the graphsync interface, that cancels the request and drains event queues etc before returning. That would allow us to take care of this panic and also get rid of these hacks: https://github.com/ipfs/go-graphsync/issues/184

Meatball13 commented 3 years ago

I had the same error message occur overnight on my split miner lotus-miner process (the Miner/Storage/Sealing miner stayed up and operational.)

2021-08-01T22:24:47.178-0400    WARN    dt_graphsync    graphsync/graphsync.go:209  graphsync error: Request Failed - Unknown Reason
2021-08-01T22:24:47.178-0400    INFO    graphsync   requestmanager/requestmanager.go:420    graphsync request complete  {"request id": 11, "peer": "12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien", "total time": 2149.050854644}
panic: must not use a read-write blockstore after finalizing

goroutine 711 [running]:
github.com/ipld/go-car/v2/blockstore.(*ReadWrite).panicIfFinalized(...)
    /home/tbtminer/go/pkg/mod/github.com/ipld/go-car/v2@v2.0.0-beta1.0.20210721090610-5a9d1b217d25/blockstore/readwrite.go:289
github.com/ipld/go-car/v2/blockstore.(*ReadWrite).Get(0xc02a1be8f0, 0xc0a04c2ff0, 0x24, 0x18, 0xc014387940, 0x563fa6, 0xc0383a8fa8)
    /home/tbtminer/go/pkg/mod/github.com/ipld/go-car/v2@v2.0.0-beta1.0.20210721090610-5a9d1b217d25/blockstore/readwrite.go:377 +0x9f
github.com/ipfs/go-graphsync/storeutil.LoaderForBlockstore.func1(0x3862b10, 0xc031649420, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3862b10, ...)
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/storeutil/storeutil.go:22 +0x5e
github.com/ipfs/go-graphsync/requestmanager/asyncloader.setupAttemptQueue.func1(0xc00000000b, 0x3862b10, 0xc031649420, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:359 +0x19b
github.com/ipfs/go-graphsync/requestmanager/asyncloader/loadattemptqueue.(*LoadAttemptQueue).AttemptLoad(0xc01e1017c0, 0xc00000000b, 0x3862b10, 0xc031649420, 0xc029be76e0, 0x1)
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/loadattemptqueue/loadattemptqueue.go:49 +0x63
github.com/ipfs/go-graphsync/requestmanager/asyncloader/loadattemptqueue.(*LoadAttemptQueue).RetryLoads(0xc01e1017c0)
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/loadattemptqueue/loadattemptqueue.go:83 +0xa8
github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*newResponsesAvailableMessage).handle(0xc017927ce0, 0xc01410a780)
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:336 +0x396
github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*AsyncLoader).run(0xc01410a780)
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:201 +0x44
created by github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*AsyncLoader).Startup
    /home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:68 +0x65
Meatball13 commented 3 years ago

Think this is something deal related external to our miners. If you look at the time stamps, both Benjamin and I crashed at the same time. I'm EDT, so it occurred at 2:24 AM UTC, I think Benjamin is an hour or two ahead of UTC and his crashed at 4:24 Local Time. Also verified that TippyFlits crashed at 2:24 AM UTC.

dirkmc commented 3 years ago

@Meatball13 please provide your full logs, at least for a couple of hours before the crash

Meatball13 commented 3 years ago

Unfortunately I was half asleep when I noticed it was down and restarted them without keeping the logs. Seems like Stuberman also had the same issue, at the same time, and it might have been related to a big batch of deals that came out to all of us at that time. Discussion ongoing here: https://filecoinproject.slack.com/archives/C029ETPJ6BB/p1627877131340100

raulk commented 3 years ago

Thanks for the report. Resolved in m1.3.4 after merging https://github.com/filecoin-project/go-data-transfer/pull/229 and https://github.com/ipld/go-car/pull/195.