Closed: benjaminh83 closed this issue 3 years ago.
I have the full log file for the complete miner run - start to crash (if needed by the devs)
@benjaminh83 thanks for bug report. Please do post the full log file here.
I think what's happening is:
1. The data transfer is cancelled (or completes with an error).
2. The transfer's CARv2 read-write blockstore is finalized.
3. graphsync's event queue still has messages pending for that request.
4. graphsync reads from the now-finalized blockstore and hits the panic.
Step 4 could be because eg a block has arrived while the cancel message is being processed, and graphsync does a read on the blockstore to check if the block is a duplicate.
Currently there's no way to know when the event queue has completely drained after cancelling the graphsync context. So at the moment we already have hacks in place to wait 100ms after cancelling the graphsync context when restarting a data transfer.
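For illustration only, here is a minimal sketch of that kind of workaround; `restartTransfer` and its arguments are hypothetical, not the actual go-data-transfer code:

```go
package example

import (
	"context"
	"time"
)

// restartTransfer illustrates the workaround described above: cancel the
// graphsync context, then wait a fixed 100ms in the hope that the event
// queue has drained before the blockstore is finalized or reused.
func restartTransfer(cancelGraphsync context.CancelFunc, finalizeStore func() error) error {
	cancelGraphsync()
	// There is no "event queue drained" signal, so sleep and hope.
	time.Sleep(100 * time.Millisecond)
	return finalizeStore()
}
```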
I suggest we add a Cancel method to the graphsync interface that cancels the request and drains the event queues etc. before returning. That would allow us to take care of this panic and also get rid of these hacks: https://github.com/ipfs/go-graphsync/issues/184
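As a rough sketch of what that could look like (the type names and signature below are illustrative, not the actual go-graphsync API):

```go
package example

import "context"

// RequestID identifies an in-flight graphsync request (stand-in type).
type RequestID int32

// Exchange is a trimmed stand-in for the graphsync interface under discussion.
type Exchange interface {
	// Cancel stops the given request and blocks until its event queues have
	// drained, so the caller can safely finalize the underlying blockstore
	// without racing a pending block read.
	Cancel(ctx context.Context, id RequestID) error
}
```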
I had the same error message occur overnight on my split miner lotus-miner process (the Miner/Storage/Sealing miner stayed up and operational).
2021-08-01T22:24:47.178-0400 WARN dt_graphsync graphsync/graphsync.go:209 graphsync error: Request Failed - Unknown Reason
2021-08-01T22:24:47.178-0400 INFO graphsync requestmanager/requestmanager.go:420 graphsync request complete {"request id": 11, "peer": "12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien", "total time": 2149.050854644}
panic: must not use a read-write blockstore after finalizing
goroutine 711 [running]:
github.com/ipld/go-car/v2/blockstore.(*ReadWrite).panicIfFinalized(...)
/home/tbtminer/go/pkg/mod/github.com/ipld/go-car/v2@v2.0.0-beta1.0.20210721090610-5a9d1b217d25/blockstore/readwrite.go:289
github.com/ipld/go-car/v2/blockstore.(*ReadWrite).Get(0xc02a1be8f0, 0xc0a04c2ff0, 0x24, 0x18, 0xc014387940, 0x563fa6, 0xc0383a8fa8)
/home/tbtminer/go/pkg/mod/github.com/ipld/go-car/v2@v2.0.0-beta1.0.20210721090610-5a9d1b217d25/blockstore/readwrite.go:377 +0x9f
github.com/ipfs/go-graphsync/storeutil.LoaderForBlockstore.func1(0x3862b10, 0xc031649420, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3862b10, ...)
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/storeutil/storeutil.go:22 +0x5e
github.com/ipfs/go-graphsync/requestmanager/asyncloader.setupAttemptQueue.func1(0xc00000000b, 0x3862b10, 0xc031649420, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:359 +0x19b
github.com/ipfs/go-graphsync/requestmanager/asyncloader/loadattemptqueue.(*LoadAttemptQueue).AttemptLoad(0xc01e1017c0, 0xc00000000b, 0x3862b10, 0xc031649420, 0xc029be76e0, 0x1)
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/loadattemptqueue/loadattemptqueue.go:49 +0x63
github.com/ipfs/go-graphsync/requestmanager/asyncloader/loadattemptqueue.(*LoadAttemptQueue).RetryLoads(0xc01e1017c0)
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/loadattemptqueue/loadattemptqueue.go:83 +0xa8
github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*newResponsesAvailableMessage).handle(0xc017927ce0, 0xc01410a780)
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:336 +0x396
github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*AsyncLoader).run(0xc01410a780)
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:201 +0x44
created by github.com/ipfs/go-graphsync/requestmanager/asyncloader.(*AsyncLoader).Startup
/home/tbtminer/go/pkg/mod/github.com/ipfs/go-graphsync@v0.6.6/requestmanager/asyncloader/asyncloader.go:68 +0x65
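For context on the panic itself, this is a minimal self-contained sketch of the pattern the trace is hitting: a read-write store that becomes read-only on Finalize and then refuses any further use. The names here are illustrative, not the go-car/v2 implementation.

```go
package example

import (
	"errors"
	"sync"
)

type rwStore struct {
	mu        sync.Mutex
	finalized bool
	blocks    map[string][]byte
}

// Finalize marks the store as closed for further use (in go-car/v2 this is
// also the point where the CAR index gets flushed).
func (s *rwStore) Finalize() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.finalized = true
}

// Get refuses to serve reads once the store is finalized, which is the kind
// of guard that produces "must not use a read-write blockstore after finalizing".
func (s *rwStore) Get(key string) ([]byte, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.finalized {
		panic("must not use a read-write blockstore after finalizing")
	}
	b, ok := s.blocks[key]
	if !ok {
		return nil, errors.New("block not found")
	}
	return b, nil
}
```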
I think this is something deal-related, external to our miners. If you look at the time stamps, both Benjamin and I crashed at the same time. I'm on EDT, so it occurred at 2:24 AM UTC; I think Benjamin is an hour or two ahead of UTC and his crashed at 4:24 local time. Also verified that TippyFlits crashed at 2:24 AM UTC.
@Meatball13 please provide your full logs, at least for a couple of hours before the crash
Unfortunately I was half asleep when I noticed it was down and restarted them without keeping the logs. Seems like Stuberman also had the same issue, at the same time, and it might have been related to a big batch of deals that came out to all of us at that time. Discussion ongoing here: https://filecoinproject.slack.com/archives/C029ETPJ6BB/p1627877131340100
Thanks for the report. Resolved in M1.3.4 after merging https://github.com/filecoin-project/go-data-transfer/pull/229 and https://github.com/ipld/go-car/pull/195.
Checklist
Lotus component
lotus miner - mining and block production
Lotus Tag and Version
Describe the Bug
Seems like I experienced some instability with M1.3.3. Basically the lotus-miner process (I only have one, as I'm currently running a monolith) crashed at 4:30 AM this morning, only 1.5h from my first deadline. It seems to have something to do with this, but it is a pretty big problem that the failure leads to a total crash of the miner. I think this is the key event:
Logging Information
Repo Steps
Run M1.3.3 in a monolith setup with multiple incoming (more or less faulty) deals from Estuary.