Closed jennijuju closed 3 years ago
Miner logs:
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.993Z INFO sectors storage-sealing/input.go:241 Adding piece for deal 2295942 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.994Z INFO sectors storage-sealing/input.go:241 Adding piece for deal 2295943 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z ERROR sectors storage-sealing/input.go:383 sector {1049918 535} rejected deal bafy2bzaceajex5a22qxtcjcxzo2qf6eo5c7th5kf4w6d6bumheycwq2dbfn3k: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z WARN rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279 error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z INFO sectors storage-sealing/input.go:241 Adding piece for deal 2295944 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z ERROR rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321 error and res returned {"request": {"jsonrpc":"2.0","id":32,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z ERROR sectors storage-sealing/input.go:383 sector {1049918 535} rejected deal bafy2bzacecr7inctrkkhmhcp6q4q55bpb45scrmrjevhqkif7oasxszpavnli: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z INFO sectors storage-sealing/input.go:241 Adding piece for deal 2295945 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z WARN rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279 error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z INFO sectors storage-sealing/input.go:241 Adding piece for deal 2295941 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z ERROR rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321 error and res returned {"request": {"jsonrpc":"2.0","id":33,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z ERROR sectors storage-sealing/input.go:383 sector {1049918 535} rejected deal bafy2bzacedjhqomttwlkb43us5dhmn4sch7ocnk6hcjx65gsejlt3qezahsrk: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z WARN rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279 error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z ERROR rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321 error and res returned {"request": {"jsonrpc":"2.0","id":34,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z ERROR sectors storage-sealing/input.go:383 sector {1049918 535} rejected deal bafy2bzacecoha5l5ahdp4ho42ctga7acfq5leuxarhl74gcecizswyko3ex4q: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z WARN rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279 error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]: github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]: /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z ERROR rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321 error and res returned {"request": {"jsonrpc":"2.0","id":35,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.998Z WARN advmgr sector-storage/manager.go:279 stub NewSector
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.998Z INFO sectors storage-sealing/input.go:424 Creating sector {"number": "539", "type": "deal", "proofType": 8}
@jacobheun @raulk
After some digging, here's what I have:
The miner's sealing FSM seems to have entered an invalid state, and AddPiece calls to it are failing. The miner logs above keep showing the message
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z WARN rpc go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279 error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
and
sector {1049918 535} rejected deal bafy2bzaceajex5a22qxtcjcxzo2qf6eo5c7th5kf4w6d6bumheycwq2dbfn3k: normal shutdown of state machi>
This could happen if the Miner was shutting down, for example, though I'd appreciate it if @magik6k has more comments on why else this could happen.
Markets makes an RPC call to the Miner to add a Piece to a Sector and, as part of that call, hands it the padded CARv1 reader. That happens in the Markets code here. The paddedReader variable there is the padded CARv1 reader that we pass to the Miner, which reads the deal data from it and adds the piece to the Sector. Note that we're also seeing those add calls fail, as shown by the
Aug 18 03:17:33 c2 lotus-miner-x[141673]: 2021-08-18T03:17:33.681Z ERROR storageadapter storageadapter/provider.go:114 failed to addPiece for deal 2294107, err: normal shutdown of state machine
log lines in the Markets logs. This is in line with all SectorAddPieceToAny calls in the Miner failing.
One hypothesis, based on the stack trace we have in the issue, is that Markets lost the HTTP connection to the Miner process while copying the CARv1 reader data over to the Miner, and this somehow led to a panic. I don't know enough about the internals of the JSON-RPC API, or about how passing a Reader over an RPC call works, to dig any deeper.
I'd love to have @magik6k's eyes on this one.
What seems to be happening here is:
HandoffDeal is going down the path of using the inbound CARv2 file: https://github.com/filecoin-project/go-fil-markets/blob/d3de422f4386fb444c5719e5e2297232d7aaa3dc/storagemarket/impl/providerstates/provider_states.go#L320. We open that file using the carv2 library, which mmaps the file. That's where the mmap entries come from in the stacktrace.
I think we're seeing 3.i.
Related fix, but non-causal: https://github.com/filecoin-project/lotus/pull/7135
It fixes the "error and res returned" warnings shown in the miner log.
Trying to figure out what circumstance leads to the "normal shutdown of state machine" error on the miner side, which then ends up interrupting the piece transfer and causing this panic on the markets node.
This message is ambiguous.
What's happening is likely the latter.
Copying from @magik6k:
So one issue is that we don't have any handling logic in storage-fsm for the AddPieceFailed state.
Another one is that, probably, when those sectors are in the AddPiece state, they can be assigned new pieces, and can still get additional SectorAddPiece events, which we do for increased packing efficiency. But when the sector gets into AddPieceFailed, we don't unregister the sector from Sealing.openSectors (the map of sectors which are currently accepting deal data). This normally happens at the top of handlePacking (the state handler for the Packing state, which is the first stage of the sealing pipeline and adds all the null padding required; but when AddPiece fails we don't ever get there).
Because the sector is still in the openSectors map while in AddPieceFailed, it will get SectorAddPiece events, which will result in statemachine shutdown (because AddPieceFailed doesn't expect any events).
There are two things to debug here:
SectorAddPieceToAny returned? That's what led the markets process to keep reading after closure.
This issue focuses on 2.
The reporter was using Lotus m1.3.5, which uses go-car/v2.0.2, which DOES NOT PANIC on reading from the SectionReader after closure.
In fact, I just discovered that reading from an mmap after closure will never panic, and will instead return an "mmap: closed" error.
Therefore I'm almost certain that this is a corrupted CAR. I'm going to ask the miner to try to send us the CAR.
EDIT: unfortunately we are using fstmp files in m1.3.5, so there's no way to identify which file was the offender :-(
So this is certainly a bad DataOffset or DataSize in the CARv2 header. This could happen because:
TerminateBlockstore in go-fil-markets, which we only call when failing a deal. And I don't see journal events or log entries for deal failure.
Possibility 3 is discarded. Even if the file is deleted while mmapped, the file descriptor stays open and it stays accessible. Test:
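package mmaptest // package clause added so the snippet compiles; the name is arbitrary

import (
	"os"
	"path/filepath"
	"testing"

	"github.com/stretchr/testify/require"
	"golang.org/x/exp/mmap"
)

// TestMmapDelete checks that an mmapped file stays readable after it is deleted.
func TestMmapDelete(t *testing.T) {
	dir := t.TempDir()
	b := make([]byte, 16386)
	path := filepath.Join(dir, "foo.tst")
	err := os.WriteFile(path, b, 0666)
	require.NoError(t, err)

	ra, err := mmap.Open(path)
	require.NoError(t, err)
	defer ra.Close()

	// Read once while the file still exists on disk.
	n, err := ra.ReadAt(b[:1], 1)
	require.NoError(t, err)
	require.EqualValues(t, 1, n)

	// Delete the file while it is still mapped.
	err = os.Remove(path)
	require.NoError(t, err)

	// Reads keep succeeding after the deletion.
	n, err = ra.ReadAt(b[:1], 100)
	require.NoError(t, err)
	require.EqualValues(t, 1, n)
}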
Possibility 1 is discarded. Looking through the journal events, the deal that appears to trigger the segmentation fault has a "verified" event:
{"System":"markets/storage/provider","Event":"state_change","Timestamp":"2021-08-15T23:24:38.53053366Z","Data":{"Event":"ProviderEventVerifiedData","Deal":{"Proposal":{"PieceCID":{"/":"baga6ea4seaqmupgkvwfjw4ohkh4lryjirssq5smawnux2d62qqgn6kxqu4shumi"},"PieceSize":1073741824,"VerifiedDeal":true,"Client":"f3vnq2cmwig3qjisnx5hobxvsd4drn4f54xfxnv4tciw6vnjdsf5xipgafreprh5riwmgtcirpcdmi3urbg36a","Provider":"f01049918","Label":"Qmcnj4v7YKYyq6B3DpoyVM9SLPnWVmA59iVqQsWDi78cHN","StartEpoch":1045606,"EndEpoch":2540326,"StoragePricePerEpoch":"0","ProviderCollateral":"196040951948034","ClientCollateral":"0"},"ClientSignature":{"Type":2,"Data":"t/UsdWN2VWnKoAp5/APDjxqAUbySlP9rs09oGcEK9P/Bc2B+vr2/M/Cp7sR0hpdNDZrUmbFJ7FCA21p4mCYNn/XX9l3I2QD4zIew3bzLvawo4rRnv9JL/4Qpy0IcAaNb"},"ProposalCid":{"/":"bafyreic7xvflyzkfs72nzsighdemudo4f322bnicapovoqfabinx5ranxi"},"AddFundsCid":null,"PublishCid":null,"Miner":"12D3KooWA1o2Ay3Lf4Lizj1iyEPDpd8k6N8WHk4ZXJec48B5GcZy","Client":"12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw","State":20,"PiecePath":"","MetadataPath":"","SlashEpoch":0,"FastRetrieval":true,"Message":"","FundsReserved":"0","Ref":{"TransferType":"graphsync","Root":{"/":"Qmcnj4v7YKYyq6B3DpoyVM9SLPnWVmA59iVqQsWDi78cHN"},"PieceCid":null,"PieceSize":0,"RawBlockSize":0},"AvailableForRetrieval":false,"DealID":0,"CreationTime":"2021-08-15T23:23:10.419769071Z","TransferChannelId":{"Initiator":"12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien","Responder":"12D3KooWA1o2Ay3Lf4Lizj1iyEPDpd8k6N8WHk4ZXJec48B5GcZy","ID":1629050718006020143},"SectorNumber":0,"InboundCAR":"/mnt/md0/minerx-market/tmp/fstmp878096469"}}}
The only possibilities that are left are:
Analysis indicates there is a potential hardware defect at play here. Data points:
I've requested the miner to run memory and disk diagnostics. This is a very unsatisfying closure, but there's nothing else we can do here now.
Checklist
Latest release, or the most recent RC (release candidate) for the upcoming release or the dev branch (master), or have an issue updating to any of these.
Lotus component
lotus miner/market - storage deal
Lotus Version
Describe the Bug
The market process crashed today with a segfault. There were 22 ongoing data transfers.
Logging Information
Repo Steps