filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.85k stars 1.27k forks source link

Standalone market node crashed due to failed to addPiece seg fault #7111

Closed jennijuju closed 3 years ago

jennijuju commented 3 years ago

Checklist

Lotus component

lotus miner/market - storage deal

Lotus Version

m1.3.5

Describe the Bug

The market process crashed today with seg fault. 22 ongoing data transfer.

Aug 18 03:17:33 c2 lotus-miner-x[141673]: 2021-08-18T03:17:33.681Z        ERROR        storageadapter        storageadapter/provider.go:114        failed to addPiece for deal 2294107, err: normal shutdown of state machine
Aug 18 03:17:33 c2 lotus-miner-x[141673]: unexpected fault address 0x7f6ac3087000
Aug 18 03:17:33 c2 lotus-miner-x[141673]: fatal error: fault
Aug 18 03:17:33 c2 lotus-miner-x[141673]: [signal SIGSEGV: segmentation violation code=0x1 addr=0x7f6ac3087000 pc=0x5ca5af]

Logging Information

https://gist.github.com/jennijuju/9f80860874646dfbf725e5b1e13d30ea

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...' ...
aarshkshah1992 commented 3 years ago

Miner logs:

Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.993Z        INFO        sectors        storage-sealing/input.go:241        Adding piece for deal 2295942 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.994Z        INFO        sectors        storage-sealing/input.go:241        Adding piece for deal 2295943 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        ERROR        sectors        storage-sealing/input.go:383        sector {1049918 535} rejected deal bafy2bzaceajex5a22qxtcjcxzo2qf6eo5c7th5kf4w6d6bumheycwq2dbfn3k: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        WARN        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279        error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        INFO        sectors        storage-sealing/input.go:241        Adding piece for deal 2295944 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        ERROR        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321        error and res returned        {"request": {"jsonrpc":"2.0","id":32,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        ERROR        sectors        storage-sealing/input.go:383        sector {1049918 535} rejected deal bafy2bzacecr7inctrkkhmhcp6q4q55bpb45scrmrjevhqkif7oasxszpavnli: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        INFO        sectors        storage-sealing/input.go:241        Adding piece for deal 2295945 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        WARN        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279        error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        INFO        sectors        storage-sealing/input.go:241        Adding piece for deal 2295941 (publish msg: bafy2bzaced34zpuhcjtidaevp56bg7s23s43otnud7osevqcbwba5stmltqfy)
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.995Z        ERROR        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321        error and res returned        {"request": {"jsonrpc":"2.0","id":33,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z        ERROR        sectors        storage-sealing/input.go:383        sector {1049918 535} rejected deal bafy2bzacedjhqomttwlkb43us5dhmn4sch7ocnk6hcjx65gsejlt3qezahsrk: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z        WARN        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279        error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z        ERROR        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321        error and res returned        {"request": {"jsonrpc":"2.0","id":34,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z        ERROR        sectors        storage-sealing/input.go:383        sector {1049918 535} rejected deal bafy2bzacecoha5l5ahdp4ho42ctga7acfq5leuxarhl74gcecizswyko3ex4q: normal shutdown of state machi>
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z        WARN        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279        error in RPC call to 'Filecoin.SectorAddPieceToAny': normal shutdown of state machine:
Aug 18 10:15:00 c2 lotus-miner-x[141498]:     github.com/filecoin-project/go-statemachine.init
Aug 18 10:15:00 c2 lotus-miner-x[141498]:         /home/shane/go/pkg/mod/github.com/filecoin-project/go-statemachine@v1.0.1/machine.go:16
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.996Z        ERROR        rpc        go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:321        error and res returned        {"request": {"jsonrpc":"2.0","id":35,"method":"Filecoin.SectorAddPiec>
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.998Z        WARN        advmgr        sector-storage/manager.go:279        stub NewSector
Aug 18 10:15:00 c2 lotus-miner-x[141498]: 2021-08-18T10:15:00.998Z        INFO        sectors        storage-sealing/input.go:424        Creating sector        {"number": "539", "type": "deal", "proofType": 8}
aarshkshah1992 commented 3 years ago

@jacobheun @raulk

After some digging, here's what I have:

I'd love to have @magik6k 's eyes on this one.

raulk commented 3 years ago

What seems to be happening here is:

  1. Markets process calls SectorBlocks#AddPiece, which calls SectorBuilder#AddPieceToAny. This is initiated as part of the deal handoff, which calls the ProviderNodeAdapter#OnDealComplete method.
  2. I suspect HandoffDeal is going down the path of using the inbound CARv2 file: https://github.com/filecoin-project/go-fil-markets/blob/d3de422f4386fb444c5719e5e2297232d7aaa3dc/storagemarket/impl/providerstates/provider_states.go#L320. We open that file using the carv2 library, which mmaps the file. That's where the mmap entries come from in the stacktrace.
  3. Two things can be happening here:
    1. a condition occurs whereby the mmap is closed, but the JSON-RPC io.Reader encoder keeps reading from it and thus the OS raises a SIGSEGV.
    2. the CARv2 is corrupted and has an incorrect data offset that points to beyond the mmapped region, thus the first read panics immediately. How could this happen? Perhaps the blockstore was never finalized?
raulk commented 3 years ago

I think we're seeing 3.i.

raulk commented 3 years ago

Related fix, but non-causal: https://github.com/filecoin-project/lotus/pull/7135

To fix the warnings error and res returned shown in the miner log.

raulk commented 3 years ago

Trying to figure out what circumstance leads to the normal shutdown of state machine on the miner side, which then ends up interrupting the piece transfer and causing this panic on the markets node.

This message is ambiguous.

What's happening is likely the latter.

Copying from @magik6k:

So one issue is that we don't have any handling logic in storage-fsm for the AddPieceFailed state Another one is that, probably, when those sectors are in the AddPiece state, they can be assigned new pieces, and can still get additional SectorAddPiece events, which we do for increased packing efficiency. But.. When the sector gets into AddPieceFailed, we don't unregister the sector from Sealing.openSectors (map of sectors which are currently accepting deal data). This normally happens at the top of handlePacking (state handler for the Packing state, which is the first stage of the sealing pipeline adding all the null padding required; but when AddPiece fails we don't ever get there) Because the sector is still in the openSectors map while in AddPieceFailed, it will get SectorAddPiece events which will result in statemachine shutdown (because the AddPieceFailed doesn't expect any events)

raulk commented 3 years ago

There are two things to debug here:

  1. Why the state machine got into this condition? This is not the object of this issue.
  2. Why was the io.Reader json-rpc push not interrupted before SectorAddPieceToAny returned? That's what led the markets process to keep reading after closure.

This issue focuses on 2.

raulk commented 3 years ago

The reporter was using Lotus m1.3.5, which uses go-car/v2.0.2, which DOES NOT PANIC on reading from the SectionReader after closure.

In fact, I just discovered that reading from an mmap after closure will never panic, and will instead return an mmap: closed error.

https://github.com/golang/exp/blob/cf631fceb1a5cdfa1e4c7b6e352dff05d648121e/mmap/mmap_unix.go#L66-L69

raulk commented 3 years ago

Therefore I'm almost certain that this is a corrupted CAR. I'm going to ask the miner to try to send us the CAR.

EDIT: unfortunately we are using fstmp files in m1.3.5, so there's no way to identify which file was the offender :-(

raulk commented 3 years ago

So this is certainly a bad DataOffset or DataSize in the CARv2 header. This could happen because:

  1. ~We did not finalize the blockstore -- this is highly unlikely because we always finalize when verifying the received transfer. If the finalization fails, we fail the deal.~ (see below)
  2. There's some CARv2 bug, possibly on resumption?
    1. Counterargument: no other storage provider has reported this. This is literally an n=1 and randomly repetitive.
  3. ~The staging CARv2 file was deleted while it was being read via mmap, thus leading to the segmentation fault which golang.org/x/exp/mmap wasn't able to prevent because it wasn't a clean closure via the API.~ (see below)
    1. Miner claims the file was not deleted manually.
    2. The only piece of logic that deletes the file in m1.3.5 writeis TerminateBlockstore in go-fil-markets, which we only call when failing a deal. And I don't see journal events or log entries for deal failure.
raulk commented 3 years ago

Possibility 3 is discarded. Even if the file is deleted while mmapped, the file descriptor stays open and it stays accessible. Test:

import (
    "io/ioutil"
    "os"
    "path/filepath"
    "testing"

    "github.com/stretchr/testify/require"
    "golang.org/x/exp/mmap"
)

func TestMmapDelete(t *testing.T) {
    dir := t.TempDir()
    b := make([]byte, 16386)

    path := filepath.Join(dir, "foo.tst")

    err := ioutil.WriteFile(path, b, 0666)
    require.NoError(t, err)

    ra, err := mmap.Open(path)
    require.NoError(t, err)

    n, err := ra.ReadAt(b[:1], 1)
    require.NoError(t, err)
    require.EqualValues(t, 1, n)

    err = os.Remove(path)
    require.NoError(t, err)

    n, err = ra.ReadAt(b[:1], 100)
    require.EqualValues(t, 1, n)
    require.NoError(t, err)
}
raulk commented 3 years ago

Possibility 1 is discarded. Looking through the journal events, the deal that appears to trigger the segmentation fault has a "verified" event:

{"System":"markets/storage/provider","Event":"state_change","Timestamp":"2021-08-15T23:24:38.53053366Z","Data":{"Event":"ProviderEventVerifiedData","Deal":{"Proposal":{"PieceCID":{"/":"baga6ea4seaqmupgkvwfjw4ohkh4lryjirssq5smawnux2d62qqgn6kxqu4shumi"},"PieceSize":1073741824,"VerifiedDeal":true,"Client":"f3vnq2cmwig3qjisnx5hobxvsd4drn4f54xfxnv4tciw6vnjdsf5xipgafreprh5riwmgtcirpcdmi3urbg36a","Provider":"f01049918","Label":"Qmcnj4v7YKYyq6B3DpoyVM9SLPnWVmA59iVqQsWDi78cHN","StartEpoch":1045606,"EndEpoch":2540326,"StoragePricePerEpoch":"0","ProviderCollateral":"196040951948034","ClientCollateral":"0"},"ClientSignature":{"Type":2,"Data":"t/UsdWN2VWnKoAp5/APDjxqAUbySlP9rs09oGcEK9P/Bc2B+vr2/M/Cp7sR0hpdNDZrUmbFJ7FCA21p4mCYNn/XX9l3I2QD4zIew3bzLvawo4rRnv9JL/4Qpy0IcAaNb"},"ProposalCid":{"/":"bafyreic7xvflyzkfs72nzsighdemudo4f322bnicapovoqfabinx5ranxi"},"AddFundsCid":null,"PublishCid":null,"Miner":"12D3KooWA1o2Ay3Lf4Lizj1iyEPDpd8k6N8WHk4ZXJec48B5GcZy","Client":"12D3KooWCVXs8P7iq6ao4XhfAmKWrEeuKFWCJgqe9jGDMTqHYBjw","State":20,"PiecePath":"","MetadataPath":"","SlashEpoch":0,"FastRetrieval":true,"Message":"","FundsReserved":"0","Ref":{"TransferType":"graphsync","Root":{"/":"Qmcnj4v7YKYyq6B3DpoyVM9SLPnWVmA59iVqQsWDi78cHN"},"PieceCid":null,"PieceSize":0,"RawBlockSize":0},"AvailableForRetrieval":false,"DealID":0,"CreationTime":"2021-08-15T23:23:10.419769071Z","TransferChannelId":{"Initiator":"12D3KooWGBWx9gyUFTVQcKMTenQMSyE2ad9m7c9fpjS4NMjoDien","Responder":"12D3KooWA1o2Ay3Lf4Lizj1iyEPDpd8k6N8WHk4ZXJec48B5GcZy","ID":1629050718006020143},"SectorNumber":0,"InboundCAR":"/mnt/md0/minerx-market/tmp/fstmp878096469"}}}
raulk commented 3 years ago

The only possibilities that are left are:

  1. DataOffset and DataSize corruption on disk, somehow.
  2. Defective/corrupted memory. These are very extraneous events, but they do happen, and no other miner has reported this issue, so it's literally a n=1. The single miner that reported this issue also complains that it happens recurringly, which may point to defective RAM or disk.
raulk commented 3 years ago

Analysis indicates there is a potential hardware defect at play here. Data points:

I've requested the miner to run memory and disk diagnostics. This is a very unsatisfying closure, but there's nothing else we can do here now.