filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.83k stars 1.25k forks source link

[Deal Making Issue] Retrieval blocks forever in ResponderPaused state #5901

Closed ghost closed 3 years ago

ghost commented 3 years ago

Basic Information Here I describe a reproducible retrieval failure in which a previously stored CID (verified, fast retrieval, 32 GiB) gets "stuck" during retrieval in a ResponderPaused state.

The indefinite hang appears on the client, but I've also included logs from the miner to help debug.

Describe the problem

Here's the info needed to reproduce the problem:

Here's the problem as I observe it. When I try to retrieve this CID from a full node on another machine, the retrieval hangs forever at this point:

$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
> Recv: 0 B, Paid 0 FIL, ClientEventDealProposed (DealStatusWaitForAcceptance)
> Recv: 0 B, Paid 0 FIL, ClientEventDealAccepted (DealStatusAccepted)

Running lotus client list-transfers gives this output during the hang:

$ lotus client list-transfers
Sending Channels

Receiving Channels

ID  Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
4   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
$ lotus-miner retrieval-deals list
12D3KooWNvDJE1PVFxEQY2kmF5Y92YgLb3pBgpkBJy65DH1TaFNo 12584  ...cy7rxsui DealStatusCompleted  2       65536     
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 108   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 120   ...skcayvvc DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 130   ...skcayvvc DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 133   ...skcayvvc DealStatusFundsNeeded 2       2755818467  
12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6 1    ...2n4oe546 DealStatusCancelled  2       1102797   Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 114   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 115   ...duzwtsw6 DealStatusCancelled  2       1038144461  Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 116   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 122   ...wdrdo246 DealStatusErrored   2       0      reading piece from sealed sector: closing
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 126   ...wdrdo246 DealStatusCancelled  2       0      Client cancelled retrieval
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 138   ...wdrdo246 DealStatusUnsealed   2       0       
12D3KooWSrVTGNQP7UaXJYB8r6a2bfEcHzt2gnrHrrqwu5zYK1AD 1409  ...86jZhi7Q DealStatusCompleted  2       125032704   
12D3KooWSkFT2s45qQXRFo83GkcyFRm6jnxtWjU2VVSZPX4jGAhC 107   ...duzwtsw6 DealStatusErrored   2       6216279044  deal data transfer stalled (peer hungup)

Version

Client: lotus version 1.5.3-rc2+mainnet+git.9afb5ff94

Miner: also 1.5.3, but built from master so the version string is wrong. The build has all the merged PRs in 1.5.3.

Setup Miner hardware unknown.

To Reproduce Repro steps are above. This probably should be reproducible from any full node client.

Deal status

$ lotus client list-transfers -v
Sending Channels

Receiving Channels

ID  Status           Receiving From                                        Root Cid                                                        Initiated?  Transferred  Voucher
4   ResponderPaused  12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f  bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546  Y           1.052MiB     {"PayloadCID":{"/":"bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546"},"ID":3,"Selector":{"Raw":"oWFSomFsoWRub25loGI6PqFhYaFhPqFhQKA="},"PieceCID":null,"PricePerByte":"2","PaymentInterval":1048576,"PaymentIntervalIncrease":1048576,"UnsealPrice":"0"}

Lotus daemon and miner logs

Initially, right after the lotus client retrieve command was issued, we saw this in the logs:

il>} todo:[]} {allocated:{memUsedMin:0 memUsedMax:0 gpuUsed:false cpuUse:0 cond:<nil>} todo:[]}]
2021-03-29T18:08:02.202+0200    DEBUG   advmgr  sector-storage/sched.go:450 SCHED Acceptable win: [[] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] [] []]
2021-03-29T18:08:06.467+0200    INFO    miner   miner/miner.go:383  Time delta between now and our mining base: 6s (nulls: 0)
2021-03-29T18:08:27.971+0200    INFO    markets loggers/loggers.go:30   retrieval provider event    {"name": "ProviderEventUnsealComplete", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusUnsealed", "message": ""}
2021-03-29T18:08:27.971+0200    INFO    dt-impl impl/impl.go:370    resume channel 12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-4
2021-03-29T18:08:27.972+0200    INFO    markets loggers/loggers.go:30   retrieval provider event    {"name": "ProviderEventBlockSent", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusOngoing", "message": ""}
2021-03-29T18:08:27.976+0200    INFO    markets loggers/loggers.go:30   retrieval provider event    {"name": "ProviderEventBlockSent", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusOngoing", "message": ""}
2021-03-29T18:08:27.980+0200    INFO    markets loggers/loggers.go:30   retrieval provider event    {"name": "ProviderEventPaymentRequested", "deal ID": "3", "receiver": "12D3KooWGLRd5b3zzuJRHgodQz5xJnJ2ETv2h5vHGx849RNhBqP6", "state": "DealStatusFundsNeeded", "message": ""}
2021-03-29T18:08:31.138+0200    WARN    storagemarket_impl  impl/provider.go:517    failed to write deal status response: stream reset
2021-03-29T18:08:36.273+0200    INFO    miner   miner/miner.go:383  Time delta between now and our mining base: 6s (nulls: 0)
2021-03-29T18:09:01.598+0200    DEBUG   advmgr  sector-storage/sched.go:354 SCHED 35 queued; 40 open windows

Here is the full log spanning the entire time period in question.

lotus-miner.log.zip

Code modifications

No source code modifications.

ghost commented 3 years ago

cc @dirkmc @pooja

dirkmc commented 3 years ago

Thanks for the detailed bug report @mgoelzer ❤️

This log line suggests that there was a connection error between client and provider:

2021-03-29T18:08:31.138+0200    WARN    storagemarket_impl  impl/provider.go:517    failed to write deal status response: stream reset

We are doing some work at the moment to improve connectivity issues, basically the client should try to reconnect to the provider when there's a connectivity problem.

This work has landed in release v1.2.3 of go-fil-markets for storage, and is in progress for retrieval.

dirkmc commented 3 years ago

Getting stuck in ResponderPaused may also be a symptom of this underlying issue: https://github.com/filecoin-project/go-data-transfer/issues/184

dirkmc commented 3 years ago

@mgoelzer points out this is reproducible so it's unlikely to be caused by intermittent connection issues

dirkmc commented 3 years ago

@mgoelzer One possibility is that your client is getting stuck trying to create a payment channel.

Could you check for stuck messages in your local mpool:

./lotus mpool pending --local

Could you also run the following to increase the logging on your client:

lotus log set-level --system dt-impl debug
lotus log set-level --system dt_graphsync debug
lotus log set-level --system markets debug
lotus log set-level --system data_transfer_network debug

Try both of these, depending on your version one of them should work:

lotus log set-level --system dt-pushchanmon debug
lotus log set-level --system dt-chanmon debug
dirkmc commented 3 years ago

@mgoelzer I was able to retrieve the deal successfully from my client. I'm running the staging/minerx branch so that may have a fix in it that you don't have on your client?

ghost commented 3 years ago

Ok, some new testing results onthis.

First tried building the tip of master (version tag lotus version 1.5.3+mainnet+git.358773e2b): same result as original issue description

Then tried building minerx/staging (also version tag lotus version 1.5.3+mainnet+git.358773e2b). But I got this error:

$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f for payload CID bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 (retrieval deal ID 2, state DealStatusAccepted) - existing deal must be cancelled before starting a new retrieval deal

Doing list-transfers shows a bunch of stalled retrievals with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f:

$ lotus client list-transfers
Sending Channels

Receiving Channels

ID  Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
3   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
4   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
5   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
6   ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}

Next step: blow away my .lotus directory and try again with the minerx branch...

Unless there is a way to kill a phantom transfer while Lotus thinks it's in progress?

ghost commented 3 years ago

existing deal must be cancelled before starting a new retrieval deal implies there is actually some way to cancel in-progress retrievals. But I can't find anything in the CLI help that looks like it would do this.

dirkmc commented 3 years ago

@mgoelzer does lotus client cancel-retrieval --deal-id=2 fix the problem?

ghost commented 3 years ago

@dirkmc Yes, your cancel-transfer command let me cancel all my transfers. I then repeated but hit the same problem.

But now I can't cancel the transfers anymore either! Before they had incremental integer ids like 1, 2, 3, 4, etc. Now it is a huge number and I get ERROR: failed to cancel retrieval deal: loadOrCreate state: saving initial state: failed to write cid field t.PayloadCID: undefined cid when trying to cancel.

I'm using the tip of master (lotus version 1.7.0-dev+mainnet+git.cf4128fc7). The reason I am trying to cancel the transfer is to downgrade to minerx/staging, which I think is based on 1.5.3. I'll try the downgrade anyway...

mwg@threadripper:~$ lotus client list-transfers
Sending Channels

Receiving Channels

ID                   Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
1617328731745316884  ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}
mwg@threadripper:~$ lotus client cancel-retrieval --deal-id=1617328731745316884
ERROR: failed to cancel retrieval deal: loadOrCreate state: saving initial state: failed to write cid field t.PayloadCID: undefined cid

mwg@threadripper:~$ lotus client list-transfers
Sending Channels

Receiving Channels

ID                   Status           Receiving From  Root Cid     Initiated?  Transferred  Voucher
1617328731745316884  ResponderPaused  ...XqC2kG9f     ...2n4oe546  Y           1.052MiB     ...lIncrease":1048576,"UnsealPrice":"0"}

Do you think id values like 1617328731745316884 is a potential separate bug in 1.7.x? Or an intentional change?

ghost commented 3 years ago

Tested with the staging/minerx branch. Even weirder result now:

mwg@threadripper:~/lotus$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f for payload CID bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 (retrieval deal ID 6, state DealStatusAccepted) - existing deal must be cancelled before starting a new retrieval deal

mwg@threadripper:~/lotus$ lotus client list-transfers
Sending Channels

Receiving Channels

I tried a lotus client cancel-retrieval --deal-id=6, but still the same result as above.

ghost commented 3 years ago

@dirkmc I think we should consider closing this issue. If you were able to successfully retrieve bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 from f01240, and there's no other bug report similar to this one, then the most likely explanation is that I'm in a corrupted state of some sort. That would probably be because I keep jumping back and forth between master and minerx/staging and 1.5.3, which isn't that likely in the wild.

ghost commented 3 years ago

I did open a bug for that bigint id thing in 1.7.0: https://github.com/filecoin-project/lotus/issues/5938

dirkmc commented 3 years ago

@mgoelzer minerx/staging has a migration that isn't in v1.5.3, so switching between them is probably going to mess up the state. I am going to go ahead and close this ticket.

The transfer ID format changed in the last release - instead of using a number that is stored in state and increments, we're now using a number based on the current time. This is to help avoid problems when people remove all their state and try to make deals with the same provider. Details here: https://github.com/filecoin-project/go-data-transfer/pull/169

In the next markets release there will be a similar change for the deal ID. This is not in any lotus branch yet, so it's safest not to wipe state at the moment.