Closed ghost closed 3 years ago
cc @dirkmc @pooja
Thanks for the detailed bug report @mgoelzer ❤️
This log line suggests that there was a connection error between client and provider:
2021-03-29T18:08:31.138+0200 WARN storagemarket_impl impl/provider.go:517 failed to write deal status response: stream reset
We are currently working on making connections more robust: the client should try to reconnect to the provider when there's a connectivity problem.
This work has landed in release v1.2.3 of go-fil-markets for storage, and is in progress for retrieval.
Getting stuck in ResponderPaused may also be a symptom of this underlying issue: https://github.com/filecoin-project/go-data-transfer/issues/184
@mgoelzer points out this is reproducible, so it's unlikely to be caused by intermittent connection issues.
@mgoelzer One possibility is that your client is getting stuck trying to create a payment channel.
Could you check for stuck messages in your local mpool:
./lotus mpool pending --local
Could you also run the following to increase the logging on your client:
lotus log set-level --system dt-impl debug
lotus log set-level --system dt_graphsync debug
lotus log set-level --system markets debug
lotus log set-level --system data_transfer_network debug
Try both of these, depending on your version one of them should work:
lotus log set-level --system dt-pushchanmon debug
lotus log set-level --system dt-chanmon debug
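The logging commands above can also be run as one loop. This is only a sketch: the subsystem names are the ones listed in this comment, while `set_debug_levels` and the `LOTUS_BIN` variable are hypothetical names introduced here so the loop can be dry-run without a daemon.

```shell
# Raise every subsystem named in this comment to debug level.
# LOTUS_BIN is a hypothetical override (e.g. LOTUS_BIN=echo for a dry run);
# it defaults to the real `lotus` binary.
set_debug_levels() {
    for sys in dt-impl dt_graphsync markets data_transfer_network \
               dt-pushchanmon dt-chanmon; do
        # One of the two chanmon names is expected to fail, depending on
        # the lotus version, so a failure is reported and the loop continues.
        "${LOTUS_BIN:-lotus}" log set-level --system "$sys" debug ||
            echo "could not set level for $sys" >&2
    done
}

# Usage (needs a running lotus daemon):
#   set_debug_levels
```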
@mgoelzer I was able to retrieve the deal successfully from my client. I'm running the staging/minerx
branch so that may have a fix in it that you don't have on your client?
Ok, some new testing results on this.
First tried building the tip of master (version tag lotus version 1.5.3+mainnet+git.358773e2b): same result as original issue description.
Then tried building minerx/staging (also version tag lotus version 1.5.3+mainnet+git.358773e2b). But I got this error:
$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f for payload CID bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 (retrieval deal ID 2, state DealStatusAccepted) - existing deal must be cancelled before starting a new retrieval deal
Doing list-transfers shows a bunch of stalled retrievals with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f:
$ lotus client list-transfers
Sending Channels
Receiving Channels
ID Status Receiving From Root Cid Initiated? Transferred Voucher
3 ResponderPaused ...XqC2kG9f ...2n4oe546 Y 1.052MiB ...lIncrease":1048576,"UnsealPrice":"0"}
4 ResponderPaused ...XqC2kG9f ...2n4oe546 Y 1.052MiB ...lIncrease":1048576,"UnsealPrice":"0"}
5 ResponderPaused ...XqC2kG9f ...2n4oe546 Y 1.052MiB ...lIncrease":1048576,"UnsealPrice":"0"}
6 ResponderPaused ...XqC2kG9f ...2n4oe546 Y 1.052MiB ...lIncrease":1048576,"UnsealPrice":"0"}
Next step: blow away my .lotus directory and try again with the minerx branch...
Unless there is a way to kill a phantom transfer while Lotus thinks it's in progress?
"existing deal must be cancelled before starting a new retrieval deal" implies there is actually some way to cancel in-progress retrievals. But I can't find anything in the CLI help that looks like it would do this.
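To see at a glance which transfers are stalled, the list-transfers output can be filtered for rows in ResponderPaused. A minimal sketch, assuming the whitespace-separated column layout shown in this thread (ID first, Status second); `paused_transfer_ids` is a hypothetical helper name, not a lotus command.

```shell
# Print the ID of every transfer whose Status column is ResponderPaused.
# Header lines ("Sending Channels", "ID Status ...") are skipped because
# their first field is not a number.
paused_transfer_ids() {
    awk '$2 == "ResponderPaused" && $1 ~ /^[0-9]+$/ { print $1 }'
}

# Usage (needs a running lotus daemon):
#   lotus client list-transfers | paused_transfer_ids
```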
@mgoelzer does lotus client cancel-retrieval --deal-id=2 fix the problem?
@dirkmc Yes, your cancel-transfer command let me cancel all my transfers. I then repeated but hit the same problem.
But now I can't cancel the transfers anymore either! Before they had incremental integer ids like 1, 2, 3, 4, etc. Now it is a huge number and I get ERROR: failed to cancel retrieval deal: loadOrCreate state: saving initial state: failed to write cid field t.PayloadCID: undefined cid when trying to cancel.
I'm using the tip of master (lotus version 1.7.0-dev+mainnet+git.cf4128fc7). The reason I am trying to cancel the transfer is to downgrade to minerx/staging, which I think is based on 1.5.3. I'll try the downgrade anyway...
mwg@threadripper:~$ lotus client list-transfers
Sending Channels
Receiving Channels
ID Status Receiving From Root Cid Initiated? Transferred Voucher
1617328731745316884 ResponderPaused ...XqC2kG9f ...2n4oe546 Y 1.052MiB ...lIncrease":1048576,"UnsealPrice":"0"}
mwg@threadripper:~$ lotus client cancel-retrieval --deal-id=1617328731745316884
ERROR: failed to cancel retrieval deal: loadOrCreate state: saving initial state: failed to write cid field t.PayloadCID: undefined cid
mwg@threadripper:~$ lotus client list-transfers
Sending Channels
Receiving Channels
ID Status Receiving From Root Cid Initiated? Transferred Voucher
1617328731745316884 ResponderPaused ...XqC2kG9f ...2n4oe546 Y 1.052MiB ...lIncrease":1048576,"UnsealPrice":"0"}
Do you think id values like 1617328731745316884 are a potential separate bug in 1.7.x? Or an intentional change?
Tested with the staging/minerx branch. Even weirder result now:
mwg@threadripper:~/lotus$ lotus client retrieve --miner f01240 bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 baf-oe546.bin
> Recv: 0 B, Paid 0 FIL, ClientEventOpen (DealStatusNew)
ERROR: retrieval failed: Retrieve failed: there is an active retrieval deal with peer 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f for payload CID bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 (retrieval deal ID 6, state DealStatusAccepted) - existing deal must be cancelled before starting a new retrieval deal
mwg@threadripper:~/lotus$ lotus client list-transfers
Sending Channels
Receiving Channels
I tried a lotus client cancel-retrieval --deal-id=6, but still the same result as above.
@dirkmc I think we should consider closing this issue. If you were able to successfully retrieve bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546 from f01240, and there's no other bug report similar to this one, then the most likely explanation is that I'm in a corrupted state of some sort. That would probably be because I keep jumping back and forth between master and minerx/staging and 1.5.3, which isn't that likely in the wild.
I did open a bug for that bigint id thing in 1.7.0: https://github.com/filecoin-project/lotus/issues/5938
@mgoelzer minerx/staging has a migration that isn't in v1.5.3, so switching between them is probably going to mess up the state. I am going to go ahead and close this ticket.
The transfer ID format changed in the last release - instead of using a number that is stored in state and increments, we're now using a number based on the current time. This is to help avoid problems when people remove all their state and try to make deals with the same provider. Details here: https://github.com/filecoin-project/go-data-transfer/pull/169
In the next markets release there will be a similar change for the deal ID. This is not in any lotus branch yet, so it's safest not to wipe state at the moment.
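The time-based transfer IDs described above can be sanity-checked by decoding one. The sketch below assumes the ID is a Unix timestamp in nanoseconds; that unit is an inference from the size of the ID seen in this thread, not something stated here. `transfer_id_to_utc` is a hypothetical helper name.

```shell
# Interpret a transfer ID as nanoseconds since the Unix epoch (assumption)
# and print the corresponding UTC time.
transfer_id_to_utc() {
    secs=$(( $1 / 1000000000 ))   # drop the nanosecond part
    # GNU/busybox date takes "-d @SECONDS"; BSD/macOS date takes "-r SECONDS".
    date -u -d "@$secs" '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null ||
        date -u -r "$secs" '+%Y-%m-%dT%H:%M:%SZ'
}

transfer_id_to_utc 1617328731745316884   # prints 2021-04-02T01:58:51Z
```

Under that assumption the ID decodes to a time a few days after the 2021-03-29 log line quoted earlier in the thread, which is consistent with it being a creation timestamp rather than a corrupted counter.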
Basic Information
Here I describe a reproducible retrieval failure in which a previously stored CID (verified, fast retrieval, 32 GiB) gets "stuck" during retrieval in a ResponderPaused state. The indefinite hang appears on the client, but I've also included logs from the miner to help debug.
Describe the problem
Here's the info needed to reproduce the problem:
Miner: f01240
CID: bafykbzacea5dewvdatvbxc2tmi26bomowduqhoi7ery4yqi3n6li32n4oe546
Here's the problem as I observe it. When I try to retrieve this CID from a full node on another machine, the retrieval hangs forever at this point:
Running lotus client list-transfers gives this output during the hang:
Version
Client: lotus version 1.5.3-rc2+mainnet+git.9afb5ff94
Miner: also 1.5.3, but built from master so the version string is wrong. The build has all the merged PRs in 1.5.3.
Setup
Miner hardware unknown.
To Reproduce
Repro steps are above. This probably should be reproducible from any full node client.
Deal status
Lotus daemon and miner logs
Initially, right after the lotus client retrieve command was issued, we saw this in the logs:
Here is the full log spanning the entire time period in question.
lotus-miner.log.zip
Code modifications
No source code modifications.