filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/

Client initiates transfer but transfer doesn't begin with StorageDealStartDataTransfer #4946

Closed dineshshenoy closed 3 years ago

dineshshenoy commented 3 years ago

Previously running: fe448e. Upgraded to 1.2.0.

Tried against the following miners: f09848 f014394 f015927 f02665

Textile was successful with deal transfers on fe448e but is running into problems starting data transfers on 1.2.0.

Nov 20 02:03:21 bafyreifeu4k2lgizsdqmmqrbuyfow57qqihaeqgtaqnvk4cxfajtjnjuom 0 f09848 StorageDealStartDataTransfer N N baga6ea4seaqpi6qnbmkqf7b7psqheuidjwfxjcqrpy22hvsg3slogbcmf72wsfq 1.984 GiB 0.1045888 FIL 522944 false
Nov 20 02:03:21 bafyreicbcbjvwpimr2xgjov7t5fk4kvvobxfqb6oufozt7hqpoovtarz4y 0 f014394 StorageDealStartDataTransfer N N baga6ea4seaqpi6qnbmkqf7b7psqheuidjwfxjcqrpy22hvsg3slogbcmf72wsfq 1.984 GiB 0.1045976 FIL 522988 false
Nov 20 02:03:21 bafyreifcmj2bhm3cxw4rw22ns3wkwvjxk72njllvty2i36sumvwgeydrqq 0 f015927 StorageDealStartDataTransfer N N baga6ea4seaqpi6qnbmkqf7b7psqheuidjwfxjcqrpy22hvsg3slogbcmf72wsfq 1.984 GiB 0.0103161366 FIL 521017 false
Nov 20 02:06:21 bafyreigsjfa5ynj3v67uxnv3q5m4otvhprtka2hphiqxdj4teqak6sdf6m 0 f02665 StorageDealStartDataTransfer N N baga6ea4seaqd3w2d6namioztl4cegxnhbnvwjivzn5unxmdiegm5cuhgnu5biky 7.938 GiB 0.002090608 FIL 522652 false
Nov 20 03:03:17 bafyreibkoc54gbijomv6poceji6xrxsypy3w2ayz2vgnqwoysbyoirodam 0 f09848 StorageDealStartDataTransfer N N baga6ea4seaqo62sswgtnh7h324wwq4exqc7nwiq33i7rfn4a5whu3uyzncmdyea 1.984 GiB 0.1045648 FIL 522824 false
Nov 20 03:03:17 bafyreieys5gvo5kxa3vv2mq6fzslncfgvj637fisipelcn4to422nqqp2y 0 f014394 StorageDealStartDataTransfer N N baga6ea4seaqo62sswgtnh7h324wwq4exqc7nwiq33i7rfn4a5whu3uyzncmdyea 1.984 GiB 0.1045736 FIL 522868 false
Nov 20 03:03:18 bafyreignxfktgmaa66ypnqr75mreh6jv7wmgy35bcz7acmnbas4jj4kyda 0 f015927 StorageDealStartDataTransfer N N baga6ea4seaqo62sswgtnh7h324wwq4exqc7nwiq33i7rfn4a5whu3uyzncmdyea 1.984 GiB 0.0103137606 FIL 520897 false
Nov 20 03:06:21 bafyreicpnmoc2iee3zb5523rolfthkkw5xvwkwuyjvv4ubog3ckxla5roe 0 f02665 StorageDealStartDataTransfer N N baga6ea4seaqnhd5xfeuwcrndmeb5xhuum6cqfhprlxzdcetgqqjnbqrgey6mqca 7.938 GiB 0.002090128 FIL 522532 false
Nov 20 04:03:22 bafyreigqr6ifsqwycag3yn2xzicojnnvbt57w2xnwtzidgkmomqmkowv5m 0 f09848 StorageDealStartDataTransfer N N baga6ea4seaqnh3pbmml4o44nmuzb4tp3t4gv7kve2sugf4cb5yrsatshzmn2eki 1.984 GiB 0.1045408 FIL 522704 false

dirkmc commented 3 years ago

When this issue occurs, there is a single Open data transfer event and then no more events:

2020-11-23T14:06:18.887Z        DEBUG        markets        loggers/loggers.go:35        data transfer event        {"name": "Open", "status": "Requested", "transfer ID": 923, "channel ID": "12D3KooWHAN67HgLoqrygQnouFF5uVhwM4kP7zurpR65Powtfgsa-12D3KooWMTh8Cz9dbUik9nQGKE5aVkUhVLfGLm48vu3gaJXQQBYQ-923", "sent": 0, "received": 0, "queued": 0, "received count": 0, "total size": 0, "remote peer": "12D3KooWMTh8Cz9dbUik9nQGKE5aVkUhVLfGLm48vu3gaJXQQBYQ", "event message": "", "channel message": ""}
kernelogic commented 3 years ago

@magik6k I don't think adding logging will solve this issue. I have a lot of deals stuck in the StartDataTransfer state.

root@li824-220:~# lotus client deal-stats
Total: 791 deals, 1.516 TiB
StorageDealActive: 6 deals, 11.91 GiB
StorageDealProposalAccepted: 1 deals, 1.984 GiB
StorageDealSealing: 383 deals, 760 GiB
StorageDealFundsReserved: 21 deals, 41.67 GiB
StorageDealCheckForAcceptance: 5 deals, 9.922 GiB
StorageDealStartDataTransfer: 125 deals, 237.6 GiB
StorageDealTransferring: 26 deals, 51.59 GiB
StorageDealClientFunding: 1 deals, 1.984 GiB
StorageDealError: 223 deals, 435.6 GiB

kernelogic commented 3 years ago

Another node of mine has even more stuck in this state

Total: 1157 deals, 2.201 TiB
StorageDealActive: 104 deals, 199.4 GiB
StorageDealSealing: 125 deals, 241.1 GiB
StorageDealFundsReserved: 19 deals, 34.48 GiB
StorageDealStartDataTransfer: 381 deals, 740.9 GiB
StorageDealTransferring: 9 deals, 17.86 GiB
StorageDealError: 519 deals, 1020 GiB

dirkmc commented 3 years ago

@kernelogic could you please let us know which version of lotus you are running?

dirkmc commented 3 years ago

It would also be helpful if you could copy a few of the most common error messages you are seeing.

dirkmc commented 3 years ago

It appears that the issue is caused in part by a limit (maximum of 6) on the number of in-progress queries in go-graphsync: https://github.com/ipfs/go-graphsync/blob/785e0b3bbdd6b578e3274fb91e5963c219d372a2/responsemanager/responsemanager.go#L25

When the client wants to send files to the miner:

  1. The client opens a "push" data-transfer request to the miner
  2. The miner sends a graphsync query to the client, asking for the file
  3. The client responds to the graphsync request with the file

If the client is already responding to the maximum number of requests, the new request will be queued until there is an available request worker. The deal will remain in the StorageDealStartDataTransfer state.
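
To make the queueing behaviour concrete, here is a minimal, self-contained Go sketch of a fixed pool of request workers. The limit of 6 mirrors the constant linked above, but everything else (the channel, the worker loop) is illustrative only and is not go-graphsync's actual implementation.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// maxInProcessRequests mirrors the limit linked above; the rest of this
// file is an illustrative sketch, not go-graphsync's actual code.
const maxInProcessRequests = 6

func main() {
	requests := make(chan int)
	var wg sync.WaitGroup

	// A fixed number of workers drain the queue; any request beyond the
	// first 6 simply waits until a worker frees up, which is what a deal
	// stuck in StorageDealStartDataTransfer looks like from the outside.
	for w := 0; w < maxInProcessRequests; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for req := range requests {
				fmt.Printf("serving graphsync request %d\n", req)
				time.Sleep(time.Second) // stand-in for a slow file transfer
			}
		}()
	}

	// Ten incoming "push" transfers: the last four queue behind the first six.
	for i := 1; i <= 10; i++ {
		requests <- i
	}
	close(requests)
	wg.Wait()
}
```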

kernelogic commented 3 years ago

I'm using the latest lotus 1.2.1

dineshshenoy commented 3 years ago

Suggestions from standup, with three pieces to action:

  1. Create a parameter on the client side so that clients can increase the number of active deals from the current value of 6 to something larger.
  2. Log a WARN message when the queue is full so that the issue is discoverable. That could prompt the client to increase the parameter from item 1, invest in more hardware, or take some other action.
  3. Double check that deals that appear "stuck" actually become unstuck when the number of deals drops below the parameter. These shouldn't error out but should wait for a transfer slot to become available; they might take more time but shouldn't actually error.
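
As a rough sketch of items 1 and 2 (a configurable limit plus a WARN when the queue is full), the snippet below uses a hypothetical transferLimiter type. None of these names exist in lotus or go-graphsync; it only illustrates the proposed behaviour.

```go
package main

import (
	"log"
	"time"
)

// transferLimiter is a hypothetical helper: the limit is a configurable
// parameter instead of a hard-coded 6, and the node warns when a transfer
// has to wait for a free slot.
type transferLimiter struct {
	slots chan struct{}
}

func newTransferLimiter(maxInProcess int) *transferLimiter {
	return &transferLimiter{slots: make(chan struct{}, maxInProcess)}
}

// acquire blocks until a transfer slot is free, logging a warning if the
// queue is full so the operator knows to raise the limit or add hardware.
func (l *transferLimiter) acquire(dealID string) {
	select {
	case l.slots <- struct{}{}:
		return
	default:
		log.Printf("WARN: transfer queue full, deal %s waiting for a slot", dealID)
		l.slots <- struct{}{}
	}
}

func (l *transferLimiter) release() { <-l.slots }

func main() {
	lim := newTransferLimiter(6) // item 1: read this from config instead of hard-coding it
	lim.acquire("deal-123")
	defer lim.release()
	time.Sleep(100 * time.Millisecond) // stand-in for the actual data transfer
}
```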

kernelogic commented 3 years ago

I hope more deals can be transferring at the same time, because some miners can be really slow due to geographical distance. For example, my node's connection can upload at 50 MB/s, but a lot of the time it is only doing 500 KB/s while hundreds of deals sit in "StorageDealStartDataTransfer". I think the whole thing is clogged up by a few slow deals.

kernelogic commented 3 years ago

What could be the result of blindly increasing it to a very large number like 100?

What can I do right now to increase the limit without waiting for an official release? I don't see this file in the lotus repo. It's on the IPFS side.

kernelogic commented 3 years ago

These are the common error messages:

dirkmc commented 3 years ago

@kernelogic

kernelogic commented 3 years ago

Thanks for the explanation. So it's all normal then, just congestion.

dirkmc commented 3 years ago

When the client creates a deal with the provider:

  1. The go-data-transfer client sends a "push" request
  2. The go-data-transfer provider receives the "push" request and opens a go-graphsync query to the client
  3. The go-graphsync client receives the query and responds with the data for the file

I believe the root cause of this issue is that the provider does not open a graphsync query to the client in step 2. This could happen for a number of reasons.

Proposed fix: In go-data-transfer, when opening a push request, set a timeout by which the provider must begin the data transfer, and fail the transfer if the timeout expires.
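
A minimal sketch of that proposal, assuming a hypothetical started signal fired when the first transfer event arrives from the provider; it shows only the timeout pattern, not the real go-data-transfer interfaces.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errAcceptTimeout is returned when the provider never begins the transfer,
// so the deal can be failed instead of sitting in StorageDealStartDataTransfer.
var errAcceptTimeout = errors.New("provider did not begin data transfer in time")

// waitForTransferStart waits a bounded amount of time for the provider's
// first transfer event after the push request has been sent.
func waitForTransferStart(ctx context.Context, started <-chan struct{}, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	select {
	case <-started:
		return nil // provider opened its graphsync query; transfer is underway
	case <-ctx.Done():
		return errAcceptTimeout
	}
}

func main() {
	started := make(chan struct{}) // would be signalled by the first data transfer event
	err := waitForTransferStart(context.Background(), started, 2*time.Second)
	fmt.Println(err) // with no signal, prints the timeout error after 2s
}
```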

kernelogic commented 3 years ago

@dineshshenoy is this solved? Where?

dineshshenoy commented 3 years ago

Should have linked to #5211, which has the proposed fix and is being worked on shortly. Forked that off into a separate issue since this one had gotten unwieldy and a lot of the earlier work has been completed.