filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.85k stars 1.27k forks source link

Deals stuck in "StorageDealsTransferring" with 0B transferred #5268

Closed rjan90 closed 3 years ago

rjan90 commented 3 years ago

Describe the bug A couple of miners experienced that deals with the same client suddenly started to stop in "StorageDealsTransferring" with 0 bytes transferred, after successfully sealing deals from the client over many days. This issue is based on all the feedback and logs from both miners and the client in the Slack-thread https://filecoinproject.slack.com/archives/C01AZP8BKRQ/p1609164686470600

On the miner side

lotus-miner storage-deals list -v
Dec 28 14:47:07  bafyreigpl5vm64hhowjc2g5e7reppu6hngme5ff4lkaa3ykd6h2r5zveza  0        StorageDealTransferring             f3sar6tat2mhqadlqjjxewmpqya3whpkzuqaevpo5vwfvdhnm3uztd7p67d6ozpsfkiwvrvqrhp3zu6psi23ma  16GiB  0 FIL  1052529  12D3KooWSDhUpT9YMLxRTEAf3maSR34M64yYm4XBHKeYhwtbaUo7-12D3KooWMpuRzgVuRS6m285cucGgnXWwTnsNtGzjsMJeEqS7UtJw-426   

Data-Transfers shows that the transfer is ongoing, but with 0 bytes transferred:

lotus-miner data-transfers list
ID   Status   Receiving From  Root Cid     Initiated?  Transferred  Voucher                                   
426  Ongoing  ...hwtbaUo7     ...peoy6mfa  N           0B           ...7reppu6hngme5ff4lkaa3ykd6h2r5zveza"}}  

My miner has a connection with the client:

lotus-miner net connect /ip4/X.--..--.XX/tcp/36359/p2p/12D3KooWSDhUpT9YMLxRTEAf3maSR34M64yYm4XBHKeYhwtbaUo7
connect 12D3KooWSDhUpT9YMLxRTEAf3maSR34M64yYm4XBHKeYhwtbaUo7: success

If I try to restart the channel I can see in the logs that the channel has timed out:

2020-12-28T19:12:29.049+0100    INFO    markets loggers/loggers.go:20   storage event   {"name": "ProviderEventDataTransferRestarted", "proposal CID": "bafyreigpl5vm64hhowjc2g5e7reppu6hngme5ff4lkaa3ykd6h2r5zveza", "state": "StorageDealTransferring", "message": ""}
2020-12-28T19:12:29.053+0100    WARN    dt_graphsync    graphsync/graphsync.go:150  graphsync request context cancelled, channel Id: 12D3KooWSDhUpT9YMLxRTEAf3maSR34M64yYm4XBHKeYhwtbaUo7-12D3KooWMpuRzgVuRS6m285cucGgnXWwTnsNtGzjsMJeEqS7UtJw-426
2020-12-28T19:12:29.053+0100    WARN    dt-impl impl/events.go:196  channel 12D3KooWSDhUpT9YMLxRTEAf3maSR34M64yYm4XBHKeYhwtbaUo7-12D3KooWMpuRzgVuRS6m285cucGgnXWwTnsNtGzjsMJeEqS7UtJw-426 has timed out

On the client-side

Client Deals Dec 28 21:47:07 bafyreigpl5vm64hhowjc2g5e7reppu6hngme5ff4lkaa3ykd6h2r5zveza 0 f023467 StorageDealStartDataTransfer N N baga6ea4seaqg3tsrp56bchrhumenu4cjnl57cpxudfiggettmlejqggtwf74coa 15.88 GiB 0 FIL 1052529 true

Client Transfers 426 Requested ...EqS7UtJw ...peoy6mfa Y 0B ...7reppu6hngme5ff4lkaa3ykd6h2r5zveza"}}

The client checked his network bandwidth and didn't have any problems with wget or scp to another machine. But there are no clear lotus transfers in nethogs.

Client also checked his socket limit:

use ulimit -n
get 1024
Then is use  netstat |wc
There are about 850 connects. So it may be socket connections number limit.
Then I revised the limit to 64435 and restart servers. 
Using nethogs command, I find the send rate of lotus reach 10M/s. It seems ok.
Then I check the transfers, I find the max transfer id reach 655, but the previous max transfer id is 428. I didn't add any new deals in these times.

Possible problems might be:

  1. Socket Connection Limit
  2. Lotus Client

To Reproduce Steps to reproduce the behavior:

  1. Client send deals to miners over a longer period of time
  2. Miners are able to seal the deals for a longer period of time, and then suddenly at almost the same time multiple miners experience that deals stop in "StorageDealsTransferring" state

Version (run lotus version): Daemon: 1.4.0+git.e9989d0e4+api1.0.0 Local: lotus version 1.4.0+git.e9989d0e4

Additional context Ongoing discussion is also happening in the slack-channel. And this issue also seems to be related to #4946 & #5211

rjan90 commented 3 years ago

After experiencing another batch of deals stuck in "StorageDealsTransferring"-state with the same client, I think we have pin-pointed to at least one reason why this happens.

I restarted both my lotus daemon/miner due to some maintenance, and when I got the next storage-deal proposal after restarting my miner, the deal got stuck in "StorageDealsTransferring". To solve the problem the client also restarted his lotus-process, and then the transfers started.

So steps to reproduce:

  1. Client sends deal to a miner.Transfers work normally, and the miner seals the deal.
  2. Miner restarts the miner. Client send a new deal after miner is back online.
  3. The deal gets stuck in "StorageDealsTransferring"
dirkmc commented 3 years ago

Thank you for the detailed bug report. I believe this will be fixed by https://github.com/filecoin-project/lotus/pull/5210

I want to write some comprehensive tests around restarts to make sure we've covered all cases before merging this fix so it may take a few more days.