filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.85k stars 1.27k forks source link

Retrieval times out v1.13.2-rc7 #7732

Closed Angelo-gh3990 closed 5 months ago

Angelo-gh3990 commented 2 years ago

Checklist

Lotus component

Lotus Version

Daemon:  1.13.2-rc1+mainnet+git.52c7fa242+api1.2.0
Local: lotus-miner version 1.13.2-rc1+mainnet+git.52c7fa242

Describe the Bug

I get a timeout when trying to do a retrieve of a deal: (it does not gets saved)

/filclient/filc# ./filc retrieve --no-ipfs=true --miner=f000000 mAghSIFPWgF2bFIg/oIZEPypBRg8mjRSqdNtRrJKbNTMqX1dy -o /raid1-data/filclient/filc/

21:52:03 INFO Querying FIL retrieval candidates... 21:52:04 INFO Got back 1 retrieval query results of a total of 1 candidates 21:52:04 INFO Attempting FIL retrieval with miner f000000 from root CID bafybeict22af3gyura72bvseh4vecrqpe9grjktu3ni2zeu3guzsux2xoi (0 FIL) 5440040661 (5.1 GiB))

22:34:07 ERROR 12D3KooWQnQNnyvKgReSftX4o7LAWsq8FHUFtB3kxEKH6rqvmDuu-12D3KooWAkfsqw9HxWHcNhtFSmqeVcVmopdqaV7sHURYxYyyoMkT-1638478323675277129: closing data-transfer channel: 12D3KooWQnQNnyvKgReSftX4o7LAWsq8FHUFtB3kxEKH6rqvmDuu-12D3KooWAkfsqw9HxWHcNhtFSmqeVcVmopdqaV7sHURYxYyyoMkT-1638478323675277129: timed out waiting 40m0s for Complete message from remote peer

Logging Information

market:
2021-12-03T00:01:36.920+0100    INFO    markets loggers/loggers.go:30   retrieval provider event        {"name": "ProviderEventDealAccepted", "deal ID": "766899", "receiver": "12D3KooWQnQNnyvKgReSftX4o7LAWsq8FHUFtB3kxEKH6rqvmDuu", "state": "DealStatusUnsealing", "message": ""}
2021-12-03T00:01:36.946+0100    INFO    stores  stores/remote.go:677    Read remote http://x.x.x.x:2345/remote/unsealed/s-t000000-9999 (+0,8589934592)
2021-12-03T00:01:36.946+0100    DEBUG   advmgr  sector-storage/piece_provider.go:116    result of first tryReadUnsealedPiece: r=&{body:0xc014bb2c40 mu:{state:0 sema:0} closed:false rerr:<nil> fn:0xa38380 earlyCloseFn:0xa38300}, err=%!s(<nil>)
2021-12-03T00:01:36.946+0100    DEBUG   advmgr  sector-storage/piece_provider.go:156    unsealed piece already exists, no need to unseal, sector={ID:{Miner:10000 Number:9999} ProofType:8}, pieceOffset=0, size=8522825728
2021-12-03T00:01:37.359+0100    DEBUG   advmgr  sector-storage/piece_provider.go:165    returning reader to read unsealed piece, sector={ID:{Miner:10000 Number:9999} ProofType:8}, pieceOffset=0, startOffset=0, size=8522825728
2021-12-03T00:01:37.380+0100    INFO    markets loggers/loggers.go:30   retrieval provider event        {"name": "ProviderEventUnsealComplete", "deal ID": "766899", "receiver": "12D3KooWQnQNnyvKgReSftX4o7LAWsq8FHUFtB3kxEKH6rqvmDuu", "state": "DealStatusUnsealed", "message": ""}
2021-12-03T00:01:37.380+0100    INFO    dt-impl impl/impl.go:391        resume channel 12D3KooWQnQNnyvKgReSftX4o7LAWsq8FHUFtB3kxEKH6rqvmDuu-12D3KooWAkfsqw9HxWHcNhtFSmqeVcVmopdqaV7sHURYxYyyoMkT-1638478323675277129

miner:
2021-12-03T00:01:36.696+0100    INFO    stores  stores/http_handler.go:187      SERVE Alloc check /remote/unsealed/s-t000000-9999/8/allocated/0/8522825728
2021-12-03T00:01:36.696+0100    DEBUG   stores  stores/http_handler.go:281      returning ok: worker has unsealed file with unsealed piece, sector:{Miner:10000 Number:9999}, offset:0, size:8522825728
2021-12-03T00:01:36.912+0100    INFO    stores  stores/http_handler.go:187      SERVE Alloc check /remote/unsealed/s-t000000-9999/8/allocated/0/8522825728
2021-12-03T00:01:36.912+0100    DEBUG   stores  stores/http_handler.go:281      returning ok: worker has unsealed file with unsealed piece, sector:{Miner:10000 Number:9999}, offset:0, size:8522825728
2021-12-03T00:01:36.927+0100    INFO    stores  stores/http_handler.go:187      SERVE Alloc check /remote/unsealed/s-t000000-9999/8/allocated/0/8522825728
2021-12-03T00:01:36.927+0100    DEBUG   stores  stores/http_handler.go:281      returning ok: worker has unsealed file with unsealed piece, sector:{Miner:10000 Number:9999}, offset:0, size:8522825728
2021-12-03T00:01:36.929+0100    INFO    stores  stores/http_handler.go:187      SERVE Alloc check /remote/unsealed/s-t000000-9999/8/allocated/0/8522825728
2021-12-03T00:01:36.929+0100    DEBUG   stores  stores/http_handler.go:281      returning ok: worker has unsealed file with unsealed piece, sector:{Miner:10000 Number:9999}, offset:0, size:8522825728
2021-12-03T00:01:36.930+0100    INFO    stores  stores/http_handler.go:87       SERVE GET /remote/unsealed/s-t000000-9999

Repo Steps

  1. Run '...'
  2. Do '...'
  3. See error '...' ...
Angelo-gh3990 commented 2 years ago

So I found out that some retrievals work, but the one's that don't work start downloading and then give an timeout error like mentioned above... an error why the output file is not saved would be helpful, because you see it downloading.

magik6k commented 2 years ago

Would you be able to get miner logs after ./lotus-miner log set-level --system dagstore debug?

magik6k commented 2 years ago

Also lotus-miner pprof goroutines > goroutines.txt while the transfer is running

magik6k commented 2 years ago

Also, is your dagstore initialized? (see ./lotus-miner dagstore list-shards, all shards should be in ShardStateAvailable)

Angelo-gh3990 commented 2 years ago

when I run : lotus-miner dagstore list-shards All are in "ShardStateAvailable" Some are in "ShardStateServing" And some in "ShardStateNew" When I run : lotus-miner dagstore initialize-all --concurrency=1, it doesn't try to initialize more, so I assume the "ShardStateNew" will transition into "ShardStateAvailable" on it's own someday / somehow ? idk

I'll send you the logg's via slack as a DM, just waiting for the retrieval to timeout (40min)

@magik6k , thanks a million for looking into this

magik6k commented 2 years ago

Thanks, looked at the goroutines, and they seem to point at something weird with sector locks.

Would you be able to reproduce this with https://github.com/filecoin-project/lotus/pull/7735 applied, and then get output of lotus-miner storage locks?

Angelo-gh3990 commented 2 years ago

Building now, I will send you the info in slack later today. Thanks @magik6k

Angelo-gh3990 commented 2 years ago

I'm getting :

Before retrieval : 2976 PreCommit2 sealed(R:1) cache(W)

During retrieval : 2976 PreCommit2 sealed(R:1) cache(W)

During time-out wait (5min) :

2976    PreCommit2      sealed(R:1) cache(W)
2986    PreCommit2      sealed(R:1) cache(W)

During time-out wait (25min) : -- no sectors seen

After 40min time-out : 2987 AddPiece unsealed(W)

After stopping (ctrl-c) the retrieval : 2987 AddPiece unsealed(W)

magik6k commented 2 years ago

Can you try with https://github.com/filecoin-project/lotus/pull/7737 applied?

Angelo-gh3990 commented 2 years ago

@magik6k , yes I can in a couple of hours. Thanks

Angelo-gh3990 commented 2 years ago

@magik6k , I tried #7737 , but the same result... Download stops after while. Any logging that you need?

Angelo-gh3990 commented 2 years ago

with the new v1.13.2-rc2 I get the following when I cancel after timeout :

2021-12-08T23:20:05.804+0100    INFO    graphsync       impl/graphsync.go:468   Graphsync ReceiveError from 12D3KooWQnQNnxvKgReSftX9o7LAWsq8FHUFtB3kxEKH6rqvmDuu: stream reset
2021-12-08T23:20:05.804+0100    INFO    graphsync       messagequeue/messagequeue.go:252        graphsync send error: stream reset
2021-12-08T23:20:05.805+0100    INFO    graphsync       messagequeue/messagequeue.go:219        cant open message sender 
2021-12-08T23:20:05.808+0100    INFO    graphsync       responsemanager/server.go:230   response failed: %!w(queryexecutor.errorString=network error)
Reiers commented 2 years ago

Is it still a issue for you on v1.13.2-rc4 ? @Angelo-gh3990

Angelo-gh3990 commented 2 years ago

@Reiers yes :-)

Angelo-gh3990 commented 2 years ago

I just tried some other ones with filc and still having the same issue on rc4.

magik6k commented 2 years ago

If this still happens on latest master, would you be able to provide goroutines again, like in https://github.com/filecoin-project/lotus/issues/7732#issuecomment-985349585?

Angelo-gh3990 commented 2 years ago

@magik6k I send you a DM, test on version v1.13.2-rc5 is still the same. I'll test later with the master version if needed?

Angelo-gh3990 commented 2 years ago

@magik6k @jennijuju @Reiers ok so updated to v1.13.2-rc7 updated filc to latest master ran : sysctl -w net.core.rmem_max=2500000 for bigger buffer started download, download is insanely fast ... after a while (several GB's) it slow's down a lot but keeps going Download was successful this time .... yay :-) :-) Maybe more testing is needed, but look promising :-)

rjan90 commented 2 years ago

Did some tests on on rc7, using a couple vm´s on gcloud for filc: Retrieving to US-Central1 from Norway:

./filc retrieve --no-ipfs=true --miner=f023467 uAXESIBrWesax1xXLKARn_ped66Q58PRUHGXkNc0Xp9kSfNFA
08:24:38  INFO  Querying FIL retrieval candidates...
2022/01/06 08:24:38 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
08:24:41  INFO  Got back 1 retrieval query results of a total of 1 candidates
08:24:41  INFO  Attempting FIL retrieval with miner f023467 from root CID bafyreia22z5mnmoxcxfsqbdh72lz325ehhypiva4mxsdltixu7mre7gria (0 FIL)
08:25:27  INFO  FIL retrieval succeeded                                                                                                                                                                                      
RETRIEVAL STATS (FIL)
-----
Size:          925308135 (882 MiB)
Duration:      45.089918131s
Average Speed: 20521397 (20 MiB/s)
Ask Price:     0 (0 FIL)
Total Payment: 0 (0 FIL)
Num Payments:  0
Peer:          12D3KooWLVyjYwoWCPYmvnCkAzwmYBmPAR2zb7g1DT3eALWU2twY
unknown node type

Retrieving to Europe-North(Finland) from Norway:

./filc retrieve --no-ipfs=true --miner=f023467 uAXESIBrWesax1xXLKARn_ped66Q58PRUHGXkNc0Xp9kSfNFA
2022/01/06 08:44:19 failed to sufficiently increase receive buffer size (was: 208 kiB, wanted: 2048 kiB, got: 416 kiB). See https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size for details.
08:44:20  INFO  Querying FIL retrieval candidates...
08:44:20  INFO  Got back 1 retrieval query results of a total of 1 candidates
08:44:20  INFO  Attempting FIL retrieval with miner f023467 from root CID bafyreia22z5mnmoxcxfsqbdh72lz325ehhypiva4mxsdltixu7mre7gria (0 FIL)
08:44:57  INFO  FIL retrieval succeeded                                                                                                                                                                                      
RETRIEVAL STATS (FIL)
-----
Size:          520689159 (497 MiB)
Duration:      8.970264838s
Average Speed: 58046130 (55 MiB/s)
Ask Price:     0 (0 FIL)
Total Payment: 0 (0 FIL)
Num Payments:  0
Peer:          12D3KooWLVyjYwoWCPYmvnCkAzwmYBmPAR2zb7g1DT3eALWU2twY
unknown node type

I did not have any success when I tried to retrieve using PieceCID baga.., they just timed out. And also, I had even better performance (average speed), when trying to retrieve just the root CID.

Angelo-gh3990 commented 2 years ago

@rjan90 A quick test here in the EU from google pulling data from my dc in Brussels

./filc retrieve --no-ipfs=true --miner=f0xxxxx mAXCg5AIgzmWXB+M4hE8NojRdaMK/lGivppRvBzA4CabNRipbnu0
05:11:22  INFO  Querying FIL retrieval candidates...
05:11:23  INFO  Got back 1 retrieval query results of a total of 1 candidates
05:11:23  INFO  Attempting FIL retrieval with miner f0xxxxx from root CID bafykbzacedhglfyh4m4iitynui2f22gcx8kgrl5gsrxpymbybgtm2rrklopo2 (0 FIL)
05:15:04  INFO  FIL retrieval succeeded
RETRIEVAL STATS (FIL)
-----
Size:          21475902791 (20 GiB)
Duration:      3m40.932077383s
Average Speed: 97205906 (93 MiB/s)
Ask Price:     0 (0 FIL)
Total Payment: 0 (0 FIL)
Num Payments:  0
Peer:          12D3KooWAkfsqw9HxWHcNhtFSmqeVcVbibdqaV7sHURYxYyyoMkT
Angelo-gh3990 commented 2 years ago

@rjan90 I forgot, did you set the bigger buffer size ? https://github.com/lucas-clemente/quic-go/wiki/UDP-Receive-Buffer-Size sysctl -w net.core.rmem_max=2500000

rjan90 commented 2 years ago

@rjan90 I forgot, did you set the bigger buffer size ?

@Angelo-gh3990 No I forgot that! Will try a new run, might speed things up even more!

Wondertan commented 2 years ago

@Angelo-gh3990, how do you guys run the node with quic? When I add a quic address logs say that there is no transport registered for such address.

Also, quic by default is only in libp2p 0.16, while lotus(0.13.2-rcx) relies on 0.15. The line https://github.com/filecoin-project/lotus/blob/v1.13.2-rc7/node/modules/lp2p/transport.go#L12 is not used anywhere, thus I get the no transport registered error. Is this intended btw cc @magik6k ?

TippyFlitsUK commented 2 years ago

Hey @Angelo-gh3990

Just checking in. Any improvement on the newer Lotus releases?

Many thanks!

rjan90 commented 5 months ago

Closing this ticket as the Legacy Lotus/Lotus-Miner Markets sub-system reached EOL at the end of the 31st January 2023, and is being completely removed from the code here: