rvagg opened 2 years ago
Marking this as a P2 for now, since "1m timeouts" are currently the most common autoretrieve error, and this is more of a data gathering task.
I was able to replicate this locally on Boost while doing retrieval testing with Lassie, and it was due to the unsealing RPC connection between lotus and boost resetting. It looks like the data transfer is cancelled by Boost, but instead of receiving that cancel event, the client side sees the unconfirmed block error.
Running the retrieval:
```
lassie fetch -p --provider=/ip4/127.0.0.1/tcp/50000/p2p/12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ -o bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy.car bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy

Fetching bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy from {12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ: [/ip4/127.0.0.1/tcp/50000]}
Querying indexer for bafykbzacea3m7wgziavfqkppcl7wv4pos3yhtrqwhlaoqtqsyddfr6ikuljiy...
Found 1 storage providers candidates from the indexer, querying it:
	12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ
Querying [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (started)...
Querying [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (connected)...
Query response from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ]: size=7.9 MiB, price-per-byte=0, unseal-price=0, message=
Retrieving from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (started)...
Retrieving from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (proposed)...
Retrieving from [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ] (accepted)...
retrieval failed: data transfer failed: unconfirmed block transfer
Retrieval failure for [12D3KooWGVQedkEJauq14rvq91UTYPXfNZzBgLwhtqAVPpKGa9PJ]: retrieval failed: data transfer failed: unconfirmed block transfer
all retrievals failed
```
Event logs on Boost's side:

| Timestamp | Duration | Event | Deal State | Message |
|---|---|---|---|---|
| 2023-02-23 12:59:57.393 | | DT:Open | | |
| 2023-02-23 12:59:57.516 | 123ms | Open | New | |
| 2023-02-23 12:59:57.690 | 174ms | DT:NewVoucherResult | | |
| 2023-02-23 12:59:57.749 | 59ms | DT:Accept | | |
| 2023-02-23 12:59:57.858 | 109ms | DealAccepted | Unsealing | |
| 2023-02-23 12:59:57.906 | 48ms | DT:PauseResponder | | |
| 2023-02-23 12:59:58.090 | 184ms | UnsealError | Failing | failed to load blockstore for piece baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to acquire shard for piece CID baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to unseal deal 2: RPCConnectionError |
| 2023-02-23 12:59:58.148 | 58ms | DT:Cancel | | |
| 2023-02-23 12:59:58.334 | 186ms | CancelComplete | Errored | failed to load blockstore for piece baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to acquire shard for piece CID baga6ea4seaqpdt7hwh6gdcuowwvx3v5ooblgbcdbnsax3x77dtr7azqsrhxoaiy: failed to unseal deal 2: RPCConnectionError |
| 2023-02-23 12:59:58.376 | 42ms | DT:ReceiveDataError | | stream reset |
| 2023-02-23 12:59:58.549 | 173ms | DT:CleanupComplete | | stream reset |
So @jacobheun what I read here is that the SP had an unseal error, canceled the retrieval, but the Lassie side recorded it as success (until it saw the lack of blocks)
Does that look right to you?
^^^ Damn, my initial hot take looks right! Who knew, 6-months-ago me.
> So @jacobheun what I read here is that the SP had an unseal error, canceled the retrieval, but the Lassie side recorded it as success (until it saw the lack of blocks)
Yes, exactly this.
This is now our next highest transfer error after #120. But it's also new: it was discovered last week, validated here with https://github.com/application-research/autoretrieve/pull/121, and turned from a silent `success` into an explicit `failure` coming out of Filclient with https://github.com/application-research/filclient/pull/101. So now we get `data transfer failed: unconfirmed block transfer` in the events database in `event_details>error`.

Summary of problem: a data transfer completes, giving no indication of errors or problems, but the blockstore doesn't have anything new in it. Graphsync is given the blockstore both to store new blocks in and to check for the pre-existence of blocks for a given transfer, so it doesn't need to double up. This error is triggered when we finish a data transfer, check the blockstore for the originally requested payload CID, and it isn't there. So Graphsync isn't receiving it, and it shouldn't be skipping it either.
Debugging such a transfer, we can see that graphsync gets two requests to deal with:
i.e. RequestPaused then RequestCancelled, and no blocks are mentioned or transferred at all for these. There's no partial, completed, failure, or any other event.
@hannahhoward's initial take is:
For now we've just turned it into an explicit failure on the client side, but we need to understand the causes and figure out if we can decrease the number of these we're encountering.