filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.83k stars 1.26k forks source link

when trying to make deals, if a bad miner causes an Error, all subsequent deal attempts fail even with healthy miners, until cli is killed. #8095

Closed smagdali closed 2 years ago

smagdali commented 2 years ago

Checklist

Lotus component

Lotus Version

lotus version 1.15.0-rc1+mainnet+git.3031b4d98

Describe the Bug

If, when trying to make a deal, you encounter a miner that throws an Error, that Error state persists until you kill the client and restart.

I tried this with many miners and combos (inc f0838467 f023467 f01240 f01234 f08403 f01278 f0773157) before identifying f010479 as the culprit that would poison all subsequent deal attempts.

https://github.com/filecoin-project/lotus/issues/7778 looks like it might be adjacent.

Seems like the error state is not being cleared correctly when starting the loop again.

Logging Information

2022-02-15T11:02:11.421Z        WARN    rpc     go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279  error in RPC call to 'Filecoin.ClientQueryAsk': bls signature failed to verify # BAD MINER
2022-02-15T11:02:53.701Z        WARN    rpc     go-jsonrpc@v0.1.4-0.20210217175800-45ea43ac2bec/handler.go:279  error in RPC call to 'Filecoin.ClientQueryAsk': bls signature failed to verify # GOOD MINER

# New Deal, GOOD MINER
2022/02/15 11:03:55 storess: Get /1
2022/02/15 11:03:55 storess: Get /2
2022-02-15T11:03:57.068Z        INFO    storagemarket_impl      clientstates/client_states.go:157       sending data for a deal bafyreifx6wns3cxfw3cv7xwo6ntgymbej5i6ftgpeco2sqs6y5oonn327q
2022-02-15T11:03:57.070Z        INFO    dt-impl impl/impl.go:189        open push channel to 12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f with base cid bafykbzacea6ns5swfl4rroe4sbhot3lw4vyarycejwmwpwfcxuku2sbb62qbm
2022-02-15T11:03:57.096Z        INFO    dt_graphsync    graphsync/graphsync.go:480      12D3KooWAchWy2cgE9KvH18k1S8HEVQRUpvTggVJh3HRPa8xwKT8-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1644856356955663059, GS pull request queued in response to our push, req={root:{str:p <vV*NvDMgآMH!} selector:0xc02db53820 priority:0 id:45 extensions:map[fil/data-transfer:[131 244 246 134 0 245 244 27 22 211 180 36 78 226 170 211 246 96] fil/data-transfer/1.1:[163 100 73 115 82 113 244 103 82 101 113 117 101 115 116 246 104 82 101 115 112 111 110 115 101 166 100 84 121 112 101 0 100 65 99 112 116 245 100 80 97 117 115 244 102 88 102 101 114 73 68 27 22 211 180 36 78 226 170 211 100 86 82 101 115 246 100 86 84 121 112 96] graphsync/dedup-by-key:[120 139 100 97 116 97 45 116 114 97 110 115 102 101 114 45 49 50 68 51 75 111 111 87 65 99 104 87 121 50 99 103 69 57 75 118 72 49 56 107 49 83 56 72 69 86 81 82 85 112 118 84 103 103 86 74 104 51 72 82 80 97 56 120 119 75 84 56 45 49 50 68 51 75 111 111 87 78 72 119 109 119 78 82 107 77 69 80 54 86 113 68 67 112 106 83 90 107 113 114 105 112 111 74 103 78 55 101 87 114 117 118 88 88 113 67 50 107 71 57 102 45 49 54 52 52 56 53 54 51 53 54 57 53 53 54 54 51 48 53 57]] isCancel:false isUpdate:false}
2022-02-15T11:03:57.099Z        INFO    graphsync       responsemanager/querypreparer.go:29     Processing request hooks for request: 45
2022-02-15T11:03:57.101Z        INFO    dt-impl impl/events.go:188      channel 12D3KooWAchWy2cgE9KvH18k1S8HEVQRUpvTggVJh3HRPa8xwKT8-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1644856356955663059: received new response, accepting channel
2022-02-15T11:03:57.103Z        INFO    dt_graphsync    graphsync/graphsync.go:985      incoming graphsync request      {"peer": "12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f", "graphsync request id": 45, "data transfer channel id": "12D3KooWAchWy2cgE9KvH18k1S8HEVQRUpvTggVJh3HRPa8xwKT8-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1644856356955663059"}
2022-02-15T11:04:16.010Z        INFO    dt-impl impl/events.go:270      channel 12D3KooWAchWy2cgE9KvH18k1S8HEVQRUpvTggVJh3HRPa8xwKT8-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1644856356955663059: transfer initiated by local node is complete
2022-02-15T11:04:16.044Z        INFO    dt-impl impl/events.go:205      channel 12D3KooWAchWy2cgE9KvH18k1S8HEVQRUpvTggVJh3HRPa8xwKT8-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1644856356955663059: received complete response, completing channel

Repo Steps

It's possible that 'f010479' has recovered by the time you get to this. At time of writing this was 100% reproducible.

  1. Try to make a deal with 'f010479' and then 'f01240' or any other good miners.

As soon as you've errored once, all further deals fail.

Test case:

'bad' miner - f010479 - throws a "Error: failed to query ask: bls signature failed to verify" when queried for a deal. good miner - f01240

1. Good miner offers me a deal:

Data CID (from lotus client import): bafykbzacea6ns5swfl4rroe4sbhot3lw4vyarycejwmwpwfcxuku2sbb62qbm
.. calculating data size
Deal duration (days): 300

Make this a verified deal? (yes/no): yes
Miner Addresses (f0.. f0..), none to find: f01240                                                        
.. querying miner asks
-----
Proposing from f1zk5oufzpzeokprrtwv7lsdkr5jl7fg3jkp5kplq
    Balance: 0 FIL

Piece size: 512MiB (Payload size: 408MiB)
Duration: 7200h0m0s
Total price: ~0 FIL (0 FIL per epoch)
Verified: true

Accept (yes/no): yes
.. executing
Deal (f01240) CID: bafyreibzjpjjcf3keakna7i2ei66cujxt6yogzg2swwodmvkipalwjoazy

2. Bad miner causes Error, and then Error persists for all miners until client is killed

Data CID (from lotus client import): bafykbzacea6ns5swfl4rroe4sbhot3lw4vyarycejwmwpwfcxuku2sbb62qbm
.. calculating data size
Deal duration (days): 300

Make this a verified deal? (yes/no): yes
Miner Addresses (f0.. f0..), none to find: f010479
.. querying miner asks
Error: failed to query ask: bls signature failed to verify
Miner Addresses (f0.. f0..), none to find: f01240
.. querying miner asks
Error: failed to query ask: bls signature failed to verify     # THIS SHOULD BE FINE
Miner Addresses (f0.. f0..), none to find: ^CError: reading miner address: EOF
ERROR: context canceled #Give up 
Reiers commented 2 years ago

Hi @smagdali

Thanks for the report. I have added correct labels. I noticed that you are running a old Lotus version (and the build is .dirty) there are a lot of improvements in the latest release. Could you try to upgrade Lotus, and see if you are able to reproduce the issue? I will leave it open and add to the board if the issue persist after the upgrade. Thanks !

Link to latest Lotus tags, here.

smagdali commented 2 years ago

Ah. Now I have to go and figure out where in the docs/tutorial I was following had the old link. (but will update and retest )

smagdali commented 2 years ago

Confirmed that this is still a bug with

lotus --version lotus version 1.15.0-rc1+mainnet+git.3031b4d98