filecoin-project / lotus

Reference implementation of the Filecoin protocol, written in Go
https://lotus.filecoin.io/
Other
2.85k stars 1.27k forks source link

[BUG] Storage deal fails with "waiting 30s for Complete message from remote peer" on v1.9.0 #6265

Closed cryptowhizzard closed 3 years ago

cryptowhizzard commented 3 years ago

Describe the bug Dealmaking seems to be broken on the latest v1.9.0-rc5 version. The client does not connect to the miner.

To Reproduce Steps to reproduce the behavior:

lotus client deal --from xxxxx etc.

Logs

From miner side :

cat /xxxxxx/lotus-miner.log | grep bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4
2021-05-15T22:49:57.738+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventOpen", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx 4", "state": "StorageDealValidating", "message": ""}
2021-05-15T22:49:57.742+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqk eekawghx4", "state": "StorageDealAcceptWait", "message": ""}
2021-05-15T22:49:57.761+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataRequested", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mq keekawghx4", "state": "StorageDealWaitingForData", "message": ""}
2021-05-15T22:49:57.765+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbm| dyc2u2mqkeekawghx4", "state": "StorageDealTransferring", "message": ""} | 2021-05-15T22:54:25.696+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferCancelled", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbm| dyc2u2mqkeekawghx4", "state": "StorageDealFailing", "message": "data transfer cancelled"} | 2021-05-15T22:54:25.698+0200 WARN providerstates providerstates/provider_states.go:536 deal bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4 failed: data transfer cancelled | 2021-05-15T22:54:25.755+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventFailed", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawg| hx4", "state": "StorageDealError", "message": "data transfer cancelled"} |


From client side :

cat /xxxxx/lotus-miner.log | grep bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4
2021-05-15T22:49:57.738+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventOpen", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx 4", "state": "StorageDealValidating", "message": ""}
2021-05-15T22:49:57.742+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqk eekawghx4", "state": "StorageDealAcceptWait", "message": ""}
2021-05-15T22:49:57.761+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataRequested", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mq keekawghx4", "state": "StorageDealWaitingForData", "message": ""}
2021-05-15T22:49:57.765+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbm dyc2u2mqkeekawghx4", "state": "StorageDealTransferring", "message": ""}
2021-05-15T22:54:25.696+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferCancelled", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbm dyc2u2mqkeekawghx4", "state": "StorageDealFailing", "message": "data transfer cancelled"}
2021-05-15T22:54:25.698+0200 WARN providerstates providerstates/provider_states.go:536 deal bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4 failed: data transfer cancelled
2021-05-15T22:54:25.755+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventFailed", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawg hx4", "state": "StorageDealError", "message": "data transfer cancelled"}
root@lotus-core:~/php# cat /root/lotus.log | grep bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx
2021-05-15T22:49:57.729+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventOpen", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4" , "state": "StorageDealReserveClientFunds", "message": ""}
2021-05-15T22:49:57.729+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundsReserved", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqke ekawghx4", "state": "StorageDealReserveClientFunds", "message": ""} | 2021-05-15T22:49:57.730+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundingComplete", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mq| keekawghx4", "state": "StorageDealFundsReserved", "message": ""} | 2021-05-15T22:49:57.763+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventInitiateDataTransfer", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc| 2u2mqkeekawghx4", "state": "StorageDealStartDataTransfer", "message": ""} | 2021-05-15T22:49:57.763+0200 INFO storagemarket_impl clientstates/client_states.go:157 sending data for a deal bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4 | 2021-05-15T22:49:57.766+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferInitiated", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdy| c2u2mqkeekawghx4", "state": "StorageDealTransferring", "message": ""}
2021-05-15T22:51:10.596+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferFailed", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u| 2mqkeekawghx4", "state": "StorageDealFailing", "message": "failed to complete data transfer: deal data transfer failed: 12D3KooWQrktQtwVbMBoeoN6te6i26s1gQPVfTYVgFs1U4ePmpys-12D3KooWNHwmwNRkMEP6VqDCpjSZ| kqripoJgN7eWruvXXqC2kG9f-1620897648502072936: timed out waiting 30s for Complete message from remote peer"} | 2021-05-15T22:51:10.596+0200 ERROR storagemarket_impl clientstates/client_states.go:315 deal bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawghx4 failed: failed to complete data | transfer: deal data transfer failed: 12D3KooWQrktQtwVbMBoeoN6te6i26s1gQPVfTYVgFs1U4ePmpys-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1620897648502072936: timed out waiting 30s for Complete me| ssage from remote peer | 2021-05-15T22:51:10.596+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFailed", "proposal CID": "bafyreiaqvvlxnjfkrrjsonvr32az6gxiam4yfpbmdyc2u2mqkeekawgh | 4", "state": "StorageDealError", "message": "failed to complete data transfer: deal data transfer failed: 12D3KooWQrktQtwVbMBoeoN6te6i26s1gQPVfTYVgFs1U4ePmpys-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWru| vXXqC2kG9f-1620897648502072936: timed out waiting 30s for Complete message from remote peer"}

dirkmc commented 3 years ago

The error message timed out waiting 30s for Complete message from remote peer indicates that after all the data finished transferring, the client waited 30s for the provider to send a "Completed" message, but the provider never sent that message.

@cryptowhizzard are you seeing this error every time you make a storage deal? Does it go away if you restart the miner?

It's possible that it's caused by a deadlock in the graphsync transport in go-data-transfer on the provider side. The deadlock is fixed by https://github.com/filecoin-project/lotus/pull/6149

cryptowhizzard commented 3 years ago

@dirkmc

Yes , the error occurs every time , 100%.

The situation is as follows : The machine is the lotus client + the lotus miner. I am sending the deal out from the same machine to the same machine ( So lotus daemon + lotus miner + lotus client deal on same ).

I did further investigation to help you , so i split the setup for test. Result : Lotus daemon @ 1.9.5 on and external machine , making the deals , miner on same machine as before ... problem remains the same , 100% fail.

One step further , i downgraded the testmachine with lotus daemon to 1.8. Sending the deal out to our miner @ 1.9.5 works , 100%.

P.s. .... i also tried this with the master release + #6149 . The problem remains. Only downgrading to 1.8 works ( for the client side sending out the deal ).

So my conclusion is that this problem is not in the miner side but on the client.

dirkmc commented 3 years ago

Thanks @cryptowhizzard , this is very helpful.

To summarize, you're saying that when the client and provider (miner) are running lotus v1.9.0-rc5, the deal fails with timed out waiting 30s for Complete message from remote peer. But when the client is running lotus v1.8.0 the deal succeeds. Is that correct?

The version of go-fil-markets shipped with lotus v1.8.0 doesn't check whether the provider sends a Complete message within the timeout, which may be why you're not seeing an error message when running lotus v1.8.0 on the client.

i downgraded the testmachine with lotus daemon to 1.8. Sending the deal out to our miner @ 1.9.5 works , 100%

When you say it works, do you mean that in the lotus client logs you see ClientEventDataTransferComplete and the deal moves to the StorageDealCheckForAcceptance state?

If you're comfortable modifying code, would you like to try making the Complete message timeout longer and see if that fixes the problem? The variable is set here: https://github.com/filecoin-project/lotus/blob/d3ad897fe98464343ecbc00f3895e52eca56412f/node/modules/client.go#L156

cryptowhizzard commented 3 years ago

@dirkmc

"To summarize, you're saying that when the client and provider (miner) are running lotus v1.9.0-rc5, the deal fails with timed out waiting 30s for Complete message from remote peer. But when the client is running lotus v1.8.0 the deal succeeds. Is that correct?"

Yes.

"When you say it works, do you mean that in the lotus client logs you see ClientEventDataTransferComplete and the deal moves to the StorageDealCheckForAcceptance state?"

From my 1.8 client machine i see this:

2021-05-17T23:31:18.293+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventOpen", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5xln5ooyakdoad g4g3jpzrv6xmoeci", "state": "StorageDealReserveClientFunds", "message": ""} 2021-05-17T23:31:18.293+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundsReserved", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5xln5 ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealReserveClientFunds", "message": ""} 2021-05-17T23:31:18.293+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundingComplete", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5xl n5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealFundsReserved", "message": ""} 2021-05-17T23:31:18.663+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventInitiateDataTransfer", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qyd kj5xln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealStartDataTransfer", "message": ""} 2021-05-17T23:31:18.663+0200 INFO storagemarket_impl clientstates/client_states.go:153 sending data for a deal bafyreigqdrj7zkbaocgfyf3qydkj5xln5ooyakdoadg4g3jpzrv6xmo eci 2021-05-17T23:31:18.870+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferInitiated", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qy dkj5xln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealTransferring", "message": ""} 2021-05-17T23:46:39.719+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferComplete", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qyd kj5xln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": ""} 2021-05-17T23:46:40.826+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-17T23:46:41.937+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-17T23:46:43.048+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"}

2021-05-18T00:39:21.817+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-18T00:39:22.922+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-18T00:39:24.160+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-18T00:39:25.270+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-18T00:39:26.378+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:27.486+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:28.685+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:29.792+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:30.899+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:32.006+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:33.114+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T00:39:34.219+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreigqdrj7zkbaocgfyf3qydkj5x ln5ooyakdoadg4g3jpzrv6xmoeci", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} etc. I can do a try tonight or tomorrow for you to see if a larger timeout will fix the issue.
dirkmc commented 3 years ago

I can do a try tonight or tomorrow for you to see if a larger timeout will fix the issue.

Thanks, that would be helpful 👍

To solve the issue I think we need to understand why the provider is not sending the "Complete" message at the end of the data transfer in lotus v1.9.0-rc5

Could you please enable these logs on the client & miner to see if they give us a clue:

# Client:
lotus log set-level --system dt-impl --system dt-chanmon --system dt_graphsync --system markets --system data_transfer_network debug

# Miner:
lotus-miner log set-level --system dt-impl --system dt-chanmon --system dt_graphsync --system markets --system data_transfer_network --system retrievaladapter --system retrieval --system advmgr --system go-car --system gs-traversal debug

Note that you can do this at runtime, no need to restart anything.

At the end of the transfer you should see a log on the provider side that says something like "sending completion message to initiator" and a log on the client side that says something like "received complete response, completing channel".

Thanks again for helping us track down these bugs 🙏

jennijuju commented 3 years ago

related #6273 and https://github.com/filecoin-project/go-data-transfer/issues/207

cryptowhizzard commented 3 years ago

hi @dirkmc

I don't think the debug commands gave anything more usefull , however your second solution works.

With no alter of code , but just with the extra debugging commands on client : 2021-05-18T22:11:41.989+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventOpen", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealReserveClientFunds", "message": ""} 2021-05-18T22:11:41.989+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundsReserved", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24p ", "state": "StorageDealReserveClientFunds", "message": ""} 2021-05-18T22:11:41.990+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundingComplete", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealFundsReserved", "message": ""} 2021-05-18T22:11:42.020+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventInitiateDataTransfer", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealStartDataTransfer", "message": ""} 2021-05-18T22:11:42.021+0200 INFO storagemarket_impl clientstates/client_states.go:157 sending data for a deal bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe 2021-05-18T22:11:42.023+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferInitiated", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealTransferring", "message": ""} 2021-05-18T22:12:53.825+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferFailed", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealFailing", "message": "failed to complete data transfer: deal data transfer failed: 12D3KooWQrktQtwVbMBoeoN6te6i26s1gQPVfTYVgFs1U4ePmpys-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1621337357679346570: timed out waiting 30s for Complete message from remote peer"} 2021-05-18T22:12:53.825+0200 ERROR storagemarket_impl clientstates/client_states.go:315 deal bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe failed: failed to complete data transfer: deal data transfer failed: 12D3KooWQrktQtwVbMBoeoN6te6i26s1gQPVfTYVgFs1U4ePmpys-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1621337357679346570: timed out waiting 30s for Complete message from remote peer 2021-05-18T22:12:53.825+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFailed", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealError", "message": "failed to complete data transfer: deal data transfer failed: 12D3KooWQrktQtwVbMBoeoN6te6i26s1gQPVfTYVgFs1U4ePmpys-12D3KooWNHwmwNRkMEP6VqDCpjSZkqripoJgN7eWruvXXqC2kG9f-1621337357679346570: timed out waiting 30s for Complete message from remote peer"}

With no alter of code , but just with the extra debugging commands on miner: 2021-05-18T22:11:41.996+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventOpen", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealValidating", "message": ""} 2021-05-18T22:11:42.001+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDealDeciding", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealAcceptWait", "message": ""} 2021-05-18T22:11:42.019+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataRequested", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealWaitingForData", "message": ""} 2021-05-18T22:11:42.022+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreica2f5ahh6my6gnfvdjuq6qx3no5amrzkzkbvz4tg5gjr2r6j24pe", "state": "StorageDealTransferring", "message": ""}

After raising the timeout to 600 seconds all succeeds ( ( lotus client / lotusminer and deal on same machine )

2021-05-18T22:30:51.937+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventOpen", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealReserveClientFunds", "message" : ""} 2021-05-18T22:30:51.937+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundsR eserved", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealReserveClientFunds", "message": ""} 2021-05-18T22:30:51.937+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventFundin gComplete", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealFundsReserved", "me ssage": ""} 2021-05-18T22:30:51.975+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventInitia teDataTransfer", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealStartDataTrans fer", "message": ""} 2021-05-18T22:30:51.976+0200 INFO storagemarket_impl clientstates/client_states.go:157 sending data for a deal bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm 2021-05-18T22:30:51.979+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTr ansferInitiated", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealTransferring" , "message": ""} 2021-05-18T22:33:26.245+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTr ansferComplete", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAccept ance", "message": ""} 2021-05-18T22:33:27.251+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitFo rDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance ", "message": "Provider state: StorageDealTransferring"} 2021-05-18T22:33:28.258+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitFo rDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance ", "message": "Provider state: StorageDealTransferring"}

Altered code :

AcceptTimeout: 600 time.Second, CompleteTimeout: 600 time.Second,

As you can see in the timestamps it took average 2.5 minutes on the machine ( lotus client / lotusminer and deal on same machine ) so the 30 second timeout is breaking things.

The deal went into verify data after 13 minutes :

2021-05-18T22:46:27.112+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-18T22:46:28.117+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealTransferring"} 2021-05-18T22:46:29.124+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T22:46:30.130+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"} 2021-05-18T22:46:31.138+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventWaitForDealState", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": "Provider state: StorageDealVerifyData"}

This is with your latest patch #6149 + master.

dirkmc commented 3 years ago

Thanks for the update @cryptowhizzard 👍

I would like to see how long it takes between when the client receives all blocks, and when the provider sends the Complete message.

Could you find the log lines

Then we can compare the timestamps of the two events and that should tell us how long it takes for the provider to send the Complete message after all data has been transfered.

cryptowhizzard commented 3 years ago

XXXX@lotus-core:~# cat lotus.log | grep bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm | grep ClientEventDataTransferComplete 2021-05-18T22:33:26.245+0200 INFO markets loggers/loggers.go:15 storage client event {"name": "ClientEventDataTransferComplete", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": ""}

XXXX@lotus-core:~# cat lotus-miner.log | grep bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm | grep ProviderEventDataTransferCompleted 2021-05-18T22:46:27.473+0200 INFO markets loggers/loggers.go:20 storage provider event {"name": "ProviderEventDataTransferCompleted", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealVerifyData", "message": ""}

daemon.txt miner.txt

dirkmc commented 3 years ago

So it looks like it takes 2 minutes 35 seconds to send the file from the client:

2021-05-18T22:30:51.979+0200    INFO   markets loggers/loggers.go:15   storage client event    {"name": "ClientEventDataTransferInitiated", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealTransferring", "message": ""}
2021-05-18T22:33:26.245+0200    INFO   markets loggers/loggers.go:15   storage client event    {"name": "ClientEventDataTransferComplete", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealCheckForAcceptance", "message": ""}

But it takes more than 15 minutes to complete the transfer on the provider side:

2021-05-18T22:30:51.977+0200    INFO   markets loggers/loggers.go:20   storage provider event  {"name": "ProviderEventDataTransferInitiated", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealTransferring", "message": ""}
2021-05-18T22:46:27.473+0200    INFO   markets loggers/loggers.go:20   storage provider event  {"name": "ProviderEventDataTransferCompleted", "proposal CID": "bafyreiemrvjnt4qvohejb6w74gzo2a2wwxsmu5ss4vs5zq6idgbn5blajm", "state": "StorageDealVerifyData", "message": ""}

I would like to see the log lines for the first and last of these data transfer events:

That should tell us if the long delay on the provider is caused by taking a long time to process incoming blocks, or if it's some process that happens after the last block is received.

Could you please grep the logs for "data transfer event" on the client and provider. You should see these logs if the "markets" log subsystem is set to "debug":

# Client:
lotus log set-level --system markets debug

# Miner:
lotus-miner log set-level --system markets debug
cryptowhizzard commented 3 years ago

Hi @dirkmc

While i was sleeping someone tried making deals / retrievals. This clogged up my miner ( but is probably another issue to visit outside this topic ).

I have 4 files attached for you. 2 files with fail ( including log ) when i tried to make a deal when the miner was clogged.

After a stop -> rm -rf in the data-transfer directory -> start in the daemon + miner dir, the deal worked.

So i have the miner + daemon files for you including all timestamps and info.

Let me know if you need anything else.

lf.tar.gz

hannahhoward commented 3 years ago

Summary and recommendations (from code owner of go-graphsync and original author of go-data-transfer):

Recap

Per @dirkmc here's the summary of the problem:

  1. Client is doing a storage deal with a provider
  2. Client opens a "push" channel to provider
  3. Provider makes a graphsync request to the client asking for the data
  4. Client sends all the data for the deal to the provider over graphsync
  5. Once the client has sent all data, the client waits for the provider to acknowledge completion
  6. Once the provider has received all data, the provider acknowledges by sending a "Complete" message

The problem is that it takes longer than 30s between stages 5 and 6.

Analysis

Either one of two things is happening:

  1. The client sent the data data just hasn’t got there yet
  2. Graphsync is being slow processing it.

My understanding @cryptowhizzard is that you're running these tests on the your own machine (i.e. you're running the client as well as the miner), so we can probably eliminate network traffic time.

This leaves processing time, and based on the success vs failure depending on whether your miner is clogged, I would say we're looking at CPU resource constraints or disk based resource constraints.

ATM, I am aware of 1 way Graphsync can become CPU constrained and one way it can become disk constrained. This is not to say these are the ONLY ways it can get held up, just the ones I'm aware of.

CPU Constraint: Graphsync has to hash every block that comes in on the network. There's no way to avoid this once -- in fact it'd be a major security problem if we didn't. On a big transfer, especially on a machine that is sealing, especially if it's not running a separate seal worker process, this could for sure become a time constraint. That being said -- the current go-graphsync in Lotus does it TWICE because go-ipld-prime used to force an extra hashing. This fixed in go-graphsync v0.8.0, which has not made up the dependency chain. (and unfortunately involves a couple breaking changes)

Disk constraint: essentially, BadgerDB. If you write lots of blocks at once, especially if other things are happening, that can take time. It's hard to know exactly how long it might take. There is an endeavor planned to write directly to CAR files instead of BadgerDB, that would make a big improvement in performance, but I believe that to be a longer term effort.

There may be other constraints. Graphsync has an in depth testground benchmarking suite here: https://github.com/ipfs/go-graphsync/tree/master/testplans/graphsync -- someone may want to try different parameters to identify possible constraints. (the parameters are documented in a readme)

Recommendations

Short Term

  1. There's no arbitrary reason for a 30s constraint, and I think it should be increased for now to 10 minutes or so. Seems silly to fail a deal just cause something is taking a while.
  2. I think the data transfer status "TransferComplete" on the client side -- a name I came up with -- is extremely confusing and should be changed! I recommend strongly something like "AwaitingConfirmationOfTransfer" or something :)

Medium Term

  1. Upgrade to go-graphsync v0.8.0
  2. Further analysis of graphsync performance issues.

Longer Term

  1. Write directly to CAR files in go-graphsync
  2. Complete miner / markets process seperation project
dirkmc commented 3 years ago

Thanks @hannahhoward I agree with the analysis and recommendations. I'm going to open a PR to bump up the timeout to 10 minutes.

dirkmc commented 3 years ago

Fix here: https://github.com/filecoin-project/lotus/pull/6300