cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.26k stars 3.62k forks source link

[Bug]: state sync context deadline exceeded #20990

Open scirner22 opened 3 months ago

scirner22 commented 3 months ago

Is there an existing issue for this?

What happened?

Attempting to state sync against a snapshot always produces a context deadline exceeded error.

available snapshots

ls .provenance/data/snapshots/
23608000     23612000     23616000     metadata.db

state sync attempt logs

9:25PM INF service start impl=MConn{20.237.186.30:26656} module=p2p msg="Starting MConnection service" peer=e0dc85f258076fc91411a29b4ddbacb0ecc1e565@20.237.186.30:26656
9:25PM INF Discovered new snapshot format=3 hash=436BE77386309D8C6E285B35D6E4793355AAD293F1383BCD60BEA76554160463 height=23552000 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=7C78131231B759A4CD832AA2DED047646DE6572A81A7B9A7D5E367E050D35DA1 height=1144700 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=B4E0A8E39933B539968D23E135E8D5494C6F03E57BD151AE866D8F995C7CCE86 height=1144600 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=2B062F7CC94101F853B2597CD2D03581D52295360C34A8395840CFD122151C38 height=1144500 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=D41271C4CFA6AACC02EBA16D20B1345DFEF902DFFFEFBF96879411EC1A9D0F36 height=1144400 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=A66CD7E05ACDDF594BE8A944C00981DE19ACE521D30A2BFC3F6EBD265DC30C20 height=1144300 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=A9473F984132454B3B8D49973628051B9BE4A9BD99C5C515887FCABD40A544DF height=1144200 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=69A7C500B97D3B3CF670979CE54B2A1BDFA76D06EBADEDE0027FEDE566501CCA height=1144100 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=1446EE1E67B2F6792C899A97A6F1FB1325C63C87EC35B66890BE8211FECF02E9 height=1144000 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=4512A1AE11A52E4C4143FFEBDE7004C11B47DBCC11136CDDF4724AB278EA6FE4 height=1143900 module=statesync
9:25PM INF Discovered new snapshot format=1 hash=72C06278451ABB93E360446854456F548289D7FD89D41A901A7092A25E71827C height=1143800 module=statesync
9:25PM INF service start impl="Peer{MConn{218.153.200.69:12002} 80c6e77e43f5c7ecc74ed91243b39ba9e0a3938f out}" module=p2p msg="Starting Peer service" peer=80c6e77e43f5c7ecc74ed91243b39ba9e0a3938f@218.153.200.69:12002
9:25PM INF service start impl=MConn{218.153.200.69:12002} module=p2p msg="Starting MConnection service" peer=80c6e77e43f5c7ecc74ed91243b39ba9e0a3938f@218.153.200.69:12002
[2024-07-19T01:25:38Z INFO  state_syncer] latest height = 0
9:25PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=5 numToDial=15
9:25PM INF We need more addresses. Sending pexRequest to random peer module=pex peer="Peer{MConn{35.227.40.210:26656} 5f04a363b22e0e9861f8766eef2ab919d70d585c out}"
9:25PM INF service start impl="Peer{MConn{34.23.16.192:26656} 68832a03dd43e74ee56c2ba94100797a23ce0f2d out}" module=p2p msg="Starting Peer service" peer=68832a03dd43e74ee56c2ba94100797a23ce0f2d@34.23.16.192:26656
9:25PM INF service start impl=MConn{34.23.16.192:26656} module=p2p msg="Starting MConnection service" peer=68832a03dd43e74ee56c2ba94100797a23ce0f2d@34.23.16.192:26656
9:25PM INF VerifyHeader hash=79878F47E9CA9B4F0C8189353AC9D29AA2E9568A3B365A0DA3831CDA966F0365 height=23552001 module=light
[2024-07-19T01:26:08Z INFO  state_syncer] latest height = 0
9:26PM INF Ensure peers module=pex numDialing=0 numInPeers=0 numOutPeers=6 numToDial=14
9:26PM INF We need more addresses. Sending pexRequest to random peer module=pex peer="Peer{MConn{34.23.16.192:26656} 68832a03dd43e74ee56c2ba94100797a23ce0f2d out}"
9:26PM INF service start impl="Peer{MConn{34.138.124.81:26656} 72f0756a5af52348e658534f8e597bf611c4d337 out}" module=p2p msg="Starting Peer service" peer=72f0756a5af52348e658534f8e597bf611c4d337@34.138.124.81:26656
9:26PM INF service start impl=MConn{34.138.124.81:26656} module=p2p msg="Starting MConnection service" peer=72f0756a5af52348e658534f8e597bf611c4d337@34.138.124.81:26656
9:26PM INF error from light block request from primary, removing... error="post failed: Post \"http://localhost:26699\": context deadline exceeded" height=23607756 module=light primary=http{http://localhost:26699}
9:26PM ERR error on light block request from witness, removing... error="post failed: Post \"http://localhost:26699\": context deadline exceeded" module=light primary=http{http://localhost:26699}
9:26PM ERR failed to remove witnesses err="no witnesses connected. please reset light client" module=light witnessesToRemove=[0]
9:26PM ERR Can't verify err="failed to obtain the header at height #23607756: post failed: Post \"http://localhost:26699\": context deadline exceeded" module=light
9:26PM INF failed to fetch and verify app hash err="failed to obtain the header at height #23607756: post failed: Post \"http://localhost:26699\": context deadline exceeded" module=statesync
9:26PM INF Snapshot rejected format=3 hash=436BE77386309D8C6E285B35D6E4793355AAD293F1383BCD60BEA76554160463 height=23552000 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144701 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=7C78131231B759A4CD832AA2DED047646DE6572A81A7B9A7D5E367E050D35DA1 height=1144700 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144601 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=B4E0A8E39933B539968D23E135E8D5494C6F03E57BD151AE866D8F995C7CCE86 height=1144600 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144501 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=2B062F7CC94101F853B2597CD2D03581D52295360C34A8395840CFD122151C38 height=1144500 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144401 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=D41271C4CFA6AACC02EBA16D20B1345DFEF902DFFFEFBF96879411EC1A9D0F36 height=1144400 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144301 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=A66CD7E05ACDDF594BE8A944C00981DE19ACE521D30A2BFC3F6EBD265DC30C20 height=1144300 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144201 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=A9473F984132454B3B8D49973628051B9BE4A9BD99C5C515887FCABD40A544DF height=1144200 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144101 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=69A7C500B97D3B3CF670979CE54B2A1BDFA76D06EBADEDE0027FEDE566501CCA height=1144100 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1144001 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=1446EE1E67B2F6792C899A97A6F1FB1325C63C87EC35B66890BE8211FECF02E9 height=1144000 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1143901 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=4512A1AE11A52E4C4143FFEBDE7004C11B47DBCC11136CDDF4724AB278EA6FE4 height=1143900 module=statesync
9:26PM INF error from light block request from primary, replacing... error="light block not found" height=1143801 module=light primary=http{http://localhost:26699}
9:26PM INF failed to fetch and verify app hash err="light block not found" module=statesync
9:26PM INF Snapshot rejected format=1 hash=72C06278451ABB93E360446854456F548289D7FD89D41A901A7092A25E71827C height=1143800 module=statesync

Configuration

[statesync]
# State sync rapidly bootstraps a new node by discovering, fetching, and restoring a state machine
# snapshot from peers instead of fetching and replaying historical blocks. Requires some peers in
# the network to take and serve state machine snapshots. State sync is not attempted if the node
# has any local state (LastBlockHeight > 0). The node will have a truncated block history,
# starting from the height of the snapshot.
enable = true

# RPC servers (comma-separated) for light client verification of the synced state machine and
# retrieval of state data for node bootstrapping. Also needs a trusted height and corresponding
# header hash obtained from a trusted source, and a period during which validators can be trusted.
#
# For Cosmos SDK-based chains, trust_period should usually be about 2/3 of the unbonding time (~2
# weeks) during which they can be financially punished (slashed) for misbehavior.
rpc_servers = "http://localhost:26699,http://localhost:26699"
trust_height = 23608000
trust_hash = "D81553243F02857403E0B4148EDBDF0DC26F2A13D23134B5FCF2949C4470280D"
trust_period = "168h0m0s"

# Time to spend discovering snapshots before initiating a restore.
discovery_time = "1m0s"

# Temporary directory for state sync snapshot chunks, defaults to the OS tempdir (typically /tmp).
# Will create a new, randomly named directory within, and remove it when done.
temp_dir = ""

# The timeout duration before re-requesting a chunk, possibly from a different
# peer (default: 1 minute).
chunk_request_timeout = "30s"

# The number of concurrent chunk fetchers to run (default: 1).
chunk_fetchers = "2"

The error makes it appear like it's trying to fetch a block or block result at a specific height and timing out. I can curl and fetch that block on the order of milliseconds so it's confusing why the light client seems to not be able to do the same. Is there a config that controls the light client timeout?

What's also a bit confusing is that it's trying to fetch a height before the snapshot height. If the state sync is successful, it's earliest height will be the state sync height + 1 so what's the significance of fetching older blocks?

Another point of confusion is why the snapshot height requested (and available on the snapshot node) is not a height matched on the Discovered new snapshot lines?

Cosmos SDK Version

v0.50.7

How to reproduce?

No response

brennanjl commented 1 month ago

@scirner22 we have just hit this same issue. Did you find a resolution?

yihuang commented 1 month ago

try local state sync commands if you can download the snapshot yourself. simd snapshots --help

scirner22 commented 1 month ago

I did not find a resolution here. I gave up in favor of the snapshot export/import flow since the snapshot node is local to our infrastructure.

brennanjl commented 1 month ago

I think we ended up getting it figured out. Essentially, the timeout occurs because CometBFT takes too long to verify the block headers. This can happen in two cases: either when your trusted_height is too far below the snapshot height, or if your trusted_height is higher than the latest known snapshot height (in which case it re-verifies all of the headers from genesis).

Essentially:

Take for example a chain where the most recent snapshot height is at 1000000. If If the trusted height is at 1000001, then it will re-verify all headers from genesis, and time out while doing so. If it is set at 500000, it will reverify all headers from there until the snapshot, which also takes too long and times out.

Block 1000001  <- trusted height here forces re-verifications from block 0
Block 1000000  <- most recent snapshot
Block 999999
...
Block 500000  <- trusted height here forces re-verifying too many block headers, causing timeout
...
Block 2
Block 1
Block 0

Once again, I am not 100% sure of this, but some basic testing shows this to probably be along the right lines.