Closed nazar-pc closed 10 months ago
@shamil-gadelshin any updates here?
I had the sync issue again on one node with Nov-17. I ran debug again, and it synced. Full log until sync is at https://pastebin.com/zuMUEWb0 (will self-destruct in 4 weeks). To me it seems it tries the same two non-responding nodes over and over.
According to the logs, you have synced successfully, so this is not the same issue.
I think https://github.com/subspace/subspace/issues/2237 is related to this one too. It appears that not having (enough?) connections prevents it from making any progress at all, even though I don't think we have an infinite number of retries (or maybe we do, due to a bug somewhere?).
Users in Discord report that the latest release, which fixed QUIC address translation, now prints errors during piece retrieval. This further indicates that networking issues result in requests being stuck: in this case I guess piece retrieval specifically, but probably others too.
I'm concluding that while unfortunate, this works as expected.
Essentially with both node sync from DSN and piece cache sync the following happens:
So all of the above means that progress is being made, but extremely slowly. Without backpressure in libp2p we can't really push it further or add an external timeout, because queries/requests will continue progressing internally.
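To illustrate why an external timeout alone does not help, here is a minimal sketch using only the Rust standard library. It is not libp2p or Subspace code: `request_with_external_timeout` and `Outcome` are hypothetical names, and a sleeping thread stands in for an in-flight query. The caller stops waiting at its deadline, but the work it started still runs to completion in the background.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Result of waiting on a request with a caller-side deadline.
pub struct Outcome {
    /// True if the caller gave up before a response arrived.
    pub timed_out: bool,
    /// Set by the worker when it finishes, whether or not anyone still waits.
    pub worker_finished: Arc<AtomicBool>,
    /// Handle to the background work, so a caller can observe it completing.
    pub worker: thread::JoinHandle<()>,
}

/// Hypothetical stand-in for a networking query that needs `work` to complete,
/// awaited by a caller that is only willing to wait for `deadline`.
pub fn request_with_external_timeout(work: Duration, deadline: Duration) -> Outcome {
    let (tx, rx) = mpsc::channel::<u32>();
    let finished = Arc::new(AtomicBool::new(false));
    let finished_in_worker = Arc::clone(&finished);

    // The "internal" query: nothing here knows about the caller's deadline.
    let worker = thread::spawn(move || {
        thread::sleep(work);
        finished_in_worker.store(true, Ordering::SeqCst);
        // The receiver may already be gone; the work was done regardless.
        let _ = tx.send(42);
    });

    // The "external" timeout: it only bounds how long the caller waits.
    let timed_out = rx.recv_timeout(deadline).is_err();

    Outcome {
        timed_out,
        worker_finished: finished,
        worker,
    }
}
```

Calling this with a deadline shorter than the work reports a timeout, yet joining `worker` afterwards shows `worker_finished` set to true. Stopping the inner work would need cooperation from the layer that owns it, which is the role backpressure or cancellation would play.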
I'm closing this for now, but it doesn't mean we will not be making improvements that will indirectly help here.
Turns out sometimes DSN sync hangs and stops making any progress:
To me this looks like a deadlock somewhere in the networking stack that prevents sync from making progress.
The next step after "Downloading last segment headers" is to get closest peers to a random key and send a `SegmentHeaderRequest::LastSegmentHeaders` request. Not sure if `get_closest_peers` doesn't resolve or `send_generic_request` hangs, but we do not get a response to the request in the end.
Forum thread: https://forum.subspace.network/t/gemini-3g-nodes-not-syncing-on-linux/2090
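One way to tell which of the two steps stalls is to bound each step separately and report the step that ran out of time. The sketch below is a minimal, dependency-free illustration and not the actual Subspace code: the real calls are async and their signatures are not shown in this thread, so the `find_peers` and `request` closures, `SyncStep`, `run_with_limit` and `last_segment_headers` are all hypothetical stand-ins, with peers as `String` and headers as `u64` purely for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Which step of the flow failed to produce a result in time.
#[derive(Debug, PartialEq)]
pub enum SyncStep {
    GetClosestPeers,
    SendRequest,
}

/// Runs `op` on a helper thread and waits at most `limit` for its result.
/// On timeout the helper thread is left running; only the wait is bounded.
fn run_with_limit<T, F>(op: F, limit: Duration) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may have given up already; ignore the send error.
        let _ = tx.send(op());
    });
    rx.recv_timeout(limit).ok()
}

/// Hypothetical two-step flow: find peers first, then ask them for headers.
/// Returns the headers, or the step that did not finish within `limit`.
pub fn last_segment_headers<P, R>(
    find_peers: P,
    request: R,
    limit: Duration,
) -> Result<Vec<u64>, SyncStep>
where
    P: FnOnce() -> Vec<String> + Send + 'static,
    R: FnOnce(Vec<String>) -> Vec<u64> + Send + 'static,
{
    let peers = run_with_limit(find_peers, limit).ok_or(SyncStep::GetClosestPeers)?;
    run_with_limit(move || request(peers), limit).ok_or(SyncStep::SendRequest)
}
```

Logging the returned `SyncStep` on error would show whether the peer lookup or the request itself is the step that never completes.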