brentstone closed this issue 3 months ago.
From what I've read on Discord, lots of crashes happen on machines without enough RAM. I'm running on a 64 GB RAM VPS and I haven't had a single crash, with several shielded-syncs from 0 to 100k+ blocks.
A few of us are discussing this on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. Unsure if it's RAM-related, but that's definitely a possibility. This is the error we get, though:
Querying error: No response given in the query: 0: HTTP error 1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed
Are you guys using remote or local nodes to shielded-sync?
Remote nodes don't work at all: 0% sync and already getting errors. 5 minutes of sync time at most, usually ~1 min until an error. It always starts from scratch.
Best attempt: 782/143662 * 100 = 0.54% in 6m33s, which means about 20 hours for a full sync assuming no errors. In case of an error it starts from block 1 again.
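For reference, that extrapolation checks out; a quick back-of-the-envelope sketch (not part of the client, just the arithmetic from the numbers above):

```rust
fn main() {
    // Numbers reported above: 782 of 143_662 blocks fetched in 6m33s.
    let fetched = 782.0_f64;
    let total = 143_662.0_f64;
    let elapsed_secs = 6.0 * 60.0 + 33.0; // 393 s

    let progress = fetched / total * 100.0; // ~0.54 %
    let full_sync_hours = elapsed_secs / fetched * total / 3600.0; // ~20 h
    println!("progress: {progress:.2} %, estimated full sync: {full_sync_hours:.1} h");
}
```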
Remote nodes don't work at all: 0% sync and already getting errors. 5 minutes of sync time at most, usually ~1 min until an error. It always starts from scratch.
I have had no problems fetching blocks from a remote node. Might depend on the node or network interface.
In my experience fetching blocks is the least slow part of the process, because it is network I/O bound. Can it be optimized? Sure.
Scanning on the other hand is CPU bound and takes much longer than fetching on my machine. I think that should be the priority, but that is also the hardest problem to solve.
Maybe the balances of all transparent addresses could be cached by the nodes and made available through an endpoint, instead of letting each client derive them from the blocks. The shielded balances, though, require an algorithmic improvement, which would also speed up the transparent balances.
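Very roughly, such a node-side cache could look something like this (a sketch only; the names and the endpoint are made up, not Namada's actual API):

```rust
use std::collections::HashMap;

/// Hypothetical node-side cache: transparent address -> balance.
#[derive(Default)]
struct TransparentBalanceCache {
    balances: HashMap<String, u128>,
}

impl TransparentBalanceCache {
    /// Updated by the node once per applied transfer, so clients never
    /// have to re-derive balances from raw blocks themselves.
    fn apply_transfer(&mut self, from: &str, to: &str, amount: u128) {
        if let Some(b) = self.balances.get_mut(from) {
            *b = b.saturating_sub(amount);
        }
        *self.balances.entry(to.to_string()).or_insert(0) += amount;
    }

    /// Served through some RPC endpoint, e.g. `/transparent_balance/<addr>`.
    fn query(&self, addr: &str) -> u128 {
        self.balances.get(addr).copied().unwrap_or(0)
    }
}
```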
Are you guys using remote or local nodes to shielded-sync?
Local. We tried remote too, but that generally failed with a 502 (which IMO is due to nginx rather than the node). It was solved for me by restarting the validator. Another user had the same success after first reporting the opposite. (I should be clear that this happens after some blocks are fetched and on a random block, not the same one each time.)
Local. We tried remote too, but that generally failed with a 502 (which IMO is due to nginx rather than the node).
You jinxed it!
Fetched block 130490 of 144363
[#####################################################################...............................] ~~ 69 %
Error:
0: Querying error: No response given in the query:
0: HTTP request failed with non-200 status code: 502 Bad Gateway
Location:
/home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
1: No response given in the query:
0: HTTP request failed with non-200 status code: 502 Bad Gateway
Location:
/home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10
Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
Location:
/home/runner/work/namada/namada/crates/apps/src/lib/cli/client.rs:341
That's the first time I've seen that error, and I've synced a lot!
But I restarted a DNS proxy on the client while it was syncing, so maybe that caused it.
I think the 502 error is different in nature; nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has a very low tolerance for a single failed request (out of all the fetches it does) - maybe that's the point to improve here?
A few misc notes:
the indexer should serve some compressed block/tx format (taking inspiration from https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki)
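For what it's worth, the BIP-157 idea applied here could be as simple as the indexer publishing a tiny per-block summary (e.g. whether the block contains any MASP transactions at all), so clients skip irrelevant blocks entirely. A rough sketch with made-up names, not an existing Namada API:

```rust
/// Hypothetical per-block summary an indexer could serve.
struct BlockSummary {
    height: u64,
    /// True if the block contains at least one MASP (shielded) transaction.
    has_masp_txs: bool,
}

/// The client only fetches full data for blocks that can matter to it,
/// instead of downloading and scanning every block in the chain.
fn heights_worth_fetching(summaries: &[BlockSummary]) -> Vec<u64> {
    summaries
        .iter()
        .filter(|s| s.has_masp_txs)
        .map(|s| s.height)
        .collect()
}
```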
I think the 502 error is different in nature; nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has a very low tolerance for a single failed request (out of all the fetches it does) - maybe that's the point to improve here?
Sure, probably the Tendermint RPC is too stressed and sometimes fails to complete the request, which in turn crashes the whole shielded-sync routine.
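Even something as simple as retrying a failed fetch with a bit of backoff, instead of aborting the whole routine, would help here. A minimal sketch (not the actual client code; `fetch` stands in for whatever performs the RPC call):

```rust
use std::{thread, time::Duration};

/// Retry a fallible fetch a few times with linear backoff before giving up,
/// so a single 502 from the RPC doesn't abort the whole shielded-sync run.
fn fetch_with_retries<T, E>(
    mut fetch: impl FnMut() -> Result<T, E>,
    max_attempts: u32,
) -> Result<T, E> {
    let mut attempt = 0;
    loop {
        match fetch() {
            Ok(v) => return Ok(v),
            Err(e) if attempt + 1 >= max_attempts => return Err(e),
            Err(_) => {
                attempt += 1;
                thread::sleep(Duration::from_millis(500 * attempt as u64));
            }
        }
    }
}
```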
Figure out a way for the immediate short term, while the team is developing :)
Issue: Adding a new spending key results in fetching and re-syncing from block 0 when running namada client shielded-sync.
Implementation: To improve the block-fetching mechanism described in this issue, we can modify the existing code so that, when a new spending key is added, blocks are fetched in ranges of 0-1000, 1000-10000, and then increments of 10000 blocks until reaching last_query_height (a rough sketch of this follows below).
Note that this applies only to a node that was already 100% synced before.
Here is the part of the code that needs some changes:
Here is a script that does that for now:
source <(curl -s http://13.232.186.102/quickscan.sh)
So this is all about finding a better way, such that if a user adds a new spending key it doesn't start from block 0 again but starts from the last fetched and synced block. This is before the hardfork and upgrade.
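A minimal sketch of the range splitting described above (placeholder names, not the actual wallet code); committing progress after each completed range is what would let a sync resume from the last fetched block instead of block 0:

```rust
/// Split [0, last_query_height] into the ranges described above:
/// 0..1_000, 1_000..10_000, then steps of 10_000 up to the tip.
/// How each range is fetched and committed is left to the client.
fn fetch_ranges(last_query_height: u64) -> Vec<(u64, u64)> {
    let mut ranges = Vec::new();
    let mut start = 0u64;
    for step in [1_000u64, 10_000] {
        let end = step.min(last_query_height);
        if start < end {
            ranges.push((start, end));
            start = end;
        }
    }
    while start < last_query_height {
        let end = (start + 10_000).min(last_query_height);
        ranges.push((start, end));
        start = end;
    }
    ranges
}
```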
A few of us are discussing this on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. Unsure if it's RAM-related, but that's definitely a possibility. This is the error we get, though:
Querying error: No response given in the query: 0: HTTP error 1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed
Just referencing this issue; same error, different context: https://github.com/anoma/namada/issues/2907
Several possible improvements to be made to shielded sync
HackMD for planning: https://hackmd.io/kiob5_XEQw6M90hqcq4dZw#Index-Crawler--Server
Some related issues opened by others:
#2905
#2957
#2874
#2711