anoma / namada

Rust implementation of Namada, a Proof-of-Stake L1 for interchain asset-agnostic privacy
https://namada.net
GNU General Public License v3.0

Shielded sync improvement #2900

Closed · brentstone closed this 3 months ago

brentstone commented 8 months ago

Several possible improvements to be made to shielded sync

HackMD for planning: https://hackmd.io/kiob5_XEQw6M90hqcq4dZw#Index-Crawler--Server

Some related issues opened by others:

phy-chain commented 8 months ago

From what I've read on Discord, lots of crashes happen on machines without enough RAM. I'm running on a 64 GB RAM VPS and haven't had a single crash, even with several shielded-sync runs from 0 to 100k+ blocks.

opsecx commented 8 months ago

We're discussing this amongst some of us on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. Unsure if it's RAM-related, but that's definitely a possibility. This is the error we get, though:

Querying error: No response given in the query: 0: HTTP error 1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed

Fraccaman commented 8 months ago

are you guys using remote or local nodes to shielded-sync?

thousandsofthem commented 8 months ago

Remote nodes don't work at all: 0% sync and already getting errors. Five minutes of sync time at most, usually ~1 min until an error, and it always starts from scratch.

Best attempt: 782/143662 × 100 = 0.54% in 6m33s, which means roughly 20 hours for a full sync assuming no errors. In case of an error it starts again from block 1.
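(For reference, the 20-hour figure is simply a linear extrapolation of that observed rate, taking 6m33s = 393 s:)

$$
t_{\text{full}} \approx \frac{143662}{782} \times 393\ \text{s} \approx 7.2 \times 10^{4}\ \text{s} \approx 20\ \text{h}
$$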

Rigorously commented 8 months ago

> Remote nodes don't work at all: 0% sync and already getting errors. Five minutes of sync time at most, usually ~1 min until an error, and it always starts from scratch.

I have had no problems fetching blocks from a remote node. Might depend on the node or network interface.

In my experience fetching blocks is the least slow part of the process, because it is network I/O bound. Can it be optimized? Sure.

Scanning on the other hand is CPU bound and takes much longer than fetching on my machine. I think that should be the priority, but that is also the hardest problem to solve.

Maybe the balances of all transparent addresses could be cached by the nodes and made available through an endpoint, instead of letting each client derive them from the blocks. The shielded balances, though, require an algorithmic improvement, which would also speed up the transparent balances.
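Since the scanning step mentioned above is CPU-bound, one low-hanging mitigation (independent of any protocol change) is to spread trial decryption across cores. A minimal sketch, where `ViewingKey`, `NoteCiphertext` and `try_decrypt` are hypothetical stand-ins for the real MASP scanning types, not Namada's actual API:

```rust
// Hypothetical sketch, not Namada's actual API: parallelising the CPU-bound
// trial-decryption step across cores with rayon.
use rayon::prelude::*;

struct ViewingKey([u8; 32]);
struct NoteCiphertext(Vec<u8>);
struct DecryptedNote {
    value: u64,
}

/// Stand-in for the real trial-decryption routine; returns Some only when the
/// ciphertext was addressed to the given viewing key.
fn try_decrypt(_vk: &ViewingKey, _ct: &NoteCiphertext) -> Option<DecryptedNote> {
    None
}

/// Scan a batch of fetched note ciphertexts in parallel instead of one-by-one.
fn scan_notes(vk: &ViewingKey, notes: &[NoteCiphertext]) -> Vec<DecryptedNote> {
    notes
        .par_iter() // rayon splits the work across all available cores
        .filter_map(|ct| try_decrypt(vk, ct))
        .collect()
}
```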

opsecx commented 8 months ago

> are you guys using remote or local nodes to shielded-sync?

Local. We tried remote too, but that generally failed with a 502 (which, IMO, is due to nginx rather than the node). It was solved for me by restarting the validator. Another user had the same success after first reporting the opposite. (I should be clear that this happens after some blocks are fetched and on a random block, not the same one each time.)

Rigorously commented 8 months ago

> Local. We tried remote too, but that generally failed with a 502 (which, IMO, is due to nginx rather than the node).

You jinxed it!

```
Fetched block 130490 of 144363
[#####################################################################...............................] ~~ 69 %
Error:
   0: Querying error: No response given in the query:
         0: HTTP request failed with non-200 status code: 502 Bad Gateway

      Location:
         /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10

      Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
      Run with RUST_BACKTRACE=full to include source snippets.
   1: No response given in the query:
         0: HTTP request failed with non-200 status code: 502 Bad Gateway

      Location:
         /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/flex-error-0.4.4/src/tracer_impl/eyre.rs:10

      Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
      Run with RUST_BACKTRACE=full to include source snippets.

Location:
   /home/runner/work/namada/namada/crates/apps/src/lib/cli/client.rs:341
```

That is the first time I've seen that error, and I have synced a lot!

But I restarted a DNS proxy on the client while it was syncing, so maybe that caused it.

opsecx commented 8 months ago

I think the 502 error is not the same in nature; nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has a very low tolerance for a single failed request (out of all the fetches it does). Maybe that's the point to improve here?

cwgoes commented 8 months ago

A few misc notes:

Fraccaman commented 8 months ago

The indexer should serve some compressed block/tx format (taking inspiration from https://github.com/bitcoin/bips/blob/master/bip-0157.mediawiki).
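For a rough picture of what that could look like, here is an illustrative payload shape (hypothetical names, not an existing Namada or indexer API): the server strips each block down to its MASP-relevant pieces, so clients can skip empty blocks entirely, analogous to how BIP-157 filters let light clients skip irrelevant blocks:

```rust
// Illustrative only, not an existing Namada or indexer API: the shape of a
// "compressed block" payload an indexer could serve. Most blocks carry no
// shielded transactions, so most responses would be nearly empty.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
struct CompactBlock {
    height: u64,
    /// Only the MASP-relevant transactions of this block.
    shielded_txs: Vec<CompactTx>,
}

#[derive(Serialize, Deserialize)]
struct CompactTx {
    /// Position within the block, so the client can keep its note
    /// commitment tree in the right order.
    index: u32,
    /// The bytes the client actually needs for trial decryption.
    masp_bytes: Vec<u8>,
}
```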

Fraccaman commented 8 months ago

> I think the 502 error is not the same in nature; nginx-proxied RPCs do that once in a while on other calls too. But it does look like shielded-sync has a very low tolerance for a single failed request (out of all the fetches it does). Maybe that's the point to improve here?

Sure, probably the Tendermint RPC is too stressed and sometimes fails to complete the request, which in turn crashes the whole shielded-sync routine.
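If the RPC is simply overloaded, one possible client-side mitigation is to retry a failed block fetch with backoff rather than aborting the whole run. A minimal sketch, where `fetch_block` is a hypothetical stand-in rather than the actual Namada fetch code, assuming a tokio runtime:

```rust
// Sketch of a retry-with-backoff wrapper that would keep one transient RPC
// failure (a 502 from nginx, a dropped connection) from aborting the whole
// shielded-sync run.
use std::time::Duration;

type FetchError = Box<dyn std::error::Error + Send + Sync>;

async fn fetch_block(_height: u64) -> Result<Vec<u8>, FetchError> {
    unimplemented!("stand-in for the real RPC call")
}

async fn fetch_block_with_retry(
    height: u64,
    max_attempts: u32,
) -> Result<Vec<u8>, FetchError> {
    let mut delay = Duration::from_millis(500);
    for attempt in 1..=max_attempts {
        match fetch_block(height).await {
            Ok(block) => return Ok(block),
            // Transient failure: back off and retry the same height
            // instead of crashing the whole sync.
            Err(_) if attempt < max_attempts => {
                tokio::time::sleep(delay).await;
                delay *= 2;
            }
            Err(err) => return Err(err),
        }
    }
    unreachable!("the loop always returns on the final attempt")
}
```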

chimmykk commented 8 months ago

Figuring out a way for the immediate short term, while the team is developing :)

Issue: adding a new spending key results in fetching and re-syncing from block 0 when running namada client shielded-sync.

Implementation: to improve the block-fetching mechanism described in the linked GitHub issue, we can modify the existing code so that, when a new spending key is added, blocks are fetched in ranges of 0-1000, then 1000-10000, and then in increments of 10000 blocks until reaching last_query_height (a rough sketch follows at the end of this comment).

Note: this applies only to a node that has already synced 100% before.

Here is the part of the code that needs some changes:

https://github.com/anoma/namada/blob/871ab4bd388d43a186a46a595ebb4064e2175b08/crates/apps/src/lib/client/masp.rs#L38

Here is a script that does that for now:

source <(curl -s http://13.232.186.102/quickscan.sh)

So this is all about producing a better approach: if a user adds a new spending key, it doesn't start from 0 again but starts from the last block fetched and synced. This is before the hardfork and upgrade.
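A minimal sketch of the chunked-fetch-plus-checkpoint idea described above. All names here are hypothetical; this is not the code in crates/apps/src/lib/client/masp.rs linked earlier:

```rust
// Rough sketch: fetch in fixed-size ranges and persist the last completed
// height, so a restarted sync (e.g. after adding a new spending key) resumes
// from the checkpoint instead of block 0.
use std::fs;

const CHUNK: u64 = 10_000;
const CHECKPOINT_FILE: &str = "shielded-sync.checkpoint";

/// Last fully fetched height, or 0 if no checkpoint has been written yet.
fn load_checkpoint() -> u64 {
    fs::read_to_string(CHECKPOINT_FILE)
        .ok()
        .and_then(|s| s.trim().parse().ok())
        .unwrap_or(0)
}

fn save_checkpoint(height: u64) -> std::io::Result<()> {
    fs::write(CHECKPOINT_FILE, height.to_string())
}

/// Stand-in for fetching and caching blocks in the range [from, to].
fn fetch_range(_from: u64, _to: u64) -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}

/// Fetch in chunks, persisting progress after each one, so errors or restarts
/// only cost the current chunk rather than the whole history.
fn sync_to(last_query_height: u64) -> Result<(), Box<dyn std::error::Error>> {
    let mut from = load_checkpoint();
    while from < last_query_height {
        let to = (from + CHUNK).min(last_query_height);
        fetch_range(from + 1, to)?;
        save_checkpoint(to)?;
        from = to;
    }
    Ok(())
}
```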

opsecx commented 8 months ago

> We're discussing this amongst some of us on Discord now. For me, restarting the validator seemed to do the trick; for others it did not. Unsure if it's RAM-related, but that's definitely a possibility. This is the error we get, though:
>
> Querying error: No response given in the query: 0: HTTP error 1: error sending request for url (http://127.0.0.1:26657/): connection closed before message completed

Just referencing this issue; same error, different context: https://github.com/anoma/namada/issues/2907