Zilliqa / zq2

Zilliqa 2.0 code base
Apache License 2.0

Sharding strategy #178

Open theo-zil opened 1 year ago

theo-zil commented 1 year ago

The agreed sharding strategy is as follows:

There are three outstanding questions here:

  1. What is the communication method between the two clients? Run everything in the same node software, with intra-process communication? Or expose the necessary functionality as RPC APIs, which would allow them to run as separate processes? Note that the communication should be low latency, but at the same time if the processes are separate I don't think it should be assumed that they run on the same machine.
     1.a) Are timing issues a problem? (E.g. Node 1 from shard S observes a finalized block h' in shard S', acts on that information, and includes transaction a in block A at height h. Node 2's light client has some network delays, and when Node 2's validator observes proposal A, it has not yet confirmed the finalisation of h' in S' therefore it cannot validate transaction a and rejects the proposal. Is that fine to just let happen as-is?)
  2. What is the message format like? Do we just allow any arbitrary data to be included, and transactions on the bridged-to shard can then access it and parse it as desired (e.g. allow transactions to stay pending in the mempool for some period of time, until some confirmation from another shard is received)? Or do we specify a protocol that e.g. requires encoding a specific smart contract call on the receiving shard? Existing standards may be worth evaluating here, e.g. XCM.

At a high level, the following then needs to be implemented to enable at least PoA sharding:

Timeline: I estimate that at least the subscription mechanism should be completed in the near future, so we can start using an actual sharded architecture. On the face of it, my very rough estimate is that all four steps should be doable at least in a PoC shape before testnet in September, assuming we don't discover that we require significant additional complexity.

theo-zil commented 1 year ago

Upon further discussion with @DrZoltanFazekas, the shared PoS security model requiring shards to get validated by a majority of the network actually makes me question my assumptions as to how exactly shards should be run. Specifically pertaining to question 1 above, "what is the communication method": if we assume separate nodes/clients, that's great for PoA, but even just having 100 PoS shards would become extremely cumbersome if validators need to run 100 different clients.

We'd need to consider a way to run as many nodes as possible concurrently and efficiently, to allow validators to participate in as many shards as possible. That probably means running them in-process, for starters, and may need even more thought to make it feasible.

This sounds like it'll likely add significant complexity over the PoA-only shards.

DrZoltanFazekas commented 1 year ago

Great summary above @theo-zil

DrZoltanFazekas commented 1 year ago

I have a comment on

Connecting to the Main Shard itself may or may not be optional

I assume connecting to the main shard will be mandatory in order to retrieve the validator sets of the bridged-from shards, which is necessary to check whether the block headers of those shards are valid (i.e. co-signed by 2/3 of those shards' validators).
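
A minimal sketch of that header check, under assumptions: `has_quorum` is a hypothetical helper, signature verification is assumed to have already reduced a header's co-signatures to a set of validator identifiers, and "co-signed by 2/3" is read as "at least 2/3". None of this reflects actual zq2 types.

```python
from fractions import Fraction


def has_quorum(validator_set: set[str], signers: set[str],
               threshold: Fraction = Fraction(2, 3)) -> bool:
    """Return True if enough known validators co-signed a header.

    Hypothetical helper: `validator_set` is the set retrieved from the main
    shard, `signers` the identities whose signatures already verified.
    """
    if not validator_set:
        return False
    # Signers that are not in the registered validator set do not count.
    valid_signers = signers & validator_set
    # Read "2/3" as "at least 2/3"; a stricter "more than 2/3" rule
    # would use > instead of >=.
    return Fraction(len(valid_signers), len(validator_set)) >= threshold
```

Without the validator set from the main shard there is nothing to intersect `signers` with, which is why the connection would be mandatory in this reading.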

theo-zil commented 1 year ago

That's a great point, edited my summary.

DrZoltanFazekas commented 1 year ago

1.a is a tradeoff between latency of cross-shard messages and liveness of the shard's consensus, as mentioned in the RFC. By introducing a delay (i.e. the block proposer on shard S includes a block h' from shard S' that was finalized a few slots ago instead of the "newest" block h'' that was finalized in the last slot of shard S') we can give all validators more time to get notified about block h' and vote on the proposal referring to it on shard S.
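
A rough sketch of that delay, assuming heights stand in for slots and using hypothetical helper names (`pick_reference_height`, `can_validate`) that are not part of any existing API:

```python
def pick_reference_height(latest_finalized: int, delay_slots: int) -> int:
    """Height on shard S' that a proposer on shard S should reference.

    Hypothetical helper: instead of the newest finalized block, reference
    one that is `delay_slots` older, so slower validators have seen it too.
    """
    if delay_slots < 0:
        raise ValueError("delay_slots must be non-negative")
    return max(0, latest_finalized - delay_slots)


def can_validate(referenced_height: int, locally_finalized: int) -> bool:
    """A validator can check a proposal only if its own view of S' has
    already finalized the referenced block."""
    return referenced_height <= locally_finalized
```

For example, if the proposer has seen height 100 finalized but a validator has only seen 98, a delay of 0 makes the proposal unverifiable for that validator, while a delay of 3 (referencing height 97) lets it vote. A larger delay improves liveness at the cost of cross-shard message latency.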

DrZoltanFazekas commented 1 year ago

If validators do not run a full node of the bridged-over x-shard but only a light client which connects to an untrusted full node, the notifications about the subscribed transactions must carry Merkle proofs of those transactions. This is part of the light client protocols because light clients only know the block headers, which contain a Merkle root, and must verify that the transactions the full node notifies them about are indeed included in a block, based on the Merkle root in that block and the Merkle proof received along with the transaction.
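
To illustrate the verification step, here is a sketch using a plain binary SHA-256 Merkle tree. This is an assumption for illustration only: the actual tree or trie layout used by the shards is not specified in this thread, and all function names are hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves: list[bytes]) -> bytes:
    """Root of a simple binary Merkle tree (an odd last node is duplicated)."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def merkle_proof(leaves: list[bytes], index: int) -> list[tuple[bytes, bool]]:
    """Full-node side: sibling hashes from leaf to root.

    The bool in each pair means "the sibling sits on the left".
    """
    level = [sha256(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof


def verify_proof(tx: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Light-client side: recompute the root from the tx and its proof."""
    node = sha256(tx)
    for sibling, sibling_is_left in proof:
        node = sha256(sibling + node) if sibling_is_left else sha256(node + sibling)
    return node == root
```

The light client only needs `verify_proof` and the root from a header it has already accepted; `merkle_root` and `merkle_proof` run on the full node that sends the notification.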

theo-zil commented 1 year ago

Indeed that's true, the light client won't trust the full nodes. Your x-shard validator will trust the light client so your x-shard can still eschew all proofs, you just gotta make sure the light client verifies them.

Note that another option is running a light node, which effectively follows consensus, and simply doesn't participate in voting and does not store any history except block headers. But it will directly observe the votes for all new blocks, will hold a mempool of transactions, and will observe those transactions getting included into blocks, so it doesn't need separate proofs. In our case that will be quite convenient since x-shards only care about transactions in the latest block, so a node that can effectively receive all new blocks from the p2p network and notify the validator of new transactions - but without using any storage or needing any stake/authority - will work perfectly fine. The advantage is that it avoids having to configure the "light client" to connect to specific fullnodes and trust them, as the light node can observe the consensus happening directly, so transactions cannot be censored from it by any individual node.
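
A sketch of what such a light node could look like, assuming blocks arrive already checked against consensus votes. `Block`, `LightNode`, the `"to"` field and the `notify` callback are all hypothetical names for illustration, not existing zq2 types.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Block:
    height: int
    header_hash: str
    transactions: list[dict]  # each tx is a dict with at least a "to" key


@dataclass
class LightNode:
    """Hypothetical light node: keeps headers only, forwards matching txs."""
    subscribed_addresses: set[str]
    notify: Callable[[int, dict], None]
    headers: dict[int, str] = field(default_factory=dict)

    def on_finalized_block(self, block: Block) -> None:
        # Assumes the caller already verified the votes for this block.
        # Only the header is stored; the body is not retained afterwards.
        self.headers[block.height] = block.header_hash
        for tx in block.transactions:
            if tx.get("to") in self.subscribed_addresses:
                self.notify(block.height, tx)
```

The validator would supply `notify` and receive only the transactions it subscribed to, with no proofs attached, since the light node saw the block being finalized itself.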

DrZoltanFazekas commented 1 year ago

Ok, in my definition light client = lightweight node. It only keeps track of block headers but not the transactions in the blocks (otherwise it would be a full node and not a light node). It needs to request the transactions (via subscriptions) from either a trusted full node without Merkle proofs or a trustless full node with Merkle proofs.

I'm not sure whether a lightweight node maintaining the transaction pool of every bridged-over x-shard is light enough, or whether it's actually already a full node, since it has not only the block headers but the transactions too. Btw, a full node is not necessarily a validator, i.e. it does not have to participate in the consensus.

rrw-zilliqa commented 1 year ago

Bear in mind, FWIW, that not every X-Shard will be trustable, so just because your X-Shard says "transfer USDC$1m to Fred" doesn't mean you should do it. A corollary is that bridges between X-Shards are untrusted.

I don't anticipate PoS shared-security XShards being of much use. I would think most XShards will be secured by their own independent staking or PoA (I'll write an RFC on this), so I don't think many validators will be handling very many XShards at the same time. That said, validators will tend to want to do this up to their performance limit so as to claim the rewards for doing so.

theo-zil commented 1 year ago

Would be great to have some clarity on this because the way I see it, whether we want shared PoS or not might significantly affect the complexity required. If we go for independent security, and eschew shared PoS, then my list of tasks above should lead to a working deployable PoC and feels very achievable in maybe a couple of months or so.

Way I see it, which security model we use will depend more or less entirely on the business cases for our chain, so @rrw-zilliqa I assume you'd have the most visibility on what we really need here. Looking forward to reading your thoughts in the RFC.

rrw-zilliqa commented 1 year ago

Yes - sorry; you are quite right. Annoyingly, I am schmoozing for most of the rest of the day, but will write up ASAP.

The take-away will be:

So basically, the rules about validation for XShards are programmable. You could implement full security staking as @DrZoltanFazekas suggests, or PoA, or base the shard's security on how many clown costumes each validator owns...
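
As a rough illustration of "programmable" validation rules, here is a sketch in which each XShard plugs in its own acceptance function. The `ValidationRule` signature and both example rules are assumptions made for this sketch, not a proposed design.

```python
from typing import Callable

# A rule receives the set of validators that signed a block and decides
# whether the block is acceptable. Hypothetical signature.
ValidationRule = Callable[[set[str]], bool]


def poa_rule(authorities: set[str], required: int) -> ValidationRule:
    """Accept when at least `required` of the fixed authorities signed."""
    return lambda signers: len(signers & authorities) >= required


def stake_rule(stakes: dict[str, int], num: int = 2, den: int = 3) -> ValidationRule:
    """Accept when signers hold at least num/den of the total stake."""
    total = sum(stakes.values())

    def rule(signers: set[str]) -> bool:
        signed = sum(stakes.get(s, 0) for s in signers)
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        return total > 0 and signed * den >= num * total

    return rule
```

Any other policy with the same signature could be swapped in per shard.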

rrw-zilliqa commented 1 year ago

(whilst killing a Wendigo, obv)