Sharding strategy

theo-zil commented 1 year ago

The agreed sharding strategy is as follows:

Each shard runs its own consensus, building its own separate blockchain.
To communicate/bridge, a validator for a given shard connects to a trusted client for another shard.
- This could be a light client, if you aren't a validator for the other shard. If you are, it could also transparently be a full node.
- The client is assumed trusted, because you run it yourself.
The validator notifies the client of a subscription. When the bridged-from shard has any transactions that match the subscription included in a finalised block, the client notifies the bridged-to validator, who can then include that information in its own next block.
The topology would primarily be ad-hoc. The ZQ2 main shard can serve as the hub, and provide message relaying functionality between any two shards on the network; but any shards which wish to communicate more closely together, and with lower latency, may bridge directly. (One could imagine perhaps other, third-party "hub shards" for specialised shard star sub-networks, e.g. to take load off the Main Shard.)
A shard's bridges would be specified in the shard's definition contract and would form part of the requirements to run a validator for the shard.
Connecting to the Main Shard is likely to be mandatory, at least to e.g. retrieve the validator set of any shards you're bridging to, even though in terms of message passing, one could imagine a shard not caring about connecting to the hub.
The security model will be a hybrid of PoA and shared PoS. A shard in PoA mode will consist of trusted validators with no stake. A shard in PoS mode will need to make use of shared security, so to be secure it will need to be validated by at least 3/4 of the total network stake.
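To make the subscription flow above concrete, here is a minimal sketch of it. All names (`ShardClient`, `Validator`, `subscribe`, `on_block`) are hypothetical and not taken from the ZQ2 codebase, and a subscription filter is assumed to be a plain predicate over a transaction.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Transaction:
    sender: str
    recipient: str
    payload: bytes


@dataclass
class Block:
    height: int
    transactions: List[Transaction]
    finalised: bool = False


# Assumption: a subscription filter is any predicate over a transaction.
Filter = Callable[[Transaction], bool]
Callback = Callable[[int, List[Transaction]], None]


class ShardClient:
    """Trusted client (light client or full node) for the bridged-from shard."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[Filter, Callback]] = []

    def subscribe(self, matches: Filter, notify: Callback) -> None:
        self._subscriptions.append((matches, notify))

    def on_block(self, block: Block) -> None:
        # Only finalised blocks trigger notifications.
        if not block.finalised:
            return
        for matches, notify in self._subscriptions:
            hits = [tx for tx in block.transactions if matches(tx)]
            if hits:
                notify(block.height, hits)


class Validator:
    """Validator on the bridged-to shard."""

    def __init__(self) -> None:
        self.pending: List[Tuple[int, Transaction]] = []

    def on_notification(self, height: int, txs: List[Transaction]) -> None:
        self.pending.extend((height, tx) for tx in txs)

    def build_next_block(self) -> List[Tuple[int, Transaction]]:
        # Include everything received since the previous block, then reset.
        included, self.pending = self.pending, []
        return included
```

A validator would call `client.subscribe(lambda tx: tx.recipient == "bridge", validator.on_notification)` once, and then drain `build_next_block()` whenever it proposes.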

There are three outstanding questions here:

1. What is the communication method between the two clients? Run everything in the same node software, with intra-process communication? Or expose the necessary functionality as RPC APIs, which would allow them to run as separate processes? Note that the communication should be low latency, but at the same time if the processes are separate I don't think it should be assumed that they run on the same machine.
1.a) Are timing issues a problem? (E.g. Node 1 from shard S observes a finalized block h' in shard S', acts on that information, and includes transaction a in block A at height h. Node 2's light client has some network delays, and when Node 2's validator observes proposal A, it has not yet confirmed the finalisation of h' in S' therefore it cannot validate transaction a and rejects the proposal. Is that fine to just let happen as-is?)
2. What is the message format like? Do we just allow any arbitrary data to be included, and transactions on the bridged-to shard can then access it and parse it as desired (e.g. allow transactions to stay pending in the mempool for some period of time, until some confirmation from another shard is received)? Or do we specify a protocol that e.g. requires encoding a specific smart contract call on the receiving shard? Existing standards may be worth evaluating here, e.g. XCM.

At a high level, the following then needs to be implemented to enable at least PoA sharding:

[ ] Subscription mechanism: Nodes should accept subscription requests, and notify subscribers of any transactions or messages that match the subscription's filters.
[ ] Subscription ingestion: Nodes should be able to act on responses to subscriptions, allowing transactions to get included in blocks based on the information received. Once this and the subscription mechanism are both implemented, this will allow an MVP sharded network to run, if you are running your own validators for every shard.
[ ] Light client: We should implement a light version of the node, which does not participate in consensus voting but does follow along the chain tip and validate new blocks. It will probably need to download all transactions too to check against its subscriptions. (Does it need transaction broadcasting functionality? I don't think so.) Once implemented, this will allow a distributed sharded network where you do not need to validate every single shard.
[ ] Shard definition contract: To make it into a coherent network, we should implement the smart contracts that can be used to construct child shards, with the initial parent-less shard being the Main Shard. This will also need to implement a selection mechanism for PoS shards (since scaling capacity is limited for these). Once implemented, this will allow specifying a full ZQ2 network with an organised hierarchy of shards.
[ ] Message forwarding: We should specify and implement (likely through a smart contract) functionality for message forwarding, to allow shards (such as the Main Shard) to act as a hub between two non-directly-bridged shards. More generally, this will enable any network topology between shards, rather than only direct p2p. Once implemented, this will allow network-wide cross-shard communication.
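As a rough illustration of the message forwarding item, finding a relay path between two shards that are not directly bridged could be a breadth-first search over the bridge topology. This is only a sketch: the `bridges` mapping and `find_route` are hypothetical names, and bridges are assumed to be directional (an entry lists the shards a given shard can send to).

```python
from collections import deque
from typing import Dict, List, Optional


def find_route(bridges: Dict[str, List[str]], source: str,
               destination: str) -> Optional[List[str]]:
    """Return a shortest chain of shards from source to destination, or None."""
    if source == destination:
        return [source]
    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        shard = queue.popleft()
        for neighbour in bridges.get(shard, ()):
            if neighbour in previous:
                continue  # already reached by a path at least as short
            previous[neighbour] = shard
            if neighbour == destination:
                # Walk the predecessor links back to the source.
                route = [neighbour]
                while previous[route[-1]] is not None:
                    route.append(previous[route[-1]])
                return route[::-1]
            queue.append(neighbour)
    return None


# Hypothetical topology: the Main Shard is a hub, and "a" and "b" also bridge directly.
bridges = {
    "main": ["a", "b", "c"],
    "a": ["main", "b"],
    "b": ["main", "a"],
    "c": ["main"],
}
print(find_route(bridges, "a", "c"))  # relayed through the hub
print(find_route(bridges, "a", "b"))  # direct bridge
```

With this topology, a message from "a" to "c" goes via "main", while "a" to "b" uses their direct bridge and never touches the hub.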

Timeline: I estimate that at least the subscription mechanism should be completed in the near future, so we can start using an actual sharded architecture. On the face of it, my very rough estimate is that all four steps should be doable at least in a PoC shape before testnet in September, assuming we don't discover that we require significant additional complexity.

theo-zil commented 1 year ago

Upon further discussion with @DrZoltanFazekas, the shared PoS security model requiring shards to get validated by a majority of the network actually makes me question my assumptions as to how exactly shards should be run. Specifically pertaining to question 1 above, "what is the communication method" - if we assume separate nodes/clients, that's great for PoA, but even just having 100 PoS shards would become extremely cumbersome then if validators need to run 100 different clients.

We'd need to try to consider a way to run as many nodes as possible concurrently and efficiently, to allow validators to participate in as many shards as possible. That probably means running them in-process, for starters, and may need even more thought put in to make feasible.

This sounds like it'll likely add significant complexity over the PoA-only shards.

DrZoltanFazekas commented 1 year ago

Great summary above @theo-zil

DrZoltanFazekas commented 1 year ago

I have a comment on

Connecting to the Main Shard itself may or may not be optional

I assume connecting to the main shard will be mandatory in order to retrieve the validator sets of the bridged-from shards which is necessary to check if the block headers of those shards are valid (i.e. co-signed by 2/3 of those shards' validators)
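A sketch of that header check, assuming the validator set has already been fetched from the main shard and that signature verification happens elsewhere. `header_has_quorum` is a hypothetical helper. It takes "2/3" as "at least 2/3"; a real BFT rule may well require strictly more than 2/3.

```python
from typing import Iterable, Set


def header_has_quorum(validator_set: Set[str], signers: Iterable[str]) -> bool:
    """True if at least 2/3 of the shard's validators co-signed the header.

    `signers` is assumed to contain only identities whose signatures have
    already been verified; anyone outside the validator set is ignored.
    """
    if not validator_set:
        return False
    valid = set(signers) & validator_set
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return 3 * len(valid) >= 2 * len(validator_set)


validators = {"v1", "v2", "v3"}
print(header_has_quorum(validators, ["v1", "v2"]))
print(header_has_quorum(validators, ["v1", "intruder", "other"]))
```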

theo-zil commented 1 year ago

That's a great point, edited my summary.

DrZoltanFazekas commented 1 year ago

1.a is a tradeoff between latency of cross-shard messages and liveness of the shard's consensus, as mentioned in the RFC. By introducing a delay (i.e. the block proposer on shard S includes a block h' from shard S' that was finalized a few slots ago instead of the "newest" block h'' that was finalized in the last slot of shard S') we can give all validators more time to get notified about block h' and vote on the proposal referring to it on shard S.
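In code, the delay amounts to something like the following. It is a sketch with hypothetical names, and it assumes block heights are plain integers starting at 0 and that the lag is a fixed number of S' blocks.

```python
def pick_reference_height(latest_finalised: int, lag: int) -> int:
    """Height on S' that the proposer on S refers to, `lag` blocks behind the tip."""
    return max(0, latest_finalised - lag)


def can_validate(reference_height: int, locally_finalised: int) -> bool:
    """A validator can only vote if it has itself seen the referenced block finalised."""
    return reference_height <= locally_finalised


proposer_view = 100  # newest finalised height of S' as seen by the proposer
slow_validator_view = 98  # a validator whose client is two blocks behind

# With no lag the slow validator has to reject; with a lag of 3 it can vote.
print(can_validate(pick_reference_height(proposer_view, 0), slow_validator_view))
print(can_validate(pick_reference_height(proposer_view, 3), slow_validator_view))
```

A larger lag tolerates slower clients but adds that many S' blocks of latency to every cross-shard message, which is the tradeoff.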

DrZoltanFazekas commented 1 year ago

If validators do not run a full node of the bridged-over x-shard but only a light client which connects to an untrusted full node, the notifications about the subscribed transactions must carry Merkle proofs of those transactions. This is part of the light client protocols because light clients only know the block headers which contain a Merkle root, and must verify if the transactions the full node notifies them about are indeed included in a block based on the Merkle root in that block and the Merkle proof received along with the transaction.
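For illustration, this is roughly what the light client check looks like. The sketch assumes a plain binary Merkle tree over SHA-256, with a one-byte prefix separating leaf and inner hashes and the last node paired with itself on odd levels. The tree layout ZQ2 actually uses is not specified in this thread, so treat all of that as an assumption.

```python
import hashlib
from typing import List, Tuple

Proof = List[Tuple[bytes, bool]]  # (sibling hash, sibling_is_left)


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves: List[bytes]) -> List[List[bytes]]:
    """All tree levels, from leaf hashes up to the single root hash."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_h(b"\x00" + leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:
            level = level + [level[-1]]  # odd level: pair the last node with itself
        level = [_h(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(leaves: List[bytes]) -> bytes:
    return build_levels(leaves)[-1][0]


def merkle_proof(leaves: List[bytes], index: int) -> Proof:
    """Built by the full node: sibling hashes on the path from leaf to root."""
    proof: Proof = []
    for level in build_levels(leaves)[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # node was paired with itself
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(tx: bytes, proof: Proof, root: bytes) -> bool:
    """Run by the light client, which only knows `root` from the block header."""
    node = _h(b"\x00" + tx)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = _h(b"\x01" + pair)
    return node == root
```

The full node sends `(tx, merkle_proof(txs, i))` with each notification; the light client recomputes the root from that and compares it with the one in the header it already validated.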

theo-zil commented 1 year ago

Indeed that's true, the light client won't trust the full nodes. Your x-shard validator will trust the light client so your x-shard can still eschew all proofs, you just gotta make sure the light client verifies them.

Note that another option is running a light node, which effectively follows consensus, and simply doesn't participate in voting and does not store any history except block headers. But they will observe directly the votes for all new blocks, and will hold a mempool of transactions and will observe those transactions getting included into blocks, so they don't need separate proofs. In our case that will be quite convenient since x-shards only care about transactions in the latest block, so a node that can effectively receive all new blocks from the p2p network and notify the validator of new transactions - but without using any storage or needing any stake/authority - will work perfectly fine. The advantage is that it avoids having to configure the "light client" to connect to specific fullnodes and trust them, as the light node can observe the consensus happening directly, so transactions cannot be censored from it by any individual node.

DrZoltanFazekas commented 1 year ago

Ok, in my definition light client = lightweight node. It only keeps track of block headers but not the transactions in the blocks (otherwise it would be a full node and not light node). It needs to request the transactions (via subscriptions) from either a trusted full node without Merkle proofs or a trustless full node with Merkle proofs.

I'm not sure if a lightweight node maintaining the transaction pool of every bridged-over x-shard is light enough or it's actually already a full node as it does not only have the block headers but the transactions too. Btw, a full node is not necessarily a validator i.e. it does not have to participate in the consensus.

rrw-zilliqa commented 1 year ago

Bear in mind, FWIW, that not every X-Shard will be trustable, so just because your X-Shard says "transfer USDC$1m to Fred" doesn't mean you should do it. A corollary is that bridges between X-Shards are untrusted.

I don't anticipate PoS shared-security XShards being of much use. I would think most XShards will be secured by their own independent staking or PoA - I'll write an RFC on this, so I don't think many validators will be handling very many XShards at the same time - that said, validators will tend to want to do this up to their performance limit so as to claim the rewards for doing so.

theo-zil commented 1 year ago

Would be great to have some clarity on this because the way I see it, whether we want shared PoS or not might significantly affect the complexity required. If we go for independent security, and eschew shared PoS, then my list of tasks above should lead to a working deployable PoC and feels very achievable in maybe a couple of months or so.

Way I see it, which security model we use will depend more or less entirely on the business cases for our chain, so @rrw-zilliqa I assume you'd have the most visibility on what we really need here. Looking forward to reading your thoughts in the RFC.

rrw-zilliqa commented 1 year ago

Yes - sorry; you are quite right. Annoyingly, I am schmoozing for most of the rest of the day, but will write up ASAP.

The take-away will be:

XShards exist
They are controlled by a contract (probably one big "here are our shards" ENS resolver, pointing to individual contracts for the shards)
To be a validator in an XShard you need to be accepted by the appropriate view function, which will also tell you what your stake is (might want to put some conditions on when this changes?)
You as a validator op then bid for the XShards you want to validate.
Something (probably the shard contract?) decides who it wants to validate.
You then get on and validate those shards with that stake for this epoch (which will be programmable if we can)
I will think about how we do in-shard stake validation (probably via a reverse bridge? or a corresponding Shard contract in the shard - any ideas?)

So basically, the rules about validation for XShards are programmable. You could implement full security staking as @DrZoltanFazekas suggests, or PoA, or base the shard's security on how many clown costumes each validator owns...
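A very rough sketch of what "programmable" could mean here, with the shard contract's view function modelled as a plain callable that returns a validator's stake, or `None` if the validator is not accepted. Every name below is hypothetical, and picking the highest-staked bidders is just one possible selection rule.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set

# Assumption: the "view function" maps a validator id to its stake, or None if rejected.
StakeView = Callable[[str], Optional[int]]


def poa_view(authorities: Set[str]) -> StakeView:
    """PoA rule: a fixed set of trusted validators, with no stake."""
    return lambda validator: 0 if validator in authorities else None


def staked_view(stakes: Dict[str, int], minimum: int) -> StakeView:
    """Staking rule: accepted only with at least `minimum` stake."""
    return lambda validator: (
        stakes[validator] if stakes.get(validator, 0) >= minimum else None
    )


def select_validators(view: StakeView, bidders: Iterable[str],
                      max_validators: int) -> Dict[str, int]:
    """Choose this epoch's validators among the bidders, highest stake first."""
    accepted: List[tuple] = []
    for validator in bidders:
        stake = view(validator)
        if stake is not None:
            accepted.append((stake, validator))
    accepted.sort(key=lambda sv: (-sv[0], sv[1]))  # ties broken by id
    return {validator: stake for stake, validator in accepted[:max_validators]}
```

Swapping `staked_view` for `poa_view` (or any other callable with the same shape) changes the shard's security rule without touching the selection logic.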

rrw-zilliqa commented 1 year ago

(whilst killing a Wendigo, obv)

Zilliqa / zq2

Sharding strategy #178