MystenLabs / sui

Sui, a next-generation smart contract platform with high throughput, low latency, and an asset-oriented programming model powered by the Move programming language
https://sui.io
Apache License 2.0

[fastx] How to facilitate Bulk Sync between authorities, authorities and full replicas #194

Closed gdanezis closed 2 years ago

gdanezis commented 2 years ago

Authorities need assurance at some point that they have ingested all updates from other authorities up to a given point. Similarly, full replicas (i.e. nodes that try to replicate all state, but do not validate yet) also need to track the full state of the system, and to do so need to know they have received all updates from authorities up to a given point. This design task is about designing that mechanism.

gdanezis commented 2 years ago

@huitseeker has written on some ingredients we can use to solve this, here: https://docs.google.com/document/d/1LXVNqhHSbY499G-seenTYCPRHXormZLK9BXzMSmG7L4/edit?usp=sharing

Key readings:

gdanezis commented 2 years ago

Here are my current thoughts on Sync design: https://docs.google.com/document/d/10SpQluu2Rpc9loUBbxyaCfXqCtkB-FwIZuucnDF_DSQ/edit?usp=sharing @sblackshear , @huitseeker any comments welcome.

huitseeker commented 2 years ago

The core of the argument against the way you're implementing the push-pull gossip (and then using it to argue that the byzantine authorities never flood anybody else) is that:

QED: honest authorities must be open to receiving, up to fairness, a stream of transaction hashes claimed to be newly produced by any other authority, including Byzantine ones. Those messages must be accepted as long as they are fairly rate-limited (i.e. a Byzantine authority does not produce more data than any other would), and can therefore be at least as large as the volume of transaction hashes produced by the median honest authority over the same period of time.

If an honest authority produces V bytes per second of transactions, the corresponding stream of transaction hashes is a V/n data stream, so honest authorities must be open to receiving V/n bytes per second from a Byzantine authority, without being able to constrain those hashes to match valid or new transactions. As a consequence, with f Byzantine authorities, the extra ingress that an honest node must accept is V·f/n, which should be sufficient to cause a significant amount of slowdown when f is concretely large (average transaction: 512 bytes, average hash: 32 bytes, ratio n = 16).
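For concreteness, here's a back-of-the-envelope version of that bound in Rust; the numbers are purely illustrative, not measurements:

```rust
// Back-of-the-envelope check of the bound above (illustrative numbers only:
// 512-byte transactions, 32-byte hashes, so n = 512 / 32 = 16).
fn byzantine_hash_ingress(honest_bytes_per_sec: f64, n: f64, f: u32) -> f64 {
    // Each Byzantine authority may push a hash stream of V / n bytes per second;
    // with f of them, the honest node must absorb V * f / n on top of real traffic.
    honest_bytes_per_sec * f as f64 / n
}

fn main() {
    // e.g. V = 10 MB/s of honest transaction data, n = 16, f = 33 Byzantine authorities
    let extra = byzantine_hash_ingress(10_000_000.0, 16.0, 33);
    println!("extra ingress from Byzantine hash streams: {extra} B/s"); // ~20.6 MB/s
}
```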

huitseeker commented 2 years ago

There's quite a bit more on IBLT-based solutions, notably:

gdanezis commented 2 years ago

Here I discuss one more ingredient we can use for end-of-epoch or state commitment sync, to get an agreed checkpoint with most of the work done asynchronously and a tiny bit synchronously: https://docs.google.com/presentation/d/1TSs1FPC9INZMpipP1vP2dK7XMmOXtHk5CLWFKO6s7p8/edit?usp=sharing

huitseeker commented 2 years ago

Here's a new design that aims at producing a checkpoint as well, but hints at limited amounts of state synchronization along the way: https://docs.google.com/document/d/17KjGGLl8L8MSkJ0RBS5au-jR4Jp475BFi_1-rJCS6mk/edit?usp=sharing

huitseeker commented 2 years ago

In a neighboring PR, @gdanezis was asking:

I am still unsure how a facility for a (say byzantine) node to commit to its sequence, and to allow clients to ask [for stuff] can lead to any kind of DoS or resource exhaustion.

It's not about the pull-based logic

The notion of "asking once you've seen the commitments", like any pull-based logic, is not a cause for concern. In fact, the commitments themselves, as long as they point me to finite-size payloads that each carry authentication, very likely make it safe for me to ask for unseen data productively. Concretely, once I ask the server using a hash it signed over, it had better deliver me the correct and valid payload, or I'll have the cryptographic evidence to expose it as Byzantine.
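As an illustration of that accountability argument, here is a minimal sketch; the types and the use of SHA-256 are assumptions for the example, not the actual fastx code:

```rust
use sha2::{Digest, Sha256};

// A commitment the server signed over (signature assumed already verified).
struct SignedCommitment {
    hash: [u8; 32],
    signature: Vec<u8>,
}

enum FetchOutcome {
    Valid(Vec<u8>),
    // Keep the evidence: the commitment the server signed plus the payload it served.
    FraudProof(SignedCommitment, Vec<u8>),
}

fn check_response(commitment: SignedCommitment, payload: Vec<u8>) -> FetchOutcome {
    let digest = Sha256::digest(&payload);
    if digest[..] == commitment.hash[..] {
        FetchOutcome::Valid(payload)
    } else {
        // The server committed to one hash and served another payload: Byzantine.
        FetchOutcome::FraudProof(commitment, payload)
    }
}
```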

The issue is that upstream of that, commitments without throughput limits are in themselves a "damned if you do, damned if you don't" situation: if as a node I don't subscribe to enough commitment streams, I'm likely to be unable to sync with the network, and if I do subscribe to Byzantine authorities, I'm likely to be bogged down in unproductive resource consumption (through the commitments alone) until I can determine for sure that the commitment data I'm receiving from a subscribee is not valuable.

Replay attacks

The problem is that determining the value of commitment data itself (without committing to further resource consumption) is very hard, and quickly becomes a cat-and-mouse game of how much ingress throughput a subscribee can steal from the subscriber without revealing itself as Byzantine. That's because there's not much I can do with commitments except read them and see if I already have something for them, by which time the damage is done. The typical attack here is not a DDoS, it's a replay attack, where the replay is information that is 💯 valid but known by the Byzantine node not to advance the client node's progress.

In itself, this commitment replay attack is an amplification of the resource consumption of the sync process, and because it occurs at the expense of non-sync activities of the node (e.g. ingesting new transactions), it becomes a global throughput limit on the whole system.

One simple heuristic for clamping down on these shenanigans is to say that, for a given unit of time, a client must be able to tell whether there are interesting new elements to download anywhere in the system after having seen at most s bytes of commitments per server, with s chosen so that (3f+1)·s is well below the node's ingress limit.
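As a rough illustration of that budget (the names and numbers are made up, not a proposal for actual constants):

```rust
// Minimal sketch of the heuristic above: cap the commitment bytes accepted per
// subscribee per unit of time so that (3f + 1) * s stays well below the node's
// ingress limit. All parameters are illustrative.
fn per_server_commitment_budget(f: u32, ingress_limit_bytes: u64, safety_margin: f64) -> u64 {
    let committee_size = 3 * f as u64 + 1;
    // s = margin * ingress / (3f + 1), so (3f + 1) * s = margin * ingress << ingress.
    ((ingress_limit_bytes as f64 * safety_margin) / committee_size as f64) as u64
}

fn main() {
    // e.g. 100 MB/s of ingress, f = 33, reserve only 10% of ingress for commitments
    let s = per_server_commitment_budget(33, 100_000_000, 0.1);
    println!("accept at most {s} bytes of commitments per server per second");
}
```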

It's of course a heuristic: I'm not saying it's impossible to have a protocol that works without a hard global cap on the throughput of commitments. For example, the cap could be adaptive and be readjusted periodically depending on load, just as the difficulty is for PoW blockchains. It just gets complicated quickly, and needs to account for f actors serving a commitment replay attack at the highest throughput they can get away with.

Absolutely zero structure on batch production (no max batch size, no batch frequency), though, is an additional risk because it lets a Byzantine node do whatever it wants, and leaves subscribers with only network-based QoS as a defense (evenly dividing the throughput dedicated to commitment download among subscribees). Network-based QoS is possible, but it's not easy to implement, and the implementations are not efficient on the most common network infrastructures today.

How to fit things into a constant?

I don't have the full story, but one thing that helps is this:

If a node has recently seen a batch of transactions, many of them probably refer to other transactions in the same batch, depending on each other. E.g. if I have seen (TX1) and (TX2, which depends on TX1) within the same batch, I don't need to commit to TX1 within the batch. If I commit to TX2 alone, an honest node that has neither TX1 nor TX2 will ask me for it, check that TX2 is valid, and come back with an ask for TX1 (through causal pointers).
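A small sketch of that idea, with hypothetical types rather than the real fastx ones: only commit to the transactions in a batch that no other transaction in the same batch depends on, and let honest peers recover the rest through causal pointers.

```rust
use std::collections::HashSet;

type TxDigest = [u8; 32];

struct Tx {
    digest: TxDigest,
    parents: Vec<TxDigest>, // causal dependencies (prior transactions this tx builds on)
}

fn batch_commitment_frontier(batch: &[Tx]) -> Vec<TxDigest> {
    let in_batch: HashSet<TxDigest> = batch.iter().map(|t| t.digest).collect();
    // A digest is "covered" if another tx in the same batch depends on it; peers can
    // recover covered transactions by walking parent pointers from the frontier.
    let covered: HashSet<TxDigest> = batch
        .iter()
        .flat_map(|t| t.parents.iter())
        .filter(|p| in_batch.contains(*p))
        .copied()
        .collect();
    batch
        .iter()
        .map(|t| t.digest)
        .filter(|d| !covered.contains(d))
        .collect()
}
```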

How likely is that to occur within the same batch? The Utreexo paper (Figure 2) makes me think this is a pretty frequent case, for Bitcoin at least.

The question I'm asking myself is: can we structure the presentation of commitments so that we transmit commitments to useful data more sparsely?

gdanezis commented 2 years ago

I got a few more resources from Martin Kleppmann:

Martin Kleppmann <martin.kleppmann@cl.cam.ac.uk>, Thu 24/02/2022 11:18

Hi George,

That sounds good! On the problem of exchanging differences between sets with lots of overlap, there has been some interesting work on BFT set reconciliation algorithms: https://arxiv.org/abs/1905.10518 https://github.com/sipa/minisketch

If you're willing to organise the elements of the set into a hash graph, you can also use a graph reconciliation algorithm, which I've worked on: https://martin.kleppmann.com/2020/12/02/bloom-filter-hash-graph-sync.html https://arxiv.org/abs/2012.00472

Best wishes, Martin

huitseeker commented 2 years ago

I'm very familiar with Erlay, and have worked on extending + fuzzing Rust bindings for minisketch. They're a nice constant-factor improvement on IBLTs that works along the same principles (this is what I was mentioning to @asonnino), which we have already integrated into our thinking.

I'm also very familiar with the hash graph Martin is suggesting (@gdanezis I believe I mentioned it before I joined -- you then shared that you had attended Martin's presentation in person in 2019), but it works by reconstituting causal broadcast from scratch, and has a prohibitive communication complexity cost.

gdanezis commented 2 years ago

It feels like we now have a critical mass of eyes on this issue, as well as design options / ingredients / proposals, so let's try to make a plan for incremental progress, while still thinking about and designing the best mechanisms for today and down the line.

Priorities

In my mind we have the following priorities:

I think that so far we (well, I first of all, mea culpa) have referred to all of these interchangeably with variants of sync / synchronization names (the S-word), leading to confusion. These functions are interrelated, and solving one well can support the others. In fact, sometimes it feels like we are going in circles, since (as suggested in the bullet points above) there can be a cyclical dependency between their solutions. But I think it helps to mentally separate these needs, if only to plan their development.

Summary of proposals and match to priority

I think that this is how the different strands of proposals fit into the above priorities:

Design decisions, and their nature

My aim here is to separate design decisions that impact the protocol -- namely those that all authorities / clients need to agree upon for the system to function -- from design decisions that are more flexible, in that authorities and clients can exercise a certain amount of discretion, and so we can afford to keep refining the best techniques incrementally now, through testnet/mainnet and beyond. This is an important distinction for decentralised platform protocols: we need to impose the smallest possible set of obligations on everyone, and maximise the flexibility for peer innovation down the line.

Core Protocol decisions:

Non-core Protocol decisions:

Call to action: @huitseeker , @velvia , @lxfind , @laura-makdah , @sblackshear -- this is my thinking; let's have a meeting this week or the next to ensure we are all on the same page, make a schedule of when / how to make the core protocol decisions above, agree on which decisions should be core and which we can be more flexible about (non-core), and then move forward.

huitseeker commented 2 years ago

Inverted index:

Then come the hybrid proposals, both based on timestamping, to categorize some state as "behind us", and therefore help both sync (there's some state we no longer have to talk about) and agreement (there's some part of state we're confident we're done with):

There's also IIUC a topology proposal:


Personal note

Here are cross-cutting concerns I have with many of the proposals above, very much including some of mine:

lavindir commented 2 years ago

At Fastly, the first way we handled syncing of distributed rate-limiting data was using Bloom filters to detect differences in state. But that ended up being far more expensive than we imagined, and a more naive solution that sends the interesting set on a regular basis was much faster in practice. We did have the advantage of idempotency, though, so merging data that was already known was not an issue. The larger global cache network communicated using a variation of push-pull gossip, which allowed a message to propagate globally very quickly.

velvia commented 2 years ago

@gdanezis thank you for a very comprehensive summary, that really helps. Will try to read through the spanning tree proposal before the meeting tomorrow.

What mechanism should each authority provide for services and replicas to sync with it (this is priority A)? This is largely a peering decision, and one that can evolve over time. It's OK to have a v0 protocol (that, say, works for smaller volumes) and then provide a v1 protocol that supports more features / is more efficient. This impacts the relationship between an authority and its clients. Of course, we prefer something that many will find useful, as what we come out with on mainnet is likely to be the default for some time.

What I'm hoping for from a data architecture perspective is that: 1) through successive state agreements, an increasingly large portion of the older state becomes globally agreed; 2) this agreed-upon older state can live outside of authorities, e.g. in warm or cold storage somewhere -- hence we want a tiered storage architecture; and 3) this allows the bulk of analysis, replication, and state sync to happen without the involvement of authorities, which can then focus on the latest transactions.

I know this is a tangential concern, but I'm just noting it.

velvia commented 2 years ago

BTW @gdanezis, I think the spanning tree checkpointing is brilliant. I like it much more than using modulo math to do n-way set reconciliation. I wonder who keeps track of the graph, etc., though.

There is one failure mode I haven't seen discussed much. I suspect that, much more frequently than Byzantine attacks, we'll see the scenario where some authority was offline for a few minutes, leaving a big "hole" in its data. That hole would need to be filled first for the O(diff) algorithms to be efficient.

gdanezis commented 2 years ago

Edit from @huitseeker : I backed up this long comment copy-pasting the chat bar of one of our discussions from March 1st here: https://gist.github.com/huitseeker/90ea22a19309b207ee8015c1cd8fbfa4

huitseeker commented 2 years ago

State Synchronization discussions summary

George's pre-game summary in this comment is a good picture of where we were (esp. the proposals) before the discussions on March 1st.

Problems

In a nutshell, we agree we are going to have to solve all three of these problems:

We note that while a Follower can afford to trust its server, the reconciliation and agreement problems have to be designed for a Byzantine environment.

Solutions

Solving problem A is relatively uncontroversial, in that there's only one proposal for how to go about it (which will benefit from some of the insight above).

For problems B & C, we agree that from the moment that we know the hash of a new transaction from a certain sender, we know how to safely[^2] synchronize everything that is a causal antecedent of it, including all prior transactions for this sender. So the main obstacle is an exchange of metadata: the hashes of the latest transactions from each account.
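To make that "safe causal sync" step concrete, here is a rough sketch under assumed placeholder types (TxDigest, Certificate, Store, Peer are illustrative, not the real fastx APIs): starting from a hash we learned about, we pull each missing certificate and recurse into its parents.

```rust
use std::collections::HashSet;

type TxDigest = [u8; 32];

struct Certificate {
    digest: TxDigest,
    parents: Vec<TxDigest>, // causal antecedents, including the sender's prior transactions
}

trait Store {
    fn contains(&self, d: &TxDigest) -> bool;
    fn insert(&mut self, cert: Certificate);
}

trait Peer {
    // The peer committed to `digest`, so it must serve a valid certificate for it
    // or expose itself as Byzantine.
    fn fetch(&self, digest: &TxDigest) -> Option<Certificate>;
}

fn sync_causal_history(root: TxDigest, peer: &impl Peer, store: &mut impl Store) {
    let mut to_fetch = vec![root];
    let mut seen = HashSet::new();
    while let Some(digest) = to_fetch.pop() {
        if store.contains(&digest) || !seen.insert(digest) {
            continue; // already stored or already queued
        }
        if let Some(cert) = peer.fetch(&digest) {
            // Real code would verify signatures and validity before storing.
            to_fetch.extend(cert.parents.iter().copied());
            store.insert(cert);
        }
    }
}
```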

In solving B & C, we have several categories of proposals:

The challenge in these protocols is to make sure they are parsimonious in their bandwidth usage (by all rights, we should exchange but a tiny amount of data).

In the short term, we can focus on just the proposals that require modifications to the core protocol (better to engineer those early). Only the timestamper approaches have that requirement.

The core advantage of these approaches is that they can define rough segments of state that are produced at a limited rate in the first place:

Where this helps reconciliation is that instead of entering an open-ended reconciliation problem over all data from genesis, we can now have sub-protocols where we address timestamp ranges one after the other, and make synchronization incremental.

Where this helps agreement is that we can agree over fragments of the past we commit to have seen: if we assume a reconciliation mechanism that ensures delivery within some time, all honest authorities can take the set of their certificates that are well older than the time it takes to disseminate them, put these in a state commitment, and try to get agreement on it (2f+1 signatures).
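A loose sketch of that agreement step, with made-up types, and ignoring stake weighting and signature verification:

```rust
use std::collections::BTreeSet;

type Digest = [u8; 32];

// Commitment to all certificates an authority has seen that are older than a cut-off
// comfortably beyond the dissemination delay.
struct StateCommitment {
    certificates: BTreeSet<Digest>,
    cutoff_timestamp_ms: u64,
}

struct SignedStateCommitment {
    commitment: StateCommitment,
    // (authority id, signature bytes) pairs; verification is omitted in this sketch.
    signatures: Vec<(u32, Vec<u8>)>,
}

fn is_agreed(signed: &SignedStateCommitment, f: usize) -> bool {
    // Agreed once a quorum of 2f + 1 authorities has signed the same commitment.
    signed.signatures.len() >= 2 * f + 1
}
```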

We then discussed some details of proposal [3], namely:

At the end of the meeting, we had a positive outlook on integrating some form of timestamps into the protocol.

[^1]: e.g. the pool of machines could publish their collective data over a gossip network.
[^2]: in a way resistant to Byzantine participants.

gdanezis commented 2 years ago

This is the background discussion to the #1099 issue.