Conservatively, I am putting it on phase 1. Feel free to postpone it as you see fit.
Tend to agree unless there are good arguments for postponing this
@dtribble suggests that as long as catching up is 3x to 5x faster than the running chain, we can postpone this to a later milestone.
possible optimization:
(I think we only replay on-line vats, of which there is a bounded number, so that optimization doesn't seem worthwhile)
First, we need to measure how long catching up currently takes, based on the estimated number of blocks, and/or calculate the recovery rate in blocks per unit of time. Create a sub-ticket for this initial measurement.
Some mainnet0 data shows that new nodes do eventually catch up, but it takes a long time. https://github.com/Agoric/agoric-sdk/issues/4106#issuecomment-1059423368
The validator community seems to prefer informal snapshot sharing. https://github.com/Agoric/testnet-notes/issues/42
A few quick thoughts:
* My understanding is that the meaningful number to measure is the amount of time spent in Swingset compared to the time elapsed since genesis. That gives Swingset utilization. If you consider the cosmos processing to be comparatively negligible, you can then calculate the time it'd take to rebuild all the JS state through catch-up, and it also gives a lower bound (see the sketch after this list).
* Can we replay each vat separately and in parallel?
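A back-of-the-envelope sketch of that arithmetic; the input numbers below are placeholders, not measurements.

```js
// Rough sketch of the utilization arithmetic described above; the inputs
// are made-up placeholders, not real measurements.
const secondsSinceGenesis = 90 * 24 * 3600; // assumed: ~90 chain days
const secondsSpentInSwingset = 26 * 3600;   // assumed: measured from slog timestamps

// Fraction of wall-clock time the chain spends executing Swingset work.
const swingsetUtilization = secondsSpentInSwingset / secondsSinceGenesis;

// If cosmos-side processing is negligible, replaying from genesis has to
// redo at least the Swingset work, so this is a lower bound on catch-up time.
const minCatchupSeconds = secondsSpentInSwingset;

// Catch-up speed relative to the running chain (the "3x to 5x" criterion).
const speedupVsChain = secondsSinceGenesis / minCatchupSeconds;

console.log({ swingsetUtilization, minCatchupSeconds, speedupVsChain });
```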
> Can we replay each vat separately and in parallel?
We need "state sync" to jump to a snapshot of the kernel data close to the current block; otherwise we can only replay and verify one block at a time, all the way from genesis. That's really slow, even if we do more in parallel.
@warner further to the discussion we just had about trade-offs between performance and integrity of snapshots, as I mentioned, our validator community is doing some informal snapshot sharing currently: Agoric/testnet-notes#42.
I looked around and found that it seems to take about 3.5min of downtime to do a daily mainnet0 snapshot. Some follow-up step to make the snapshot available seems to take significantly longer; I'm not sure what's going on in there...
---------------------------
|2022-03-14_01:00:01| LAST_BLOCK_HEIGHT 4103449
|2022-03-14_01:00:01| Stopping agoric.service
0
|2022-03-14_01:00:01| Creating new snapshot
|2022-03-14_01:03:33| Starting agoric.service
0
|2022-03-14_01:03:33| Moving new snapshot to /home/snapshots/data/agoric
155G /home/snapshots/snaps/agoric_2022-03-14.tar
|2022-03-14_02:19:47| Done
---------------------------
|2022-03-15_01:00:01| LAST_BLOCK_HEIGHT 4116868
|2022-03-15_01:00:01| Stopping agoric.service
0
|2022-03-15_01:00:01| Creating new snapshot
|2022-03-15_01:03:27| Starting agoric.service
0
|2022-03-15_01:03:27| Moving new snapshot to /home/snapshots/data/agoric
156G /home/snapshots/snaps/agoric_2022-03-15.tar
|2022-03-15_03:06:14| Done
---------------------------
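For reference, a tiny sketch of how the downtime and publish windows can be read off the log above (timestamps copied from the 2022-03-14 entry; the timezone is assumed only so the strings parse, and it cancels out of the differences):

```js
// Compute the service downtime and the follow-up publish window from the
// 2022-03-14 snapshot log entry above.
const stopped = new Date('2022-03-14T01:00:01Z');
const restarted = new Date('2022-03-14T01:03:33Z');
const done = new Date('2022-03-14T02:19:47Z');

const minutes = (from, to) => ((to - from) / 60000).toFixed(1);
console.log('downtime (min):', minutes(stopped, restarted)); // ~3.5
console.log('publish step (min):', minutes(restarted, done)); // ~76.2
```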
@Tartuffo and I talked; we think that for MN-1, the best we can do is to contribute to the informal state sharing: we run a follower, once a day we stop it, make a copy of the state directory, start it back up again, then publish the state. Maybe we set up a copy-on-write filesystem (like ZFS) and make a filesystem snapshot just after the DB commits happen, to avoid the downtime (which doesn't really matter until the copy starts to take a long time). New validators can choose to use our published state, or somebody else's, but in either case they're vulnerable to a sleeper-agent-type attack by the provider (e.g. the corrupted state vector includes vat code which behaves the same as normal until some trigger event, then sends lots of tokens to the attacker, and the provider waits until enough voting power is using the corrupted state before sending the trigger). Or they can start from scratch and "take the long way around", which takes a lot of time but does not expose them to that vulnerability.
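A minimal sketch of that daily routine, assuming a Node script, the systemd unit name from the log above, and a hypothetical ZFS dataset and output path:

```js
// Sketch of the daily follower-snapshot routine described above.
// Assumptions: systemd unit `agoric.service` (as in the log), state kept on a
// ZFS dataset named `tank/agoric-state`, and an output path -- all hypothetical.
import { execSync } from 'node:child_process';

const stamp = new Date().toISOString().slice(0, 10); // e.g. 2022-03-15

// Stop the follower so the DBs are quiescent, snapshot, restart.
execSync('systemctl stop agoric.service');
try {
  // A copy-on-write snapshot is nearly instantaneous, so downtime stays small.
  execSync(`zfs snapshot tank/agoric-state@${stamp}`);
} finally {
  execSync('systemctl start agoric.service');
}

// Publishing can then happen while the follower is already catching back up.
execSync(`zfs send tank/agoric-state@${stamp} | zstd > /home/snapshots/agoric_${stamp}.zst`);
```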
FYI, I'm still hopeful we'll manage to make our snapshots deterministic by then, so that they can at least informally be verified. Moddable has been making good progress on that front.
@michaelfig walked me through an idea today which sounds promising. Basically we shadow the kernelDB's `kvStore` in the IAVL tree, by modifying the SwingStore to accumulate a copy (in RAM) of all the kvStore writes and deletes as a block is running, then apply them to the IAVL tree after the LMDB commit (like we used to do with the block buffer, before we switched it to accumulate those deltas in an uncommitted LMDB transaction). Then we augment the other portions of SwingStore (snapshot storage, transcript storage) to submit hashes of their entries to the IAVL tree.
This would get us a hashed/consensus copy of the entire kvStore, and consensus on the contents of the other parts. The writes would take some time, but since the kernel isn't reading anything out of IAVL during normal operation, we aren't slowing down reads (in particular we aren't doing roundtrips for reads, which @michaelfig says was the real performance killer). And we're doing the writes in a single big batch, which is probably optimal.
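A sketch of what that shadowing might look like; `baseKvStore` follows the swing-store kvStore's get/set/delete shape and `iavlCommit` stands in for whatever hook cosmic-swingset would expose, both assumptions rather than real APIs.

```js
// Sketch: wrap the swing-store kvStore so every write/delete made during a
// block is also buffered in RAM, then flushed to IAVL after the LMDB commit.
// `baseKvStore` and `iavlCommit` are assumed interfaces, not real APIs.
function makeShadowingKVStore(baseKvStore, iavlCommit) {
  let pending = new Map(); // key -> value, or null for a deletion

  const kvStore = {
    get: key => baseKvStore.get(key), // reads never touch IAVL (no roundtrips)
    set: (key, value) => {
      baseKvStore.set(key, value);
      pending.set(key, value);
    },
    delete: key => {
      baseKvStore.delete(key);
      pending.set(key, null);
    },
  };

  // Called by the host after the LMDB commit for the block succeeds.
  const afterCommit = blockHeight => {
    iavlCommit(blockHeight, pending); // one big batched write to the IAVL tree
    pending = new Map();
  };

  return { kvStore, afterCommit };
}
```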
He tells me that the cosmos "snapshot server" has come a long way, and that it's usable as a follower node whose only job is to prepare hash-verifiable data for new validators to fetch. There are apparently hooks to allow some data to be kept outside the IAVL tree, as long as its hash is stored in IAVL. We could use this for the XS heap snapshots (the `snapStore`). And the transcripts could be validated with a layered hash (first store `h1 = hash(t1)`, then replace it with `h2 = hash(h1 + t2)`, then `h3 = hash(h2 + t3)`, etc.), since the client is always going to be fetching the entire transcript (even after upgrades, we retain the old transcripts), and it can perform the hash in-line as it receives the ordered entries.
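A sketch of that layered hash using Node's crypto module (the SHA-256 choice and the entry encoding are assumptions):

```js
// Sketch of the layered transcript hash: each stored hash commits to all
// prior entries, so a client can verify entries in order as it streams them.
import { createHash } from 'node:crypto';

const hashEntry = (priorHash, entry) =>
  createHash('sha256').update(priorHash).update(entry).digest('hex');

// Producer side: fold the transcript entries into a single running hash.
const transcriptHash = entries =>
  entries.reduce(hashEntry, ''); // h1 = hash(t1), h2 = hash(h1 + t2), ...

// Consumer side: recompute the same fold while streaming and compare at the end.
const verifyTranscript = (entries, expectedHash) =>
  transcriptHash(entries) === expectedHash;

// Example (entries encoded as plain JSON strings here, purely for illustration):
const entries = ['{"d":"deliver0"}', '{"d":"deliver1"}'];
console.log(verifyTranscript(entries, transcriptHash(entries))); // true
```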
When a new validator comes up, it talks to a snapshot server and finds out the most recent snapshot it can provide (maybe a week old). Then it does the light client thing and fetches enough block headers to validate the block for that snapshot (limited by the usual unbonding time issues). Then it fetches data from the snapshot server and compares the data against the Merkle proofs until it's got a full copy of the swingstore. We use this swingstore instead of initializing a new DB. Then we just resume the kernel from that stored state vector.
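Roughly, that bootstrap flow might look like the sketch below; every method on `io` is hypothetical and only meant to show the shape of the steps.

```js
// High-level sketch of the new-validator flow described above. All of the
// `io` methods (snapshot server, light client, Merkle proofs, swing-store
// population) are hypothetical placeholders, not real APIs.
async function bootstrapFromStateSync(io) {
  // 1. Ask a snapshot server for the most recent snapshot it can provide.
  const snapshot = await io.fetchLatestSnapshotInfo();

  // 2. Light-client verify the block header that commits to that snapshot
  //    (subject to the usual unbonding-time limits).
  const header = await io.lightClientVerifyHeader(snapshot.blockHeight);

  // 3. Fetch chunks and check each one against Merkle proofs from that header.
  const chunks = [];
  for await (const chunk of io.fetchSnapshotChunks(snapshot)) {
    io.verifyMerkleProof(header.appHash, chunk.proof, chunk.data); // throws on mismatch
    chunks.push(chunk.data);
  }

  // 4. Use the verified data as the swingstore instead of initializing a new
  //    DB, then resume the kernel from that stored state vector.
  const swingStore = await io.populateSwingStore(chunks);
  return io.resumeKernel(swingStore, snapshot.blockHeight);
}
```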
This doesn't provide a huge win unless/until we get our XS heap snapshots into the state vector, which of course depends upon them being consistent (which @mhofman is herding vigorously). But even without that, we'd save new validators the time it takes to reapply all the cosmos transactions, and we'd reduce the kernel work to replaying all the vats (not all the kernel-bound transactions), and replaying vats could be done in parallel.
I'm not sure how much of this work would be handled in #5542 and how much needs to happen just on the swingset side. My hunch is that this is mostly a host-application issue: swingset just does DB writes as usual (the host is responsible for providing a swing-store that squirrels away extra copies of the data, if it wants, and hashing them into some consensus state), and new swingsets are just handed a pre-initialized DB.
This ticket might need to cover `swing-store` providing APIs that make it easier for cosmic-swingset to pre-populate a DB with data from the state sync dataset, in addition to the current dependencies (including XS snapshots becoming deterministic).
I think hashing into consensus state is still a problem for the swingStore, isn't it? However, from what I recall of LMDB, we could start a thread opening the DB at a given point, start a read transaction, and dump/export the content of the DB and hash it. The consistency model of LMDB should guarantee that even with concurrent writes, the data read will be from the snapshot taken when the transaction was initiated. That would allow the state sync data to be generated in parallel across multiple blocks.
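A sketch of that export-and-hash pass; `openReadTxnEntries` is an assumed helper that yields the kvStore's [key, value] pairs in key order from within a single read-only LMDB transaction, so MVCC guarantees it sees the snapshot taken when the transaction began even while later blocks commit.

```js
// Sketch: hash a consistent dump of the kvStore taken inside one read-only
// LMDB transaction. `openReadTxnEntries` is an assumed helper (not a real
// API) yielding [key, value] pairs in key order from that transaction.
import { createHash } from 'node:crypto';

async function hashConsistentExport(openReadTxnEntries) {
  const hash = createHash('sha256');
  // MVCC: everything yielded here comes from the moment the read transaction
  // began, even if the main process keeps committing new blocks concurrently.
  for await (const [key, value] of openReadTxnEntries()) {
    hash.update(key).update('\n').update(value).update('\n');
  }
  return hash.digest('hex');
}
```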
Edit: I re-read the post above, and it documents another approach we had talked about, which is to duplicate the swingstore in the IAVL tree. I thought we had since shot that down for size constraints.
> @dtribble suggests that as long as catching up is 3x to 5x faster than the running chain ...

One validator notes:

> the post-upgrade mainnet snapshot is already nearly 2GB in size.
One data point: I watched a node crash today; it missed about 200s before getting restarted. The restart took 2m10s to replay vat transcripts enough to begin processing blocks again, then took another 33s to replay the 95-ish (empty) missed blocks, after which it was caught up and following properly again.
The vat-transcript replay time is roughly bounded by the frequency of our heap snapshots: we take a heap snapshot every 2000 deliveries, so no single vat should ever need to replay more than 2000 deliveries at reboot time, so reboot time will be random but roughly constant (it depends on `deliveryNum % 2000` summed across all vats).
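In other words, the worst case replayed at reboot is roughly the sum sketched below (the per-vat delivery counts are made up):

```js
// Worst-case deliveries replayed at reboot: each vat replays at most the
// deliveries since its last heap snapshot, i.e. deliveryNum % snapshotInterval.
const snapshotInterval = 2000;
const vatDeliveryNums = [48123, 910, 120457]; // made-up per-vat delivery counts

const worstCaseReplay = vatDeliveryNums
  .map(n => n % snapshotInterval)
  .reduce((a, b) => a + b, 0);

console.log(worstCaseReplay); // 123 + 910 + 457 = 1490
```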
Note that this doesn't tell us anything about how long it takes to start up a whole new validator from scratch.
After discussing state-sync the other day, @arirubinstein mentioned that validators leverage state sync to work around a cosmos DB pruning issue: they start a new node that state-syncs from their existing node, which effectively gives them a pruned DB.
In case, for some reason, we can't figure out state sync by the time the DBs grow too large, we should check whether the following rough hack might work:
For consistency protection, Swingset saves the block height it last committed, and checks that the next block it sees is either the next block, N + 1, or the same block, N (in which case it doesn't execute anything, but simply replays the calls it previously made back to the Go/cosmos side).
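A sketch of that consistency check; the function and parameter names are illustrative, not the actual cosmic-swingset code.

```js
// Sketch of the block-height consistency check described above; the names
// here are illustrative, not the actual cosmic-swingset implementation.
function handleBlock(blockHeight, { lastCommittedHeight, executeBlock, replaySavedCalls }) {
  if (blockHeight === lastCommittedHeight + 1) {
    // Normal case: execute the new block.
    return executeBlock(blockHeight);
  }
  if (blockHeight === lastCommittedHeight) {
    // Same block seen again (e.g. after a restart): don't re-execute anything,
    // just replay the calls previously made back to the Go/cosmos side.
    return replaySavedCalls(blockHeight);
  }
  throw Error(`cannot skip from block ${lastCommittedHeight} to ${blockHeight}`);
}
```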
> ... as long as catching up is 3x to 5x faster than the running chain ...

A recent data point: 26 hours to catch up on 26 chain days, i.e. roughly 24x (26 days is about 624 hours of chain time replayed in 26 hours of wall-clock time).
Describe the bug
While there is a practice of sharing informal snapshots, the only in-protocol way to join an Agoric chain, currently, is to replay all transactions from genesis; this may take days or weeks. Contrast this with the norm in the Cosmos community:
> Other blockchain systems have similar features. In Bitcoin and Ethereum, software releases include a hash of a known-good state; this way, new nodes can download a state that is not more than a few months old and start verifying from there.
Design Notes
cc @michaelfig @erights