Conservatively, I am putting it on phase 1. Feel free to postpone it as you see fit.
Tend to agree unless there are good arguments for postponing this
@dtribble suggests that as long as catching up is 3x to 5x faster than the running chain, we can postpone this to a later milestone.
possible optimization:
(I think we only replay on-line vats, of which there is a bounded number, so that optimization doesn't seem worthwhile)
First, we need to measure how long catching up currently takes, based on the estimated number of blocks, and/or calculate the recovery rate in blocks per unit of time. Create a sub-ticket for this initial measurement.
Some mainnet0 data shows that new nodes do eventually catch up, but it takes a long time. https://github.com/Agoric/agoric-sdk/issues/4106#issuecomment-1059423368
The validator community seems to prefer informal snapshot sharing. https://github.com/Agoric/testnet-notes/issues/42
A few quick thoughts:
* My understanding is that the meaningful number to measure is the amount of time spent in Swingset compared to the time elapsed since genesis. That gives Swingset utilization. If you consider the cosmos processing to be comparatively negligible, you can then calculate the time it'd take to rebuild all the JS state through catch-up, and it also gives a lower bound (see the sketch after this list).
* Can we replay each vat separately and in parallel?
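A back-of-the-envelope sketch of that arithmetic; the input numbers below are placeholders, not measurements.

```js
// Rough sketch of the utilization arithmetic described above; the inputs
// are made-up placeholders, not real measurements.
const secondsSinceGenesis = 90 * 24 * 3600; // assumed: ~90 chain days
const secondsSpentInSwingset = 26 * 3600;   // assumed: measured from slog timestamps

// Fraction of wall-clock time the chain spends executing Swingset work.
const swingsetUtilization = secondsSpentInSwingset / secondsSinceGenesis;

// If cosmos-side processing is negligible, replaying from genesis has to
// redo at least the Swingset work, so this is a lower bound on catch-up time.
const minCatchupSeconds = secondsSpentInSwingset;

// Catch-up speed relative to the running chain (the "3x to 5x" criterion).
const speedupVsChain = secondsSinceGenesis / minCatchupSeconds;

console.log({ swingsetUtilization, minCatchupSeconds, speedupVsChain });
```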
> Can we replay each vat separately and in parallel?
We need "state sync" to jump to a snapshot of the kernel data close to the current block; otherwise we can only replay and verify one block at a time, all the way from genesis. That's really slow, even if we do more in parallel.
@warner further to the discussion we just had about trade-offs between performance and integrity of snapshots, as I mentioned, our validator community is doing some informal snapshot sharing currently: Agoric/testnet-notes#42.
I looked around and found that it seems to take about 3.5min of downtime to do a daily mainnet0 snapshot. Some follow-up step to make the snapshot available seems to take significantly longer; I'm not sure what's going on in there...
---------------------------
|2022-03-14_01:00:01| LAST_BLOCK_HEIGHT 4103449
|2022-03-14_01:00:01| Stopping agoric.service
0
|2022-03-14_01:00:01| Creating new snapshot
|2022-03-14_01:03:33| Starting agoric.service
0
|2022-03-14_01:03:33| Moving new snapshot to /home/snapshots/data/agoric
155G /home/snapshots/snaps/agoric_2022-03-14.tar
|2022-03-14_02:19:47| Done
---------------------------
|2022-03-15_01:00:01| LAST_BLOCK_HEIGHT 4116868
|2022-03-15_01:00:01| Stopping agoric.service
0
|2022-03-15_01:00:01| Creating new snapshot
|2022-03-15_01:03:27| Starting agoric.service
0
|2022-03-15_01:03:27| Moving new snapshot to /home/snapshots/data/agoric
156G /home/snapshots/snaps/agoric_2022-03-15.tar
|2022-03-15_03:06:14| Done
---------------------------
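For reference, a tiny sketch of how the downtime and publish windows can be read off the log above (timestamps copied from the 2022-03-14 entry; the timezone is assumed only so the strings parse, and it cancels out of the differences):

```js
// Compute the service downtime and the follow-up publish window from the
// 2022-03-14 snapshot log entry above.
const stopped = new Date('2022-03-14T01:00:01Z');
const restarted = new Date('2022-03-14T01:03:33Z');
const done = new Date('2022-03-14T02:19:47Z');

const minutes = (from, to) => ((to - from) / 60000).toFixed(1);
console.log('downtime (min):', minutes(stopped, restarted)); // ~3.5
console.log('publish step (min):', minutes(restarted, done)); // ~76.2
```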
@Tartuffo and I talked; we think that for MN-1, the best we can do is to contribute to the informal state sharing: we run a follower, once a day we stop it, make a copy of the state directory, start it back up again, then publish the state. Maybe we set up a copy-on-write filesystem (like ZFS) and make a filesystem snapshot just after the DB commits happen, to avoid the downtime (which doesn't really matter until the copy starts to take a long time). New validators can choose to use our published state, or somebody else's, but in either case they're vulnerable to a sleeper-agent-type attack by the provider (e.g. the corrupted state vector includes vat code which behaves the same as normal until some trigger event, then sends lots of tokens to the attacker, and the provider waits until enough voting power is using the corrupted state before sending the trigger). Or they can start from scratch and "take the long way around", which takes a lot of time but does not expose them to that vulnerability.
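A minimal sketch of that daily routine, assuming a Node script, the systemd unit name from the log above, and a hypothetical ZFS dataset and output path:

```js
// Sketch of the daily follower-snapshot routine described above.
// Assumptions: systemd unit `agoric.service` (as in the log), state kept on a
// ZFS dataset named `tank/agoric-state`, and an output path -- all hypothetical.
import { execSync } from 'node:child_process';

const stamp = new Date().toISOString().slice(0, 10); // e.g. 2022-03-15

// Stop the follower so the DBs are quiescent, snapshot, restart.
execSync('systemctl stop agoric.service');
try {
  // A copy-on-write snapshot is nearly instantaneous, so downtime stays small.
  execSync(`zfs snapshot tank/agoric-state@${stamp}`);
} finally {
  execSync('systemctl start agoric.service');
}

// Publishing can then happen while the follower is already catching back up.
execSync(`zfs send tank/agoric-state@${stamp} | zstd > /home/snapshots/agoric_${stamp}.zst`);
```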
FYI, I'm still hopeful we'll manage to make our snapshots deterministic by then, so that they can at least informally be verified. Moddable has been making good progress on that front.
@michaelfig walked me through an idea today which sounds promising. Basically we shadow the kernelDB's `kvStore` in the IAVL tree, by modifying the SwingStore to accumulate a copy (in RAM) of all the kvStore writes and deletes as a block is running, then apply them to the IAVL tree after the LMDB commit (like we used to do with the block buffer, before we switched it to accumulate those deltas in an uncommitted LMDB transaction). Then we augment the other portions of SwingStore (snapshot storage, transcript storage) to submit hashes of their entries to the IAVL tree.
This would get us a hashed/consensus copy of the entire kvStore, and consensus on the contents of the other parts. The writes would take some time, but since the kernel isn't reading anything out of IAVL during normal operation, we aren't slowing down reads (in particular we aren't doing roundtrips for reads, which @michaelfig says was the real performance killer). And we're doing the writes in a single big batch, which is probably optimal.
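A sketch of what that shadowing might look like; `baseKvStore` follows the swing-store kvStore's get/set/delete shape and `iavlCommit` stands in for whatever hook cosmic-swingset would expose, both assumptions rather than real APIs.

```js
// Sketch: wrap the swing-store kvStore so every write/delete made during a
// block is also buffered in RAM, then flushed to IAVL after the LMDB commit.
// `baseKvStore` and `iavlCommit` are assumed interfaces, not real APIs.
function makeShadowingKVStore(baseKvStore, iavlCommit) {
  let pending = new Map(); // key -> value, or null for a deletion

  const kvStore = {
    get: key => baseKvStore.get(key), // reads never touch IAVL (no roundtrips)
    set: (key, value) => {
      baseKvStore.set(key, value);
      pending.set(key, value);
    },
    delete: key => {
      baseKvStore.delete(key);
      pending.set(key, null);
    },
  };

  // Called by the host after the LMDB commit for the block succeeds.
  const afterCommit = blockHeight => {
    iavlCommit(blockHeight, pending); // one big batched write to the IAVL tree
    pending = new Map();
  };

  return { kvStore, afterCommit };
}
```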
He tells me that the cosmos "snapshot server" has come a long way, and that it's usable as a follower node whose only job is to prepare hash-verifiable data for new validators to fetch. There are apparently hooks to allow some data to be kept outside the IAVL tree, as long as its hash is stored in IAVL. We could use this for the XS heap snapshots (the `snapStore`). And the transcripts could be validated with a layered hash (first store `h1 = hash(t1)`, then replace it with `h2 = hash(h1 + t2)`, then `h3 = hash(h2 + t3)`, etc.), since the client is always going to be fetching the entire transcript (even after upgrades, we retain the old transcripts), and it can perform the hash in-line as it receives the ordered entries.
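A sketch of that layered hash using Node's crypto module (the SHA-256 choice and the entry encoding are assumptions):

```js
// Sketch of the layered transcript hash: each stored hash commits to all
// prior entries, so a client can verify entries in order as it streams them.
import { createHash } from 'node:crypto';

const hashEntry = (priorHash, entry) =>
  createHash('sha256').update(priorHash).update(entry).digest('hex');

// Producer side: fold the transcript entries into a single running hash.
const transcriptHash = entries =>
  entries.reduce(hashEntry, ''); // h1 = hash(t1), h2 = hash(h1 + t2), ...

// Consumer side: recompute the same fold while streaming and compare at the end.
const verifyTranscript = (entries, expectedHash) =>
  transcriptHash(entries) === expectedHash;

// Example (entries encoded as plain JSON strings here, purely for illustration):
const entries = ['{"d":"deliver0"}', '{"d":"deliver1"}'];
console.log(verifyTranscript(entries, transcriptHash(entries))); // true
```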
When a new validator comes up, it talks to a snapshot server and finds out the most recent snapshot it can provide (maybe a week old). Then it does the light client thing and fetches enough block headers to validate the block for that snapshot (limited by the usual unbonding time issues). Then it fetches data from the snapshot server and compares the data against the Merkle proofs until it's got a full copy of the swingstore. We use this swingstore instead of initializing a new DB. Then we just resume the kernel from that stored state vector.
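Roughly, that bootstrap flow might look like the sketch below; every method on `io` is hypothetical and only meant to show the shape of the steps.

```js
// High-level sketch of the new-validator flow described above. All of the
// `io` methods (snapshot server, light client, Merkle proofs, swing-store
// population) are hypothetical placeholders, not real APIs.
async function bootstrapFromStateSync(io) {
  // 1. Ask a snapshot server for the most recent snapshot it can provide.
  const snapshot = await io.fetchLatestSnapshotInfo();

  // 2. Light-client verify the block header that commits to that snapshot
  //    (subject to the usual unbonding-time limits).
  const header = await io.lightClientVerifyHeader(snapshot.blockHeight);

  // 3. Fetch chunks and check each one against Merkle proofs from that header.
  const chunks = [];
  for await (const chunk of io.fetchSnapshotChunks(snapshot)) {
    io.verifyMerkleProof(header.appHash, chunk.proof, chunk.data); // throws on mismatch
    chunks.push(chunk.data);
  }

  // 4. Use the verified data as the swingstore instead of initializing a new
  //    DB, then resume the kernel from that stored state vector.
  const swingStore = await io.populateSwingStore(chunks);
  return io.resumeKernel(swingStore, snapshot.blockHeight);
}
```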
This doesn't provide a huge win unless/until we get our XS heap snapshots into the state vector, which of course depends upon them being consistent (which @mhofman is herding vigorously). But even without that, we'd save new validators the time it takes to reapply all the cosmos transactions, and we'd reduce the kernel work to replaying all the vats (not all the kernel-bound transactions), and replaying vats could be done in parallel.
I'm not sure how much of this work would be handled in #5542 and how much needs to happen just on the swingset side. My hunch is that this is mostly a host-application issue: swingset just does DB writes as usual (the host is responsible for providing a swing-store that squirrels away extra copies of the data, if it wants, and hashing them into some consensus state), and new swingsets are just handed a pre-initialized DB.
This ticket might need to cover `swing-store` providing APIs that make it easier for cosmic-swingset to pre-populate a DB with data from the state sync dataset, in addition to the current dependencies (including XS snapshots becoming deterministic).
I think hashing into consensus state is still a problem for the swingStore, isn't it? However, from what I recall of LMDB, we could start a thread opening the DB at a given point, start a read transaction, and dump/export the content of the DB and hash it. The consistency model of LMDB should guarantee that even with concurrent writes, the data read will be from the snapshot taken when the transaction was initiated. That would allow the state sync data to be generated in parallel across multiple blocks.
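A sketch of that export-and-hash pass; `openReadTxnEntries` is an assumed helper that yields the kvStore's [key, value] pairs in key order from within a single read-only LMDB transaction, so MVCC guarantees it sees the snapshot taken when the transaction began even while later blocks commit.

```js
// Sketch: hash a consistent dump of the kvStore taken inside one read-only
// LMDB transaction. `openReadTxnEntries` is an assumed helper (not a real
// API) yielding [key, value] pairs in key order from that transaction.
import { createHash } from 'node:crypto';

async function hashConsistentExport(openReadTxnEntries) {
  const hash = createHash('sha256');
  // MVCC: everything yielded here comes from the moment the read transaction
  // began, even if the main process keeps committing new blocks concurrently.
  for await (const [key, value] of openReadTxnEntries()) {
    hash.update(key).update('\n').update(value).update('\n');
  }
  return hash.digest('hex');
}
```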
Edit: I re-read the post above, and it documents another approach we had talked about, which is to duplicate the swingstore in the IAVL tree. I thought we had since shot that down for size constraints.
> @dtribble suggests that as long as catching up is 3x to 5x faster than the running chain ...

One validator notes:

> the post-upgrade mainnet snapshot is already nearly 2GB in size.
One data point: I watched a node crash today; it missed about 200s before getting restarted. The restart took 2m10s to replay vat transcripts enough to begin processing blocks again, then took another 33s to replay the 95-ish (empty) missed blocks, after which it was caught up and following properly again.
The vat-transcript replay time is roughly bounded by the frequency of our heap snapshots: we take a heap snapshot every 2000 deliveries, so no single vat should ever need to replay more than 2000 deliveries at reboot time, so reboot time will be random but roughly constant (it depends on `deliveryNum % 2000` summed across all vats).
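In other words, the worst case replayed at reboot is roughly the sum sketched below (the per-vat delivery counts are made up):

```js
// Worst-case deliveries replayed at reboot: each vat replays at most the
// deliveries since its last heap snapshot, i.e. deliveryNum % snapshotInterval.
const snapshotInterval = 2000;
const vatDeliveryNums = [48123, 910, 120457]; // made-up per-vat delivery counts

const worstCaseReplay = vatDeliveryNums
  .map(n => n % snapshotInterval)
  .reduce((a, b) => a + b, 0);

console.log(worstCaseReplay); // 123 + 910 + 457 = 1490
```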
Note that this doesn't tell us anything about how long it takes to start up a whole new validator from scratch.
After discussing state-sync the other day, @arirubinstein mentioned that validators leverage state sync to work around a cosmos DB pruning issue: they start a new node that state-syncs from their existing node, which effectively gives them a pruned DB.
In case, for some reason, we can't figure out state sync by the time the DBs grow too large, we should check whether the following rough hack might work:
For consistency protection, Swingset saves the block height it last committed, and checks that the next block it sees is either the next block, N + 1, or the same block, N (in which case it doesn't execute anything, but simply replays the calls it previously made back to the Go/cosmos side).
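A sketch of that consistency check; the function and parameter names are illustrative, not the actual cosmic-swingset code.

```js
// Sketch of the block-height consistency check described above; the names
// here are illustrative, not the actual cosmic-swingset implementation.
function handleBlock(blockHeight, { lastCommittedHeight, executeBlock, replaySavedCalls }) {
  if (blockHeight === lastCommittedHeight + 1) {
    // Normal case: execute the new block.
    return executeBlock(blockHeight);
  }
  if (blockHeight === lastCommittedHeight) {
    // Same block seen again (e.g. after a restart): don't re-execute anything,
    // just replay the calls previously made back to the Go/cosmos side.
    return replaySavedCalls(blockHeight);
  }
  throw Error(`cannot skip from block ${lastCommittedHeight} to ${blockHeight}`);
}
```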
> ... as long as catching up is 3x to 5x faster than the running chain ...

A recent data point: 26 hours to catch up on 26 chain days, i.e. roughly 24x (26 days is about 624 hours of chain time replayed in 26 hours of wall-clock time).
Describe the bug
While there is a practice of sharing informal snapshots, the only in-protocol way to join an Agoric chain, currently, is to replay all transactions from genesis; this may take days or weeks. Contrast this with the norm in the Cosmos community:
> Other blockchain systems have similar features. In Bitcoin and Ethereum, software releases include a hash of a known-good state; this way, new nodes can download a state that is not more than a few months old and start verifying from there.
Design Notes
cc @michaelfig @erights