more efficient state storage

warner commented 5 years ago

Dean managed to kill a local testnet with a 165MB swingstate.json, which caused a V8 out-of-memory error during JSON.parse (provoked by a loop that kept modifying pixels every 10 seconds for about 3000 iterations). There are three things we should do to improve this:

On writing out the statefile, do a proper write-to-tempfile rename-in-place atomic update. We've seen evidence of zero-length statefiles on the cloud-side chain, which implies an error during state writeout (probably not what Dean saw but still bad). @michaelfig said he'd work on this.
Storing the state in something more compact than JSON. Maybe CBOR? The problem is that we want native code doing the encoding, not (slow) javascript. The kernel does the encoding, since we pass strings across the realm boundary, unless we want to expose a native-code/FFI object as an endowment to the kernel. So either we have the controller un-JSON and then re-CBOR into the statefile (compact but slower), or we provide a native encoding endowment to the kernel.
Dean wants streaming decode, which would reduce the memory footprint somewhat (you still hold the whole object graph in RAM, but you don't have to hold the stringified form of it all at once). I don't know if the stdlib JSON functions have a way to do that. Perhaps the FFI endowment we pick could do that.

I don't know that CBOR is the way to go, but we're also looking for a serialization scheme for the message arguments that can accomodate binary data (hashes, etc) without the goofy "base64 all the things" that a lot of projects use as a fallback. Given our community, CapnProto or Protobufs is plausible, but we rarely have schemas for the data, so we'll need their self-describing modes, and I don't know how well they handle that style.

Another issue is that writing out the whole state at any point seems like a large operation, and it'd be nice to do something more incremental. We originally tried storing the state in cosmos-sdk "keepers", which nominally made smaller changes, but the overall efficiency hit of recording lots of little internal changes (hundreds per turn) cratered the performance (fixed in https://github.com/Agoric/SwingSet/issues/94 by switching back to one big JSON.stringify per big-turn).

Once our kernelKeeper/vatKeeper data formats settle down a bit (https://github.com/Agoric/SwingSet/issues/88 is rewriting them), we might be able to do something more sophisticated: record deltas in a JS array as well as making changes to a big table, and then save-state becomes "write the deltas to a separate file" until we've written enough of them to warrant rewriting the main file instead. Assuming we don't mess up the consistency of the two representations, that should allow the incremental writes to be pretty small.

warner commented 5 years ago

Testnet-1.13.2 was killed by the OOM-killer (Out Of Memory) after about 3000 pixel-painting messages. The last kernel checkpoint was 218_878_899 bytes long, and the mailbox state was 10_107_407 bytes long. Node.js defaults to a 1.5GB heap, and the process was running on a host with 2GB of RAM. The last mailbox took 2.5s to serialize, and the kernel+mailbox checkpoint took 10s to write to disk.

Unfortunately the kernel checkpoint was truncated by a non-atomic write-file routine, so we don't know exactly how long the transcript was, but I expect it was some small multiple of Dean's 3000 messages. I don't know why the mailbox was so big: Dean's client should have been ACKing everything, so have to imagine there was another client configured at the beginning which then disconnected, so the mailbox was accumulating gallery-state updates sent to the unresponsive client. The new pubsub technique (#64) should fix that.

warner commented 5 years ago

Capnproto is kind of aimed at parsing data out of a serialized buffer with minimal copies, which isn't easy to take advantage of in something like Javascript. It seems better suited for communication than for state persistence.

I proposed using SQLite for storage: it has strong transaction semantics, a schema, and stores everything in a single statefile. The kernel.getState() function, which currently returns a string, would have to be rewritten to accept a suitably-protected endowment that allows some number of SQL statements to be run. Or, the kernel could accumulate proposed state changes (as data) and then we use a function to extract this list from the kernel, after which an external function applies them as INSERTs or whatever to the sqlite file (wrapped in a transaction).

@dtribble pushed back against the overhead of creating string-ish SQL statements.

The overall performance improvements to (somehow) implement are:

JS engine-supported heap checkpoints, to shrink the transcripts
denser encoding of kernel state (clists, object tables, promise tables, and transcripts)
GC to shrink the kernel tables (clists, object tables, promise tables)
- possibly by killing Vats more frequently
- or by engine support for weakrefs and a safe/deterministic way to react to decrefs

michaelfig commented 5 years ago

I suggest having a look at https://github.com/Level/level

Note especially that the storage-like interface is optimisable with an atomic batch() API.

katelynsills commented 5 years ago

I don't know if this is correct but I came across it today and it seems applicable: https://twitter.com/el33th4xor/status/1164215472784584709?s=20

Emin Gun Sirer says that "LevelDB does at most a few hundred tps, sustained. "

michaelfig commented 5 years ago

I wonder if a batch() call counts as one Transaction in the TPS, or as many?

katelynsills commented 5 years ago

I wonder if a batch() call counts as one Transaction in the TPS, or as many?

I'm not sure how Emin is counting things, but it looks like people are pushing back: https://twitter.com/DiarioBitcoin/status/1164259484505559040?s=20 https://twitter.com/MarkTravis15/status/1164239465625210880

Emin says:

"Indeed, there's a difference between DB benchmarks and crypto benchmarks. The latter exhibit poor locality."

Screen Shot 2019-08-22 at 11 32 55 AM

Tony Arcieri has some recommendations here: https://twitter.com/bascule/status/1164319302515691520

warner commented 5 years ago

For the way SwingSet manages state, I think we only need one transaction per block (and blocks are created about once every 5 seconds). We shouldn't be committing anything smaller than a block at a time. Reads are another question, but I imagine it's easier to get higher throughput on reads than on writes.

warner commented 5 years ago

Dean and I came up with a plan:

the kernel is given a special state storage endowment, used by the keepers
this endowment knows the kernel-state schema, rather than being a general key-value store (maybe. maybe a plain KV store would be sufficient)
the backing store is a disk-based database of some sort, with atomic batch writes
the kernel/keepers use storage to read state lazily from disk as it starts up. For each vat, it fetches a chunk of transcript commands, executes them, then fetches the next chunk, etc. So it never has to hold the whole state vector in RAM. The c-lists and tables are not fetched preemptively.
the keepers can cache state in RAM, or discard it and re-fetch it from storage as they see fit
each vat delivery makes the keepers create changes they wish to apply to the state vector at the end of the turn. The keepers (or some kernel-side component of storage must cache these changes, such that subsequent reads will see the changes (a "write-back cache"?), but not submit them yet.
if the vat causes a fatal error (syscall with invalid arguments), we want to discard all the state changes it made
when the host loop has finished driving the kernel, it tells the kernel "now is the time we save state", and the kernel tells storage to write out the accumulated deltas in an atomic commit to the backing store
maybe this takes the form of a big SQL command, maybe it's a LevelDB atomic batch/write.
using a plain KV store means we have to build composite keys on the kernel side, and the backing store is less schematized (so e.g. upgrade is harder, as there's less structure to rely upon)
maybe we can get denser storage by using an encoding scheme that knows more about our data structures
- write-to-tmpfile, atomic rename, but that requires copying a whole lot of data
- write deltas, write more deltas, build our own copy-on-write filesystem, occasional consolidation passes with an external program. sounds a lot like writing our own database, let's avoid that

dtribble commented 5 years ago

David Schwartz talked a little about XRPLedger's DB usage. They tried RocksDB first, and it seems to be a big improvement over LevelDB. (LevelDB was really client focused, and we need on more server focused).

But they needed more because of scale, and so developed a DB optimized for their ledger use case: https://github.com/vinniefalco/NuDB. Some discussion about it here: https://www.reddit.com/r/cpp/comments/60px64/nudb_100_released_a_keyvalue_database_for_ssds/df8i6rk/?context=8&depth=9

Given the RocksDB has transactions and a few other relevant features, it's worth looking at it first.

dtribble commented 5 years ago

A feature of Rocks DB that appears very relevant:

Column Families - which are essentially multiple separate key spaces in teh same DB (so separate key spaces per vat are plausible)

Support for Optimistic transactions and both sync and async mode in the Node integration (https://github.com/dberesford/rocksdb-node) also seem relevant.

warner commented 5 years ago

I think I'm going to start with LevelDB, since the wrapper package (leveldown) looks more mature/supported/widely-used than the corresponding wrapper for rocksdb (leveldown has ~100x the weekly downloads of rocksdb). And the extended features are only available through the rocksdb-node package, which has a giant "NOT MAINTAINED" warning on the README.

I don't think we need super-duper performance or those extended features, at least for now. We'll have one transaction every 5 seconds (one per block). But I'm happy to revisit this once we've written some performance benchmarks and get some concrete data.

warner commented 4 years ago

After Agoric/cosmic-swingset#109, the next lowest hanging fruit is probably to special-case vat transcripts. We need to read them at startup, but after buildVatController() returns, we never need to read those keys again (although we do constantly write to them, and we both read and write the v$NN.t.nextID field to know what v$NN.t.$MM "MM" keys to use). Maybe we could put the transcript lines in a separate file, maybe one file per vat. They should always be read in order, which might help avoid memory overhead as we ready them back in.

But going all the way to a proper synchronous DB would be better.

warner commented 4 years ago

in the old repo. this was cosmic-swingset issue 65

Agoric / agoric-sdk

more efficient state storage #254