Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

more efficient state storage #254

Closed: warner closed this issue 4 years ago

warner commented 5 years ago

Dean managed to kill a local testnet with a 165MB swingstate.json, which caused a V8 out-of-memory error during JSON.parse (provoked by a loop that kept modifying pixels every 10 seconds for about 3000 iterations). There are three things we should do to improve this:

I don't know that CBOR is the way to go, but we're also looking for a serialization scheme for the message arguments that can accommodate binary data (hashes, etc.) without the goofy "base64 all the things" approach that a lot of projects use as a fallback. Given our community, CapnProto or Protobufs is plausible, but we rarely have schemas for the data, so we'll need their self-describing modes, and I don't know how well they handle that style.
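To make that concrete, a self-describing encoding like CBOR carries binary fields natively; a minimal sketch using the npm `cbor` package (purely illustrative, not a committed choice):

```js
// Illustrative only: the `cbor` npm package, not a committed design choice.
const cbor = require('cbor');
const crypto = require('crypto');

// A message argument containing raw binary (a hash); no base64 needed.
const arg = {
  method: 'deposit',
  hash: crypto.createHash('sha256').update('payload').digest(), // Buffer
  amount: 37,
};

const encoded = cbor.encode(arg);              // Buffer of CBOR bytes
const decoded = cbor.decodeFirstSync(encoded); // round-trips the Buffer intact
```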

Another issue is that writing out the whole state at any point seems like a large operation, and it'd be nice to do something more incremental. We originally tried storing the state in cosmos-sdk "keepers", which nominally made smaller changes, but the overall efficiency hit of recording lots of little internal changes (hundreds per turn) cratered the performance (fixed in https://github.com/Agoric/SwingSet/issues/94 by switching back to one big JSON.stringify per big-turn).

Once our kernelKeeper/vatKeeper data formats settle down a bit (https://github.com/Agoric/SwingSet/issues/88 is rewriting them), we might be able to do something more sophisticated: record deltas in a JS array as well as making changes to a big table, and then save-state becomes "write the deltas to a separate file" until we've written enough of them to warrant rewriting the main file instead. Assuming we don't mess up the consistency of the two representations, that should allow the incremental writes to be pretty small.
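A minimal sketch of that delta-log idea (the names and the rewrite threshold here are hypothetical, not the actual kernelKeeper API):

```js
// Hypothetical sketch of the delta-log idea; not the actual kernelKeeper API.
const fs = require('fs');

const REWRITE_THRESHOLD = 1000; // arbitrary: rewrite the main file after this many deltas

let state = {};  // the big table
let deltas = []; // changes accumulated since the last full checkpoint

function set(key, value) {
  state[key] = value;          // keep the in-memory table current...
  deltas.push({ key, value }); // ...and record the delta for incremental saves
}

function saveState() {
  if (deltas.length < REWRITE_THRESHOLD) {
    // small incremental write: append only the deltas
    fs.appendFileSync('state.deltas', deltas.map(d => JSON.stringify(d) + '\n').join(''));
  } else {
    // enough deltas accumulated: rewrite the main file and truncate the log
    fs.writeFileSync('state.json', JSON.stringify(state));
    fs.writeFileSync('state.deltas', '');
  }
  deltas = [];
}
```

On restart, you'd load state.json and then replay state.deltas in order, which is why keeping the two representations consistent matters.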

warner commented 5 years ago

Testnet-1.13.2 was killed by the OOM-killer (Out Of Memory) after about 3000 pixel-painting messages. The last kernel checkpoint was 218_878_899 bytes long, and the mailbox state was 10_107_407 bytes long. Node.js defaults to a 1.5GB heap, and the process was running on a host with 2GB of RAM. The last mailbox took 2.5s to serialize, and the kernel+mailbox checkpoint took 10s to write to disk.

Unfortunately the kernel checkpoint was truncated by a non-atomic write-file routine, so we don't know exactly how long the transcript was, but I expect it was some small multiple of Dean's 3000 messages. I don't know why the mailbox was so big: Dean's client should have been ACKing everything, so I have to imagine there was another client configured at the beginning which then disconnected, and the mailbox was accumulating gallery-state updates sent to the unresponsive client. The new pubsub technique (#64) should fix that.

warner commented 5 years ago

Capnproto is kind of aimed at parsing data out of a serialized buffer with minimal copies, which isn't easy to take advantage of in something like JavaScript. It seems better suited for communication than for state persistence.

I proposed using SQLite for storage: it has strong transaction semantics, a schema, and stores everything in a single statefile. The kernel.getState() function, which currently returns a string, would have to be rewritten to accept a suitably-protected endowment that allows some number of SQL statements to be run. Or, the kernel could accumulate proposed state changes (as data) and then we use a function to extract this list from the kernel, after which an external function applies them as INSERTs or whatever to the sqlite file (wrapped in a transaction).
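A sketch of that second variant, using the synchronous better-sqlite3 package (the package choice and the kvstore schema are assumptions for illustration; the proposal only specifies extracting the change list from the kernel and applying it in one transaction):

```js
// Assumed package and schema, for illustration only.
const Database = require('better-sqlite3');
const db = new Database('swingstate.sqlite');
db.exec('CREATE TABLE IF NOT EXISTS kvstore (key TEXT PRIMARY KEY, value TEXT)');

const put = db.prepare('INSERT OR REPLACE INTO kvstore (key, value) VALUES (?, ?)');
const del = db.prepare('DELETE FROM kvstore WHERE key = ?');

// Apply the kernel's accumulated change list atomically, in one transaction.
const applyChanges = db.transaction((changes) => {
  for (const c of changes) {
    if (c.type === 'set') put.run(c.key, c.value);
    else if (c.type === 'delete') del.run(c.key);
  }
});

applyChanges([{ type: 'set', key: 'v1.t.nextID', value: '42' }]);
```

With prepared statements, the kernel never builds string-ish SQL itself: it only emits plain data records, and the host binds them as parameters.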

@dtribble pushed back against the overhead of creating string-ish SQL statements.

The overall performance improvements to (somehow) implement are:

michaelfig commented 5 years ago

I suggest having a look at https://github.com/Level/level

Note especially that the storage-like interface is optimisable with an atomic batch() API.
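For instance, a set of kernel-state changes could land atomically in one batch() call (a sketch against level's documented API; the keys here are made up):

```js
// Sketch of level's atomic batch(); the keys are made up.
const level = require('level');
const db = level('./swing-state');

// All operations in one batch() land atomically, or not at all.
db.batch([
  { type: 'put', key: 'v1.t.nextID', value: '42' },
  { type: 'put', key: 'v1.t.41', value: '{"d":"deliver"}' },
  { type: 'del', key: 'ko42.pending' },
], (err) => {
  if (err) throw err;
});
```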

katelynsills commented 5 years ago

I don't know if this is correct, but I came across it today and it seems applicable: https://twitter.com/el33th4xor/status/1164215472784584709?s=20

Emin Gun Sirer says that "LevelDB does at most a few hundred tps, sustained."

michaelfig commented 5 years ago

I wonder if a batch() call counts as one Transaction in the TPS, or as many?

katelynsills commented 5 years ago

> I wonder if a batch() call counts as one Transaction in the TPS, or as many?

I'm not sure how Emin is counting things, but it looks like people are pushing back: https://twitter.com/DiarioBitcoin/status/1164259484505559040?s=20 https://twitter.com/MarkTravis15/status/1164239465625210880

Emin says:

"Indeed, there's a difference between DB benchmarks and crypto benchmarks. The latter exhibit poor locality."

[Screenshot of the tweet exchange, 2019-08-22 11:32 AM]

Tony Arcieri has some recommendations here: https://twitter.com/bascule/status/1164319302515691520

warner commented 5 years ago

For the way SwingSet manages state, I think we only need one transaction per block (and blocks are created about once every 5 seconds). We shouldn't be committing anything smaller than a block at a time. Reads are another question, but I imagine it's easier to get higher throughput on reads than on writes.
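A hypothetical sketch of that pattern: buffer writes in memory during the block, serve reads through the buffer, and commit once per block:

```js
// Hypothetical block-buffer sketch; not actual SwingSet kernel code.
// `backingStore` is an assumed interface with get() and applyBatch().
function makeBlockBuffer(backingStore) {
  const pending = new Map(); // writes accumulated during the current block

  return {
    get(key) {
      // reads see the block's own uncommitted writes first
      return pending.has(key) ? pending.get(key) : backingStore.get(key);
    },
    set(key, value) {
      pending.set(key, value); // no I/O until the block commits
    },
    commitBlock() {
      backingStore.applyBatch([...pending.entries()]); // one transaction per block
      pending.clear();
    },
  };
}
```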

warner commented 5 years ago

Dean and I came up with a plan:

dtribble commented 5 years ago

David Schwartz talked a little about XRPLedger's DB usage. They tried RocksDB first, and it seems to be a big improvement over LevelDB. (LevelDB was really client-focused, and we need something more server-focused.)

But they needed more because of scale, and so developed a DB optimized for their ledger use case: https://github.com/vinniefalco/NuDB. Some discussion about it here: https://www.reddit.com/r/cpp/comments/60px64/nudb_100_released_a_keyvalue_database_for_ssds/df8i6rk/?context=8&depth=9

Given that RocksDB has transactions and a few other relevant features, it's worth looking at first.

dtribble commented 5 years ago

A feature of RocksDB that appears very relevant:

Support for optimistic transactions, and for both sync and async modes in the Node integration (https://github.com/dberesford/rocksdb-node), also seems relevant.

warner commented 5 years ago

I think I'm going to start with LevelDB, since the wrapper package (leveldown) looks more mature/supported/widely-used than the corresponding wrapper for rocksdb (leveldown has ~100x the weekly downloads of rocksdb). And the extended features are only available through the rocksdb-node package, which has a giant "NOT MAINTAINED" warning on the README.

I don't think we need super-duper performance or those extended features, at least for now. We'll have one transaction every 5 seconds (one per block). But I'm happy to revisit this once we've written some performance benchmarks and get some concrete data.

warner commented 4 years ago

After Agoric/cosmic-swingset#109, the next-lowest-hanging fruit is probably to special-case vat transcripts. We need to read them at startup, but after buildVatController() returns, we never need to read those keys again (although we do constantly write to them, and we both read and write the v$NN.t.nextID field to know which v$NN.t.$MM "MM" keys to use). Maybe we could put the transcript lines in a separate file, perhaps one file per vat, as sketched below. They should always be read in order, which might help avoid memory overhead as we read them back in.
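A sketch of such a per-vat transcript file (append-only during operation, streamed back in order at startup; the file naming and API here are hypothetical):

```js
// Hypothetical per-vat transcript file: append-only during operation,
// streamed back in order exactly once at startup.
const fs = require('fs');
const readline = require('readline');

function openTranscript(vatID) {
  const path = `transcript-${vatID}.jsonl`;
  const out = fs.createWriteStream(path, { flags: 'a' }); // append-only writes

  return {
    append(entry) {
      out.write(JSON.stringify(entry) + '\n');
    },
    async *replay() {
      // read lines in order without holding the whole transcript in memory
      const rl = readline.createInterface({ input: fs.createReadStream(path) });
      for await (const line of rl) {
        yield JSON.parse(line);
      }
    },
  };
}
```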

But going all the way to a proper synchronous DB would be better.

warner commented 4 years ago

In the old repo, this was cosmic-swingset issue 65.