Closed · warner closed this issue 3 years ago
I'm adding this to the Metering Phase, but since adding this feature will expose any existing non-determinism pretty quickly (and we're already finding some from the previous phase), it's really important that we do extensive multi-validator multi-restart testing before the start of that phase.
It occurred to me that we don't even need to do any hashing from the kernel side. Simply writing the block buffer into a single key of the cosmos DB at the end of each block should suffice, because it will be sampled immediately (it will change the AppHash for that same block).
That might consume more space in the state vector, but it will be replaced on the next block, so it won't accumulate (except for archival nodes, which record the full history of the state vector). Recording a hash of the current block buffer, instead of the full contents, would reduce that.
My current idea for the design:

- `controller.js` is responsible for building a SHA256 hash function (out of the Node.js stdlib `crypto` module) and passing it as a kernelEndowment
- `buildCrankBuffer`'s `commitCrank` is updated to:
  - read the previous crankhash value from the `crankhash` key (as 32 hex characters) and fold it into a fresh hasher
  - sort the additions lexicographically; for each addition, update the hasher with `add ${key} ${value}`, then perform the `kvStore.set` as usual
  - sort the deletions lexicographically; for each deletion, update the hasher with `delete ${key}`, then perform the `kvStore.delete` as usual
  - finish with `kvStore.set('crankhash', crankhash)` (as 32 hex characters); the `crankhash` key will be stored as hex, since the kvstore is a string-to-string table
- the `crankBuffer`, `kernel`, and `controller` APIs will be extended with a `getCrankHash()` method, which returns `kvStore.get('crankhash')` (as 32 hex characters)
- the host application can then do:

```js
await controller.run(runPolicy);
swingstore.commit();
const crankHash = controller.getCrankHash();
appState.save('swingset-crankhash', crankHash);
appState.commit();
```

In addition, the `initializeSwingSet` function (which initializes the database) should be updated to save an empty string into `crankhash`.
What is the Problem Being Solved?
In trying to debug #3428, @mhofman and I compared slog traces of two different validators to confirm that they were running the same cranks. He found variation in the vbank messages (#3435), and we also observed differences in the GC Actions sent into vats between validators which had or had not reloaded a vat from snapshot (due to #3428 causing different GC syscalls to be made slightly earlier).
We've been kind of on the fence about how much variation to allow in the swingsets running on different validators. Currently, the only swingset variations that will cause a validator to break consensus are outbound comms/IBC messages (written into the cosmos-sdk "swingset" Module's state) and messages sent to the vbank Module (also saved in its state). Anything else will silently diverge until one of those two values observes a difference.
It's probably best to detect these kernel-level variations as quickly as possible. Ideally we'd store all consensus-sensitive kernel state in the cosmos DB where it can be rolled up into the AppHash along with everything else. We tried this early on, and found the performance was troublesome, but maybe we should consider trying it again.
Description of the Design
We may benefit from a partial approach, something like:

- at each `controller.step()`, read a "previous kernel state hash" value from the cosmos DB, fold in the delivery being performed, and write the updated hash back

That would at least be sensitive to deviations in the deliveries that the kernel is performing. This would have caught the GC Action variations we observed.
The next level would be to hash all the DB changes made during a block (the contents of the "block buffer") and fold them in as well. It wouldn't provide any insight into what the variance was, but it would detect it immediately. We've talked about excluding certain kernelDB keys from consensus (vat snapshot ID, transcript prefix offset); we'd need to implement that before adding the block buffer into the hash.
Consensus Considerations
This would be a consensus-breaking change. If we choose to implement it, we should do so before our first release.
Security Considerations
Most chain environments are entirely transparent, but if we anticipate some sort of hybrid model in which the swingset state is somehow concealed from the host-side application, we must consider the information leakage of this activity hash. This leakage can be prevented by including a full-strength random value (256 bits or more) into each hash computation, and not revealing this value to the host application.
Test Plan
A unit test which somehow modifies the kernel between two replays of the same state (maybe by swapping two items on the run-queue), and verifies that the two instances finish with different kernel activity hash values.