Closed · warner closed this issue 3 years ago
I'm adding this to the Metering Phase, but since adding this feature will expose any existing non-determinism pretty quickly (and we're already finding some from the previous phase), it's really important that we do extensive multi-validator multi-restart testing before the start of that phase.
It occurred to me that we don't even need to do any hashing from the kernel side. Simply writing the block buffer into a single key of the cosmos DB at the end of each block should suffice, because it will be sampled immediately (it will change the AppHash for that same block).
That might consume more space in the state vector, but it will be replaced on the next block, so it won't accumulate (except for archival nodes, which record the full history of the state vector). Recording a hash of the current block buffer, instead of the full contents, would reduce that.
My current idea for the design:

- `controller.js` is responsible for building a SHA256 hash function (out of the Node.js stdlib `crypto` module) and passing it as a kernelEndowment
- `buildCrankBuffer`'s `commitCrank` is updated to:
  - read the previous crankhash value from the `crankhash` key (as 32 hex characters) and fold it into a fresh hasher
  - sort the additions lexicographically; for each addition, update the hasher with `add ${key} ${value}`, then perform the `kvStore.set` as usual
  - sort the deletions lexicographically; for each deletion, update the hasher with `delete ${key}`, then perform the `kvStore.delete` as usual
  - finish with `kvStore.set('crankhash', crankhash)` (as 32 hex characters); the `crankhash` key will be stored as hex, since the kvstore is a string-to-string table
- the `crankBuffer`, `kernel`, and `controller` APIs will be extended with a `getCrankHash()` method, which returns `kvStore.get('crankhash')` (as 32 hex characters)
- the host application can then do:

```js
await controller.run(runPolicy);
swingstore.commit();
const crankHash = controller.getCrankHash();
appState.save('swingset-crankhash', crankHash);
appState.commit();
```

In addition, the `initializeSwingSet` function (which initializes the database) should be updated to save an empty string into `crankhash`.
What is the Problem Being Solved?
In trying to debug #3428, @mhofman and I compared slog traces of two different validators to confirm that they were running the same cranks. He found variation in the vbank messages (#3435), and we also observed differences in the GC Actions sent into vats between validators which had or had not reloaded a vat from snapshot (due to #3428 causing different GC syscalls to be made slightly earlier).
We've been kind of on the fence about how much variation to allow in the swingsets running on different validators. Currently, the only swingset variations that will cause a validator to break consensus are outbound comms/IBC messages (written into the cosmos-sdk "swingset" Module's state) and messages sent to the vbank Module (also saved in its state). Anything else will silently diverge until one of those two values observes a difference.
It's probably best to detect these kernel-level variations as quickly as possible. Ideally we'd store all consensus-sensitive kernel state in the cosmos DB where it can be rolled up into the AppHash along with everything else. We tried this early on, and found the performance was troublesome, but maybe we should consider trying it again.
Description of the Design
We may benefit from a partial approach, something like:

- at each `controller.step()`, read a "previous kernel state hash" value from the cosmos DB, fold in the delivery being performed, and write the updated hash back

That would at least be sensitive to deviations in the deliveries that the kernel is performing. This would have caught the GC Action variations we observed.
The next level would be to hash all the DB changes made during a block (the contents of the "block buffer") and fold them in as well. It wouldn't provide any insight into what the variance was, but it would detect it immediately. We've talked about excluding certain kernelDB keys from consensus (vat snapshot ID, transcript prefix offset); we'd need to implement that before adding the block buffer into the hash.
Consensus Considerations
This would be a consensus-breaking change. If we choose to implement it, we should do so before our first release.
Security Considerations
Most chain environments are entirely transparent, but if we anticipate some sort of hybrid model in which the swingset state is somehow concealed from the host-side application, we must consider the information leakage of this activity hash. This leakage can be prevented by including a full-strength random value (256 bits or more) into each hash computation, and not revealing this value to the host application.
Test Plan
A unit test which somehow modifies the kernel between two replays of the same state (maybe by swapping two items on the run-queue), and verifies that the two instances finish with different kernel activity hash values.