add "flight recorder", triggered by consensus failure

warner commented 3 years ago

What is the Problem Being Solved?

The last three consensus-failure divergence bugs we've found were resolved by comparing the final block's slogfile data between the failing validator and one of the surviving ones. With the #3442 activityhash included in the Cosmos state, the divergence is discovered before moving on to the next block, and the slogfile records the divergent cranks even though Cosmos does not (necessarily?) commit the failing block, so the slogfile has always (so far) included fairly precise evidence of the nature of the divergence.

We need to be able to get this data from validators that aren't collecting slogfiles all the time.

I'm thinking of something like an airplane's "black box": the flight recorder that captures the last N minutes of activity, and is designed to survive a crash. For us, this would capture the slogfile entries of the most recent block. Once the validator proceeds past the point of a potential consensus failure, we no longer need to retain that data.

Description of the Design

I'm not sure about the cosmos-sdk code paths involved, but I'm thinking something like this:

we enhance SwingStore to have an additional component, tenatively named flightRecorder, which sits next to kvStore and snapStore and streamStore
when the kernel creates the slogger (which may or may not have a filehandle/filename for permanent recording), it arranges to send slog writes to the flightRecorder as well
the SwingStore flightRecorder holds these entries in RAM, and deletes them upon commit
we change the cosmos-sdk consensus-failure display code to query the SwingStore to fetch the contents of the flight recorder, write them to a file, and display the filename along with the rest of the message
we include an admonition to submit the flight recorder data to a central location, or to include it in an attachment in a bug report
maybe take a hash of the contents (omitting the always-variable .time fields) and include $HEIGHT-$HASH in the instructions, so that anyone else experiencing that same problem will get the same string and will know it's a duplicate

I can think of a couple of reasons why we'd need to implement this differently:

Does cosmos-sdk perform the SwingStore commit before checking the AppHash? If so, commit would delete the flight recorder data too early. We'd need to add some additional APIs to manage the lifetime of that data.
Can the consensus-failure -emitting code talk to the JS side synchronously, to retrieve the data before emitting the message? I know there are usually goroutines and asynchronous message ports involved. If not, maybe the flight recorder needs to constantly write into a pre-named file, and we just add that filename to the failure-announcing message.
- This form creates more wear and tear on the disk, especially because something in the commit process is undoubtedly doing an fsync or two. To avoid this, it would be fine (and slightly preferable) to exclude the flight recorder file from this fsync, if possible, because we're not worried about the data being preserved against sudden kernel shutdown, merely against sudden process shutdown. As long as the flight recorder file writes are not buffered in userspace, and actual write(3) syscalls can be buffered in the kernel without problems, as they'll still be written to disk when the process exits.
- (There is a more sophisticated technique available: a watchkeeper process/thread opens a shared block of memory with the validator process, the flight recorder data is written into that memory in a circular buffer, and the watchkeeper survives the termination of the validator. This doesn't need to touch the disk, but is more complicated to set up, and would be more challenging to integrate into systemd -type supervisors)

cc @michaelfig @JimLarson for thoughts about how to integrate this into cosmos-sdk

cc @FUDCo for thoughts about SwingStore (we should probably refactor/consolidate SwingStoreLMDB first)

Security Considerations

The flight recorder data contains the internal processing steps triggered by the shared input/transaction stream of a consensus-based SwingSet. A solo swingset would consider this to be private: even if the externally-supplied messages are known to another party (at least to the sender), the internal consequences are not. But in a consensus swingset, this data is known to all the other members.

So we should make it clear that the flight recorder data is privacy-equivalent to the chain's internal state, and should only be shared with someone else who could already know that state. Practically speaking, this is everybody, because chains are transparent. But when we add this feature, the swingset-side API should make a stronger statement, and the cosmic-swingset/chain -side should be where we acknowledge that the contents are effectively public.

Test Plan

We can write unit tests for the SwingStore portions. I don't know how to test the cosmic-swingset side. Ideally we'd have a unit test that somehow triggers the consensus failure and examines the output message for the "please upload the black box" string, and examines the contents of the named file, but I imagine that test would be hard to write. Maybe @michaelfig or @JimLarson have some better ideas.

FUDCo commented 3 years ago

cc @FUDCo for thoughts about SwingStore (we should probably refactor/consolidate SwingStoreLMDB first)

This makes a lot of sense to me. For the record, the existing issue for that is #3431. I will move it to my "in progress" column.

michaelfig commented 3 years ago

Does cosmos-sdk perform the SwingStore commit before checking the AppHash?

x/swingset raises a COMMIT_BLOCK event during Cosmos-SDK committing its merkle tree to disk (and concurrently computing the AppHash for that block). cosmic-swingset uses that event to commit the SwingStore as well.

The next proposal and results of the vote is what succeeds is what decides if the committed-to-store AppHash is in consensus, or if the node needs to halt.

It would be even more useful if somehow the current state (Tendermint, Cosmos, and JS) wasn't committed until the AppHash was voted to be in consensus, sometime when the AppHash and the next block's transactions were confirmed before the BEGIN_BLOCK of the following block. That's a change that may have wider repercussions, but would be interesting to explore.

michaelfig commented 3 years ago

It would be even more useful if somehow the current state (Tendermint, Cosmos, and JS) wasn't committed until the AppHash was voted to be in consensus, sometime when the AppHash and the next block's transactions were confirmed before the BEGIN_BLOCK of the following block. That's a change that may have wider repercussions, but would be interesting to explore.

That's probably the definition of "equivocation". Too bad.

Tartuffo commented 2 years ago

Rolling slog file with some retention is a possible short term solution. We can record crank hash for correlation. Could also time slice kernel database (but that is hard, maybe MN-2 or MN-3).

Tartuffo commented 2 years ago

One part is instructing the node to write the slog file, which we have an env variable for, can point to file or Unix named pipe. Could pipe to a tool which knows how to roll the files. Small code changes on the cosmic-swingset side (when build controller, here is a function to write to the slog file). Michael: would like to use slog hooks.

Need to give instructions to validators in terms of best practicers (or give them a start script?). Initially, we should default to collection / rolling.

warner commented 2 years ago

I'll add a runtimeOptions entry to the controller, to let the host application provide a function which receives about-to-be-slogged objects (with type: etc). cosmic-swingset can then do some thing smarter, like serialize it to a file descriptor that the validator's management tooling (systemd/etc) can direct to a rotating-logger tool. Or cosmic-swingset can rotate the files itself. I'll leave runtimeOptions.slogFile in place for the benefit of our usual debugging flow.

I might hijack runtimeOptions.slogCallbacks for this. Currently this is a hook that lets the host application get involved in the full slog API (slogger.delivery(), slogger.syscall(), etc). @michaelfig and I roughly think that it'd be easier to let the slogger move from the API calls to "slog objects" ({type:, when:, .. }), and give the slog object to the hook, which can decide about serialization and destination. This lets us change the slogger API without needing host application hooks to change too.

warner commented 2 years ago

We include the crankhash in the consensus state, which captures all changes to the kvstore. This ought to capture any kernel divergence within a single block, as well as any vat divergence that is visible as syscall divergence. Our validator guide instructions should direct most validators to retain a day or two of slog data, and to send us that data if their node ever falls out of sync. Then we can build a tool to correlate these slogfile fragments and identify the point of divergence.

@michaelfig points out that in the longer run, we really want to be able to snapshot kernel kvStore (swingstore) state, for the benefit of "state sync" as applied to kernel state (#3769). If we had that now, we could ask the out-of-sync validators for a copy of their entire kernel DB, and compare those. But 1: we don't, and 2: the slog data is probably a lot smaller than the whole kernelDB.

Tartuffo commented 2 years ago

If we can allow validators to write slog entries out to files, and preserve so we can figure out failures.
We need investigate wether slog writes are expensive / synchronous or not. Need config option on AgD that allows slog messages to be written that is conducive to operations.

Tartuffo commented 2 years ago

We should default to slog files are on, into a different file every hour, and the Validator can use whatever tool they want to not overflow disk. Cosmic Swingset needs to take an arg as to where the slogs should go.

warner commented 2 years ago

From today's meeting, our conclusion was:

agd/cosmic-swingset should have a configuration setting for a target directory and some file-rotation parameters
agd should use the swingset writeSlogEvent hook to write rolling files to that directory, all the time
the defaults should write the files to the same directory that holds all other swingset state, and rotate once an hour, and keep maybe a few days of data
our validator documentation should have something like: "if your node announces a consensus failure, please look in directory XYZ for the most recent slog events. Please compare them against the other validators to see if there is a clear point of divergence. Agoric may be able to help with the investigation if you can provide those files."

We'll probably need to write some tools to compare two such files, remembering that every validator will be rotating the files at different times. We can probably defer writing those tools until we need them.

We don't need any swingset-specific functionality for this: the host application above swingset (eg agd) should be in charge of writing to files, etc.

Given the inclusion of the crankhash in each block, it seems likely that any kernel divergence will be detected within a single block, so only the slog data from the last 10-ish seconds is likely to do any good. Knowing when the kernel was last restarted may be useful too, that is recorded in the slogfiles too. If divergence happens shortly after reboot, then the last few minutes of data may catch it. The case that might not be caught is if rebooting is the root cause, but the diverging vat doesn't get used for quite a while. In any case, we'd start with the most recent minute of slog data and then work backwards if necessary, so it'd be nice if we could go back to the validator and ask for older data.

warner commented 2 years ago

@michaelfig is working on a change to add a "write slog object" hook to the kernel, which is what cosmic-swingset will use to get at the slog data.

This plan puts most of the work on the cosmic-swingset side, so I'm reassigning to mfig.

Since he mentioned that Node.js seems to make immediate (non-async) fs.write calls pretty expensive, one trick I've always wanted to use would be mmap() (I hinted at this above, but I realized you don't need a separate watcher process). I have no idea how to make this work under Node.js, but if it were some other language, I'd mmap() a 100MB file, reserve a few bytes for offsets, then use the rest as a circular buffer to hold length-prefixed records. The agd process would keep writing new events into the RAM window and updating the start/end pointers, with some sequencing to make sure the record gets written before the pointer is updated. Then, no mattter how abruptly the process is terminated, as long as the host computer itself is still running, the on-disk file would contain the most recent state, and anybody who reads the file will get the most recent state. The host kernel (linux) is under no obligation to flush it to disk any particular time, but knows when reads happen, so there's no coherency problem, and the speed is unaffected by disk write speeds.

We'd need to write a tool to read the circular buffer from the correct offsets and emit the slog records in the right order, but that should be trivial. The trickiest part I imagine is representing this within Node.js. Maybe with some sort of magic SharedArrayBuffer??

Another thought is to write the slog records to a pipe, and put a C/Rust/etc program on the other end of that pipe, which knows how to mmap() and update the circular buffer. It'd be harder to deploy, though, (we'd have to ship two binaries, the main agd plus the helper log-writer), but wouldn't require any Node.js magic.

Agoric / agoric-sdk