
state-sync import does not repopulate historical transcript-span metadata #8025

Closed warner closed 1 year ago

warner commented 1 year ago

@mhofman did an experiment today, creating a state-sync snapshot export from a mainnet1B chain node (call it node A), then using that snapshot to populate a new node B. He then compared the contents of a full swing-store export (not a state-sync snapshot export) of node A against a similar export of node B.

He discovered that node B's export was missing most of the historical transcript-span metadata records (the ones that contain hashes of transcript entries). We were intending to keep these around in all nodes, even if we delete the spans themselves (the transcript entries, with deliveries and syscalls). With knowledge of the hashes, we could safely repopulate the span contents from downloaded/untrusted sources.

The IAVL tree holds the original export data (the "shadow table"), which includes all these records, so it's not that the data is missing entirely. It's just that it is only in the IAVL tree, not also in the swingstore sqlite DB where it's supposed to live (and where it would be useful to the maybe-some-day repopulation process).
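
For concreteness, export-data records are simple key/value pairs whose key identifies the owning subsystem. The shapes below are illustrative (recalled from the swing-store export format, so treat the exact fields as an assumption):

// Illustrative export-data pairs; exact value fields may differ.
const exportDataPairs = [
  // kvStore shadow entry: key prefixed with 'kv.'
  ['kv.somekey', 'somevalue'],
  // transcript-span metadata: JSON that includes the span's hash
  ['transcript.v1.0', '{"vatID":"v1","startPos":0,"endPos":137,"hash":"..."}'],
  // heap-snapshot metadata
  ['snapshot.v1.137', '{"vatID":"v1","snapPos":137,"hash":"..."}'],
];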

We aren't sure where things went wrong. The swingstore importer has several different modes, and tests to exercise them all. But we haven't had much luck automating tests that involve creating a chain, and the validation of state-sync was mostly concerned with making sure the newly-created node was working properly (and these historical records aren't needed for normal operation, so most tests wouldn't notice them going missing). It's possible that the chain-side state-sync export process is somehow missing them, but @mhofman's investigation suggests that it's more likely that swingstore is failing to add the incoming records to the DB, or failing to commit them, or something similar.

So the tasks for this ticket are:

  • figure out where in the export/import pipeline the metadata records are being lost
  • fix the swingstore importer so that every incoming metadata record is stored and committed
  • repopulate the missing metadata on nodes that were launched from a state-sync snapshot

For the last task, we're thinking of adding a swingstore method with a name like fixMetadata or repopulateExportData. This method would either consume an iterator of export-data key/value pairs, or it would return an object that expects to be driven with a series of these pairs (plus a stop signal). The code would examine the pairs to ignore any that relate to the kvStore table: we do not want to touch those. But for pairs that relate to the snapshots or transcriptSpans tables, it should populate any missing entries. I think it should compare the incoming export-data pair against the sqlite row, bail if the data is already in sqlite but is somehow different (the tool should be strictly adding data, not modifying existing rows), ignore the pair if the data is already in sqlite and matches, and add the data to sqlite if it's missing entirely. Then the method should do a commit, or make it easy for the caller to do so.
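
A minimal sketch of the second variant (an object driven with pairs), assuming better-sqlite3 (which swingstore uses); the factory name and the table/column names are illustrative, not the real schema:

import Database from 'better-sqlite3';

// Sketch only: makeMetadataFixer, table, and column names are illustrative.
export function makeMetadataFixer(dbPath) {
  const db = new Database(dbPath);
  const getSpan = db.prepare(
    'SELECT metadata FROM transcriptSpans WHERE vatID=? AND startPos=?',
  );
  const addSpan = db.prepare(
    'INSERT INTO transcriptSpans (vatID, startPos, metadata) VALUES (?,?,?)',
  );
  return {
    consumePair(key, value) {
      const [tag, vatID, pos] = key.split('.');
      if (tag === 'kv') return; // never touch kvStore shadow entries
      if (tag === 'transcript') {
        const row = getSpan.get(vatID, Number(pos));
        if (row === undefined) {
          addSpan.run(vatID, Number(pos), value); // strictly additive
        } else if (row.metadata !== value) {
          // existing rows must match exactly: we add data, never modify it
          throw Error(`mismatched metadata for ${key}`);
        } // else: already present and matching, so ignore
      }
      // 'snapshot' pairs would get the analogous treatment
    },
    commit() {
      // better-sqlite3 autocommits individual statements; a real version
      // would wrap the whole run in db.transaction() and commit here
      db.close();
    },
  };
}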

The actual chain-side upgrade handler will then call this method, once, before the kernel controller is built (specifically before the swingstore is built). It will need to iterate the IAVL tree (the portion of the namespace dedicated to swingstore state-sync export data), feeding everything into the fixMetadata method, and ensuring it gets committed. Once complete, the swingstore should be intact again.
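
With the sketch above, the upgrade handler's side is little more than a loop (iavlExportDataPairs is a hypothetical iterable over the swingstore portion of the IAVL namespace):

const fixer = makeMetadataFixer(dbPath);
for (const [key, value] of iavlExportDataPairs) {
  fixer.consumePair(key, value);
}
fixer.commit(); // swingstore is intact again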

We probably don't want this fixMetadata process to emit new export-data to a callback (which might happen if we repopulated the records using a normal swingstore object), since the data will already be present in the IAVL side (the normal provider of that callback, and recipient of the updates).

warner commented 1 year ago

Also note that the state-sync import process works by creating a (large) temporary file and sending the filename from the golang side to the JS side, rather than sending all the actual export-data records and artifacts through that bridge. The repopulation process may be easier to implement if we re-use that same logic, at least on the golang side.

warner commented 1 year ago

On the export side, chain-main.js calls spawnSwingStoreExport without an exportMode, so the argv array it creates will lack an --export-mode=, so the same file (but in a child process) will get exportMode = undefined, so initiateSwingStoreExport will get the same, so makeExporter (which is really makeSwingStoreExporter) will default to exportMode='current', which includes all metadata (and merely omits the old artifacts).

So I believe the state-sync snapshots contain the expected metadata, but only current artifacts (no historical ones).

Later, the import side is called when chain-main.js calls performStateSyncImport without the includeHistorical option, so importDB (which is really importSwingStore) is called with includeHistorical: undefined, which importSwingStore defaults to false.

Here, the implementation of importSwingStore unfortunately depends too much on the includeHistorical mode. Each export-data (metadata) record is examined to see which subsystem owns it: kvStore records are stored immediately, but bundle/transcript/snapshot records are parsed and held in RAM for later:

https://github.com/Agoric/agoric-sdk/blob/6e5b422b80e47c4dac151404f43faea5ab41e9b0/packages/swing-store/src/swingStore.js#L1015-L1034

This artifactMetadata Map is never enumerated, and it is only consulted if (1) the importer knows about a vat (so it looks for the current transcript span, and possibly a heap snapshot), or (2) includeHistorical is true and the getArtifactNames() query yields a matching artifact name.

The metadata keys are given to the importer, but they're only stored in the DB if there is a matching artifact (which there isn't, because we did exportMode='current' instead of 'archival'), and if the importer is looking for them (which it isn't, because includeHistorical: false).
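
In rough paraphrase (not the verbatim code behind the link above), the problematic dispatch looks like this:

// Paraphrased: kvStore shadow entries are written immediately, while all
// other metadata is parked in RAM and only reaches the DB if something
// later asks for a matching artifact.
const artifactMetadata = new Map();
for await (const [key, value] of exporter.getExportData()) {
  const [tag] = key.split('.');
  if (tag === 'kv') {
    kvStore.set(key.substring('kv.'.length), value);
  } else {
    artifactMetadata.set(key, JSON.parse(value));
  }
}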

Fixing Import To Not Omit Metadata / Export-Data Keys

To fix this, we need to rewrite the importer to store every metadata key that it receives in the export-data, regardless of includeHistorical and/or the presence of a matching artifact. This will require some new code in snapStore.js, which only has a SQL query to insert a row with a full snapshot, which of course we do not have yet.

Then, after all the metadata is stored, the importer should walk the list of known vats, and import the current transcript span (and possibly a heap snapshot) for each, as it does now. These are mandatory artifacts, so this happens without calling getArtifactNames (and the import fails if any are missing).

After that, it should call getArtifactNames to find all the optional artifacts, skipping the ones it already imported, and validating the rest. This is the portion that includeHistorical: false could omit, but I'm not entirely convinced that this option is useful: I think the importer should take everything it's given, and leave any pruning up to the exporter.
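
Sketched end-to-end (the helper names here are hypothetical, just to show the ordering):

// 1: persist every metadata record unconditionally, as it arrives
for (const [key, metadata] of artifactMetadata.entries()) {
  storeMetadataRecord(key, metadata); // new inserts in snapStore/transcriptStore
}
// 2: mandatory artifacts: the current transcript span (and any current heap
// snapshot) for each known vat; the import fails if one is missing
for (const vatID of knownVats) {
  await importMandatoryArtifacts(vatID);
}
// 3: optional artifacts: whatever getArtifactNames() offers, minus the ones
// already imported, each validated against the stored metadata
for await (const name of exporter.getArtifactNames()) {
  if (!alreadyImported(name)) {
    await importAndValidateArtifact(name);
  }
}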

I have a new set of unit tests in mind: submit an explicit set of metadata and artifacts, then examine the swingstore afterwards to ensure that all the metadata was recorded, whether or not the optional artifacts were included.

Repopulating (Accidentally-Omitted) Metadata / Export-Data Keys

Once that is fixed, we need some new swingstore APIs to repair the mainnet chain nodes that were launched from a state-sync snapshot (and are now missing the historical metadata). The main one will be named repopulateExportData:

  • this will be a hostStorage method on a swingstore instance, to give the host control over the commit point
  • hostStorage.repopulateExportData(exporter) will take the same exporter as importSwingStore

Then a chain upgrade handler can walk the IAVL tree for previously-exported metadata keys, and submit all of them to repopulateExportData(). If that completes successfully, the swingstore will be complete once again.

At this point in the upgrade handler, our plan is to have the host app use makeSwingStoreExporter() to retrieve all of the metadata keys and move them to a new portion of the IAVL tree. However, that uses a separate DB connection, and thus depends upon the data being committed, and I know @mhofman was hoping to have just a single commit point for the whole chain upgrade handler. The repopulateExportData process is idempotent, however, so I think it'd be ok to hostStorage.commit() the swingstore at that point (in addition to a later commit when the first post-upgrade block is finished).

Repopulating (Previously-Pruned) Artifacts

While doing this work, I'll be refactoring the swingstore import code, so it's a good time to at least plan for a second API, which will be used in the future to repopulate pruned artifacts like old transcript spans. To this end, we sketched out repopulateArtifacts(). This would take a swingstore dbPath (like importSwingStore), and is meant to be run from a standalone process (drawing from a zipfile that contains the artifacts to be imported).

The idea is:

The API will look like:
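
Purely as an illustration (the parameter names are my assumptions, patterned on importSwingStore):

/**
 * Illustrative sketch only, not a settled signature.
 *
 * @param {string} dbPath path to an existing swing-store, as for importSwingStore
 * @param {object} artifactSource supplies getArtifactNames/getArtifact, e.g. backed by a zipfile
 */
async function repopulateArtifacts(dbPath, artifactSource) {
  // validate each offered artifact against the metadata already in the DB,
  // then insert the ones that pass
}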

mhofman commented 1 year ago

This will require some new code in snapStore.js, which only has a SQL query to insert a row with a full snapshot, which of course we do not have yet.

I'd like it if we could consider landing https://github.com/Agoric/agoric-sdk/pull/7635, since it has significant and partially overlapping changes in that area already.

  • this will be a hostStorage method on a swingstore instance, to give the host control over the commit point
  • hostStorage.repopulateExportData(exporter) will take the same exporter as importSwingStore

I am worried about tackling this on the regular hostStorage API. The method has to be async given the shape of the exporter, and really during this repopulation, nothing else should touch the swing-store. Why can't this be a top-level function that returns the swing-store after repopulation is complete, like importSwingStore?

The main one will be named repopulateExportData

We only care about the artifact metadata in this export data and shouldn't import the kvStore data part of it. How about repopulateArtifactMetadata? The exporter shouldn't need to understand the shape of keys in the export data.

I know @mhofman was hoping to have just a single commit point for the whole chain upgrade handler. The repopulateExportData process is idempotent, however, so I think it'd be ok to hostStorage.commit()

I was never planning on a single commit point. As you say, this is idempotent, so safe to commit early.

then we ship the chain software upgrade that includes the new xsnap and performs the replays at startup

  • (or use @mhofman's trick to pre-compute as much as possible)

So this brings an interesting use case: import artifacts and the related metadata, overriding current content. This makes me kinda want to go back to a single repopulateArtifacts with the following signature and semantics:

/**
 * @param {string} dbDir Path to swing-store DB
 * @param {SwingStoreExporter} exporter the exporter containing artifact information to use
 * @param {object} [options]
 * @param {boolean} [options.overwriteExisting=false] Whether to overwrite existing artifacts and their metadata
 * @param {'operational' | 'latest' | 'archival' | 'debug'} [options.toLevel='operational']
 */
function repopulateArtifacts(dbDir, exporter, options) {}

repopulateArtifacts would always call getExportData and always verify the toLevel consistency
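
Hypothetical usage of that single method, with the overwrite behavior left off:

// assumes a swing-store at dbDir and an exporter wired to the artifact source
await repopulateArtifacts(dbDir, exporter, { toLevel: 'archival' });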

I believe the above method would cover all 3 use cases we have:

  • repopulating the export-data metadata that the importer accidentally omitted (this ticket)
  • repopulating previously-pruned artifacts, like old transcript spans
  • importing precomputed artifacts and their metadata, overriding current content

For now we can leave the overwrite part out, but basically I am no longer convinced that the 2 kinds of artifact-import use cases we have today are sufficiently different to warrant 2 different methods.

warner commented 1 year ago

This will require some new code in snapStore.js, which only has a SQL query to insert a row with a full snapshot, which of course we do not have yet.

I'd like it if we could consider landing #7635, since it has significant and partially overlapping changes in that area already.

I'll read up on that.

  • this will be a hostStorage method on a swingstore instance, to give the host control over the commit point
  • hostStorage.repopulateExportData(exporter) will take the same exporter as importSwingStore

I am worried about tackling this on the regular hostStorage API. The method has to be async given the shape of the exporter, and really during this repopulation, nothing else should touch the swing-store. Why can't this be a top-level function that returns the swing-store after repopulation is complete, like importSwingStore?

I don't object to that plan, but I was thinking about how the cosmic-swingset upgrade path will be calling this. I was assuming that the upgrade handler code would have already opened the DB (holding a hostStorage), and needs to mutate it in-place, and then keep using that same handle, so something like:

import { openSwingStore } from '@agoric/swing-store';

const { hostStorage, kernelStorage } = openSwingStore(dbPath);
await doRepopulation(hostStorage, IAVLStuff);
await buildController(kernelStorage);

If, instead, the upgrade handler is likely to get control well before the swingstore is opened "for real", then I'd make it:

import { repopulateExportData, openSwingStore } from '@agoric/swing-store';

await doRepopulation(dbPath, repopulateExportData, IAVLStuff);

const { hostStorage, kernelStorage } = openSwingStore(dbPath);
await buildController(kernelStorage);

I don't think I'd have repopulateExportData return anything: it would do the final commit() internally, and its job is then done. Acquiring a hostStorage/kernelStorage pair is then up to the caller to do separately.

The main one will be named repopulateExportData

We only care about the artifact metadata in this export data and shouldn't import the kvStore data part of it. How about repopulateArtifactMetadata? The exporter shouldn't need to understand the shape of keys in the export data.

Yeah, I'm ok with that. Agreed that the exporter should not be parsing keys. So the change to the API description is that any new or different kvStore shadow entries cause an error+abort.

then we ship the chain software upgrade that includes the new xsnap and performs the replays at startup

  • (or use @mhofman's trick to pre-compute as much as possible)

So this brings an interesting use case: import artifacts and the related metadata, overriding current content. This makes me kinda want to go back to a single repopulateArtifacts with the following signature and semantics:

For now we can leave the overwrite part out, but basically I am no longer convinced that the 2 kinds of artifact-import use cases we have today are sufficiently different to warrant 2 different methods.

I'm gonna disagree. My biggest reason is that any metadata import is an act of trust/reliance/security. I really prefer having a strong separation between APIs which allow the data provider to compromise integrity, vs those which do not. Having an option (even with a safe default) to enable/disable this doesn't feel visible enough. And, having consensus-sensitive behavior driven by an option is how we got into this state.

My second reason is that precompute is far enough out that we shouldn't compromise the API to accommodate it yet (I'm not convinced that we understand the needs well enough to get the API right yet). If I squint, I can imagine that the precompute step will be run as a bunch of separate processes, one per vat, and they follow the deliveries made on the real kernel (maybe they snoop the real swingstore), and they accumulate state in a set of pre-compute DBs (one per vat), and then when we finally trigger the upload, there's a merge step. It has to delete the current-incarnation transcript spans and replace them from the precompute DB, and provide a heap snapshot, but what else might change? It's going to depend upon how much XS variation we're willing to tolerate. If GC is different then we might be talking about reconciling c-lists or vatstorage between the two sides, which means some per-vat kvstore keys are deleted or changed, which goes way beyond discrete artifacts.

So I want to stick with two APIs, one "trusting" (integrity relies on the data provider), one "safe" (data provider cannot violate integrity).

Also, looking at your proposed API, it makes me realize that we're abusing the behavior of exporter as a sort of additional options bag. If my API description for the "trusting" API states that it won't call exporter.getArtifactNames or exporter.getArtifact, then maybe it shouldn't actually take an Exporter: maybe it should only take a function with the same signature as exporter.getExportData. The simplest way to not violate a calling convention is to delete it.

Then the "safe" API would take { getArtifactNames, getArtifactData }, and not the full Exporter type.

The net result would be:

  • repopulateArtifactMetadata(dbDir, getExportData): the "trusting" API, where integrity relies on the data provider
  • repopulateArtifacts(dbDir, { getArtifactNames, getArtifact }): the "safe" API, where the data provider cannot violate integrity

both of which open their own DB connection, do their work, then either commit+fulfill or abort+reject.

And, repopulateArtifacts is lower-priority, and out-of-scope for this ticket. I'll keep it in mind when doing the refactoring, and maybe have a branch where it's implemented, but I don't want to hold up the primary feature on it (or on building tests for it, which is absolutely required before that part lands).

mhofman commented 1 year ago

I was assuming that the upgrade handler code would have already opened the DB (holding a hostStorage), and needs to mutate it in-place, and then keep using that same handle

I am actively refactoring the init logic to assert that init hasn't already happened in these cases. I will also parametrize init to indicate whether it's a genesis init or not.

I already added a check on the cosmic-swingset side in https://github.com/Agoric/agoric-sdk/pull/8043

I don't think I'd have repopulateExportData return anything: it would do the final commit() internally, and its job is then done. Acquiring a hostStorage/kernelStorage pair is then up to the caller to do separately.

I'd prefer to keep the "import" API consistent.

warner commented 1 year ago

I don't think I'd have repopulateExportData return anything: it would do the final commit() internally, and its job is then done. Acquiring a hostStorage/kernelStorage pair is then up to the caller to do separately.

I'd prefer to keep the "import" API consistent.

Would that mean repopulateExportData returns the same objects as openSwingStore? That seems odd to me: the repopulation code doesn't need to know what the user-facing kernelStorage/hostStorage API is. And isn't this going to be called in an upgrade handler which already has code to call openSwingStore() in the non-upgrade case? Is the upgrade handler going to provide hostStorage to the non-upgrade path somehow?

I'll start working on the new API, but I'll check in with you on this question before making the PR, to make sure it lines up with your code on the cosmic-swingset side.