build external refcount checking tool

warner commented 10 months ago

As mentioned in #8756 :

We could check the need for remediation by building an external refcount checker: a tool that walks the DB and looks at each vat's vatstore, finds references from virtual data, adds them up, and compares against the rc count in the DB. We can do the same for ir recognizer records and the keys of weak virtual collections. This bug would manifest as a difference between the two numbers.

The task for this ticket is to build that tool, and run it against the current mainnet database, to see if we have any problems. I suspect everything is fine, but it might show evidence of #8756, or of some other as-yet-discovered GC problem.

The tool should check both kernel and vat refcounts for consistency. The starting point should be a copy of the kvStore (just a new DB with CREATE TABLE kv (key STRING, value STRING) and a copy of the whole kvStore), which will be a lot smaller and easier to work with than the full swingstore.

Then I'd parse that data into a better-organized third DB. It should be parsed into a kernel-object table and a c-list table. Then the keys for each vatstore should be parsed and copied into tables for:

virtual-object Kind metadata
virtual-object state data (vom.${vref})
virtual-collection metadata (e.g. vc.5.|schemata)
virtual-collection entries (the encodedKey -> marshalledValue "forward" entries) (vc.5.${encodedKey})
virtual-collection ordinal entries (the vref -> ordinal "reverse" entries) (vc.5.|${vref})
refcounts (the rc entries)
export-statuses (es entries)
recognizer entries (ir entries)

Kernel Refcounts

The kvStore should be parsed into a kernel-object table, with columns for:

koid (the object id kref)
owner: a vatID, or NULL for abandoned objects
reachable and recognizable counts from the ${kref}.refCount entries

and also a c-list table:

vatID
kref
vref
exported: 1 for c-list entries whose vref is o+, 0 for o-
reachable flag (boolean)
recognizable flag (boolean)

In general, the "reachable" refcount should equal the sum of the exported=0 AND reachable=1 entries in the c-list table. The "recognizable" refcount is computed similarly. We ignore the exported=1 entries, because their reachable flag indicates whether the kernel can reach the exporting vat's object, so it doesn't count towards the refcount.

Some objects might have a pinned refcount (via controller.pinVatRoot()), we might have elected to do this to the bootstrap vat's root object, but maybe not. Any other discrepancies might indicate a kernel GC problem.

Vat Refcounts

In addition to parsing the kvStore into vatstore entries for each vat, and then parsing the vatstore entries into virtual-thing -specific tables, the tool should also scan the vom virtual-object state records for vrefs, and populate a table that says e.g. "virtual object o+d10/12 has a strong reference to vref o-3". This table would also have rows for collections which point to other vrefs, maybe also through a third table that captures both the collection's vref and the vref of the key (the implication being that both source vrefs must be alive to keep the target vref alive). And some variant of this for weak collections would be good.

Then, the tool should add up the inbound strong references from both virtual-object state and collections, and compare that count against the rc entry. If they don't match, we've got a bug (perhaps #8756, perhaps a new one).

It should also compare the es export status with the matching c-list entry for that vat. If es claims the exported vref is reachable, the c-list should say R. If es says merely-recognizable, it should say _.

Finally, it should check the recognizability records for consistency. Vrefs that are used as keys in weak collections will appear both in the collection entry rows (as an ordinal), and there will be an ir record mapping from the vref to the collection ID. If we see one without the other, we've got some other kind of bug.

mhofman commented 10 months ago

If we're gonna do this, could we not also do a full trace to detect orphans and unreferenced cycles?

warner commented 10 months ago

My immediate need is to tell whether #8756 needs remediation, or if we can simply deploy a fix and not worry about cleaning up old data, and that just need to look for refcount discrepancies.

This tool would certainly be the basis for a full mark-and-sweep trace, but I'd consider it extra credit, and I'd like whoever implements this to answer the remediation question before doing the extra work.

warner commented 2 months ago

Talking with @siarhei-agoric , we decided that many modes of this tool do not really need a separate intermediate DB. For example, auditing for refcount consistency within a single vat can do:

walk all virtual-collection and virtual-object rows of the vatstore
accumulate refcounts in a RAM-based JS Map
once the walk is complete, iterate through the Map and check the refcount for each
also walk the refcount section of the vatstore and check map.has() for each, to look for stranded refcount rows

That requires RAM in proportion to the number of vrefs in use by the vat, perhaps 1-10M for the larger ones, but does not require storing full details about each edge.

If the tool detects a mismatch, we could re-run it in a verbose mode, with a specific set of vrefs to look for (repeating the full walk to gather more data), but with luck it will report "all refcounts match" and we don't need to spend the extra effort.

Agoric / agoric-sdk

build external refcount checking tool #8759

Kernel Refcounts

Vat Refcounts