Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0
326 stars 205 forks source link

price-feed vats hold old QuotePayments in recovery set, causing storage leak #8400

Closed warner closed 3 months ago

warner commented 1 year ago

Describe the bug

I spent a few days investigating storage usage on mainnet, and learned that about 60% of our kvstore entries are the result of a bug in the price feed contracts, and this count grows by about 10 rows for every new price quote created.

External oracles submit messages (signed txns) into the chain with claims about the e.g. ATOM/USD price at a given time. These get executed as contract operations (create an invitation with args that include the price claim, make an offer, get a seat, etc). The price feed contract then publishes a price quote to subscribers, such as an aggregator that averages/etc quotes from multiple sources to drive Vault creation or liquidation.

While the authority to produce a quote is limited by virtue of the wallets which have access to the price feed's objects, the quote objects themselves might be delivered to arbitrary subscribers (over untrusted paths), and those subscribers want to verify that the quote object they received is genuine. The current implementation uses ERTP Payment objects to represent the price quotes, and subscribers can call E(priceQuoteIssuer).getAmountOf(allegedQuote) to confirm their authenticity.

generating a quote object: https://github.com/Agoric/agoric-sdk/blob/92b6cd72484079b0349d8ccfa4510aeb820e8d67/packages/zoe/src/contracts/priceAggregator.js#L103-L109

verifying one: https://github.com/Agoric/agoric-sdk/blob/92b6cd72484079b0349d8ccfa4510aeb820e8d67/packages/zoe/src/contractSupport/priceAuthorityTransform.js#L90-L92

However, subscribers like the scaledPriceAuthority stop there: the quote payment is dropped, without being deposited or burned.

This interacts badly with a safety feature of ERTP: the Recovery Set. To guard against assets being lost, each time you call payment = await E(purse).withdraw(), the payment is added to the Purse's "recovery set": a strong SetStore. If you send the Payment to a correspondent, but they fail to deposit it, you can call getRecoverySet() to access all these Payments, and then deposit them back into a Purse. Or, you can call recoverAll() to sweep them back into the original Purse. Brand new tokens, created by E(mint).mintPayment(), are held in a special mintRecoveryPurse, available from the IssuerKit. All Payments are removed from their Recovery Set as soon as they are deposited in some other Purse, or burned.

Recovery normally isn't necessary, because well-written code will deposit a Payment as soon as it receives it, and only some sort of error would prevent that from happening. However, quotes-as-payments aren't assets to be hoarded: the code only uses ERTP Payments for their authenticity-verification properties, not their exclusive-transfer properties. So it is not surprising that the price contracts fail to deposit() or burn() them.

The consequence is that the Payment remains in the recovery set forever.

The contract and ERTP code uses a variety of SetStores, WeakMapStores, and Exos to express this logic. Swingset (specifically liveslots) supports those with "virtual collections" and "virtual objects", all of which bottoms out in a per-vat key-value table named the "vatstore". This holds rows to represent collection entries, ordinals to enable consistent iteration through collections, reference counts, weak-key reference-graph edges, export statuses, and finally the virtual-object state itself.

The particular data structures used by ERTP means that each Payment costs about 10 vatstore entries. In the following diagram, each circle is a vatstore entry, of various types.

v29 quote payment

And this is an example of one particular (still-extant) payment, where each gray box is a separate vatstore entry:

v29 quote payment example

Since each vatstore entry is really stored in the kernel's kvStore (a table inside the "SwingStore" SQLite DB), they cost 10 kvStore entries each. And because state-sync integrity requires us to mirror all kvStore items into the IAVL database, they also cost 10 IAVL entries each. We've been creating quotes once every five minutes since the Vaults feature was released last June, and by now we have about 120k quote payments kept alive in v29 (the ATOM-USD price-feed vat). A similar problem is keeping 90k payments alive in v46 (scaledPriceAuthority). As we add new denominations in the future, their price feed vats will have the same issue. I estimate this currently costs 440MB of space (combining both swingstore and IAVL), growing at 5MB/day.

Additional Costs

v46 (scaledPriceAuthority) has 90k quote payments in this state, however it has also exported about 30k of them to v7-board. That vat is still holding weak references (c-list entries in the "recognizable" state, as opposed to "reachable"). This causes additional kvstore entries for each one: two for the export-side c-list, two in the kernel object table, two in the import-side clist, and some amount of RAM in v7-board.

We believe that, at some point, v46 received the payment and sent it to E(board.readOnlyMarshaller).toCapData(). The serialization process submits the received quote-payment Presence object to passStyleOf, which would have concluded that the Presence was passable, and added the object to it's WeakMap cache. This WeakMap is virtualized by liveslots (to prevent userspace from sensing GC, as usual), and VirtualObjectAwareWeakMap remembers a real Map keyed by vref strings. So v7-board will spend some (small) amount of RAM on each such object. These WeakMap references cause the c-list to remain in the "recognizable" state: they do not represent a strong ("reachable") reference.

Both the c-list/kvstore entries and the RAM costs will go away if/when the quote payment objects are released.

We've been considering allowing the Endo/passStyleOf code to use a real WeakMap instead of the virtualized one. This requires the permission of liveslots (otherwise arbitrary user code could do it, violating our GC-sensing rules), so coordination between Swingset and Endo is ongoing. If implemented, this would remove the separate passStyleOf cache and would probably allow v9-board to avoid spending either RAM or c-list imports on the object.

Potential Fixes

To flatten this growth rate, at today's Zoe meeting we discussed one short-term fix named "burn after reading": clients should burn their quote payment objects after verifying them. For example, the v29 price-feed quotes are received by v46-scaledPriceAuthority, where priceAuthorityTransform.js, could do:

    const { value: sourceQuoteValue } = await E(sourceQuoteIssuer).getAmountOf(
      sourceQuotePayment,
    );
   await E(sourceQuoteIssuer).burn(sourceQuotePayment); // burn after reading

We would need similar changes in all other places that receive quote payments. In particular, the scaled/aggregated quotes produced by v46-scaledPriceAuthority are delivered to v48-vaultFactory, where something needs its own burn() call.

Another approach would be to have the producing code rotate through recovery purses every hour or so:

In the longer term, we should find a lighter-weight means to verify the authenticity of price quotes. ERTP Payments offer both verification (through issuer.getAmountOf()) and exclusive transfer (through purse.deposit(payment)), but this use case does not require exclusive transfer. The ocap community provides other tools for mere verification, like the Branding pattern. The "issuer" replacement can retain a WeakMap of quote objects, and offer an API to perform validQuotes.has(allegedQuote).

Remediation

In addition to fixing new price quotes, we also need to dispose of the hundreds of thousands of existing ones. One idea from today's meeting:

Another approach would be to have an upgraded producer vat immediately perform mintRecoveryPurse.recoverAll() (perhaps assuming that no quote payments are outstanding at the moment of upgrade, or accepting one quote getting invalidated before reading). However we need to be cognizant of the size of the recovery set: this would cause v29 to immediately delete about a million items from its vatstore, requiring maybe ten million syscalls, all in a single crank. Who knows how long that might take. In addition, that would remove a million rows from the IAVL tree, and we know that IAVL pruning takes a long time when there has been a lot of churn, so each validator might experience a significant (vote-missing) delay some arbitrary number of blocks after the upgrade (depending upon their individual pruning configuration).

Note that different approaches require upgrades to different vats, and other bugs make some vats are more practical to upgrade than others. In particular, we'd like to fix the price-feed code before making a lot of new deployments, to avoid adding more load to the system, but eventually we need to fix (and also remediate) the existing vats.

cc @Chris-Hibbert @erights

Chris-Hibbert commented 9 months ago

@warner wrote:

I did some estimates on the 10-at-a-time thing, I think that's a good way to approach it. As of 01-dec-2023, for bug #8400 (Quote Payment), we've got 218k payment objects to delete in v46-scaledPriceAuthorityATOM 291k in v29-ATOM-USD-price_feed, 98k in v68-stATOM-USD_price_feed, 74k in v69-caledPriceAuthority. At the current PushPrice rate (one every 60 seconds), we'll clear the v46 backlog in 15 days, causing a 1.2-second BringOutYourDead every 30 minutes. The v29 backlog will have a similar load but take proportionally longer (maybe 20 days)

Chris-Hibbert commented 9 months ago

8498 is an update to #8418 that deletes the vestigial payments incrementally.

warner commented 7 months ago

Note to self: to measure the size of this problem, use SwingSet/misc-tools/categorize-kvstore.js:

% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json ingest run-17-swingstore.sqlite
% node ~/trees/agoric-sdk/packages/SwingSet/misc-tools/categorize-kvstore.js --datafile data17.json many-voms |grep "quote payment"

vatID  kindID tag                  defined referenced recognized exported sample
v29        22 quote payment         325384     325372     650744        0 {}
v46        22 quote payment         244029     244029     488058    81361 {}
v68        22 quote payment         144240     144240     288480        0 {}
v69        22 quote payment         108180     108180     216360    36080 {}
Chris-Hibbert commented 4 months ago

After #9283 and #9382, we'll no longer have an active leak. The remaining issue is how to deal with the accumulated storage, and that's addressed in #8928. I'm reassigning this to @warner.

aj-agoric commented 4 months ago

@warner to make another ticket describing the planned remediation. We will close this ticket once the sub tickets are landed on 16.

warner commented 3 months ago

Ok #9483 will track the creation+deployment of a chain-upgrade step that terminates (and initiates the slow deletion of) the old price-feed vats. That, plus/preceded-by the deployment of the #9283 / #9382 fixes to mainnet, will complete the remediation of this problem, so I'm closing this ticket.