Closed warner closed 3 years ago
What is the Problem Being Solved? ... We can reduce kernel memory usage by evicting idle managers from memory. ...
That looks like an opportunity for optimization, but it's not clearly a problem. Would you please state the problem as a problem?
Or perhaps change the heading? Clearly a bug report should start with what the problem is, but for an enhancement that framing seems a little awkward. Oh... but this is also a design issue... That hurts my head. In my way of looking at things, the two are exclusive.
Description of the Design: We need something to coordinate the creation/destruction of these VatManagers.
We do? Why? It's not entirely clear.
I'm used to enhancements being sketched in terms of user stories, and in considering designs, often implicit requirements turn up that make it clear which designs are better or worse. I'm struggling to get oriented here. I can probably muddle through, but starting from user stories would help.
Some ideas from our meeting today that might get folded into this, or into some related ticket:
And an idea I didn't want to forget: when a new validator is being set up, we know it can't safely rely upon snapshots created by existing validators (because they aren't part of the consensus, so the new validator can't tell whether they match the consensus transcripts). But, it could spend an arbitrarily long time converting downloaded+validated transcripts into locally-reliable snapshots, by just replaying each downloaded vat history one at a time. The new validator could refrain from going online (and becoming obligated to keep up with the chain) until it had "converted" all of the important vats into fast-loading snapshots. It wouldn't have to be entirely up-to-date before it goes online, just enough to let it catch up (replay the last few days worth of transcript entries onto the previously-generated snapshot) fast enough to meet the block-time requirements. The validator could decide to go online after converting the major vats, and then continue the conversion process for the remaining vats in the background as it runs (or farm it out to other machines: that phase is quite parallelizable).
@FUDCo pointed out that, if it isn't feasible or economical to convert full transcripts into locally-reliable snapshots, validators may choose to buy/rely-upon snapshots from other validators, even though they can't be confirmed to match the official transcript. We might be able to facilitate (or at least not impair) this, maybe by giving a way for each validator to sign a statement about what its own snapshot contains. Maybe there's some way to achieve economic incentives for these "snapshot shops" to publish correct snapshots.
I edited the description to tease out testable aspects of the design.
The use cases aren't written in the form I'm used to, but they're reasonably clear in any case. For example:
Vince, a validator operator, is happy to see roughly constant memory usage when 100 or 10,000 short-lived vats come and go, a few at a time.
Vince decides to resize his validator, so he shuts it down, adds RAM / disk / CPU / etc., and starts it up again. The restart time is not so long that he is slashed for truancy.
Hm.... testing those looks fairly involved.
I'm leaning towards `vatWarehouse` at the moment. Although "factory" and "provider" might be close too. "Freezer" has some of the right implications, but I'd also like the name to suggest that this thing gets to make its own decisions between freeze+thaw and just-keep-it-in-RAM. Like a library that can choose between leaving books on the front shelf vs. moving the unloved ones to the basement.
`vatCellar` sounds too much like `vatSeller`.
names:
- vatCoop
- vatFreezer
- vatCellar
- vatShelf
vatWrangler? vatSquad? vatMobile? vatican? vatFleet?
vatBrigade? vatDirector? vatBoss? invatory? vatory?
Another approach would be to call it the `vatManager` and rename the current `vatManager` to something else. I kind of like this approach, actually.
@warner agreed with my request to just go with `vatWarehouse`.
test-vatwarehouse.js 48a548eb
"You're going to like the way you compute. I guarantee it."
I mostly revived #2784, but as I look to integrate it with `kernel.js`, I wonder what the motivation was for:

```js
@param { Record<string, VatManagerFactory> } factories
```

The status quo API seems to have just one factory for all manager types:

```js
function vatManagerFactory(vatID, managerOptions, vatSyscallHandler) {
```
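For comparison, one way a `Record<string, VatManagerFactory>` could still present a single-factory surface is by dispatching on a manager-type option. This is a hedged sketch, not the actual kernel code: `makeVatManagerFactory`, the `managerType` option, and the per-type factories are all illustrative names.

```javascript
// Sketch: wrap a Record<string, VatManagerFactory> so callers still see one
// factory function, dispatching on managerOptions.managerType (hypothetical).
function makeVatManagerFactory(factories) {
  return function vatManagerFactory(vatID, managerOptions, vatSyscallHandler) {
    const { managerType = 'local' } = managerOptions;
    const factory = factories[managerType];
    if (!factory) {
      throw Error(`unknown managerType ${managerType}`);
    }
    return factory(vatID, managerOptions, vatSyscallHandler);
  };
}

// usage sketch: each entry knows how to build one kind of manager
const vatManagerFactory = makeVatManagerFactory({
  local: (vatID) => ({ vatID, type: 'local' }),
  'xs-worker': (vatID) => ({ vatID, type: 'xs-worker' }),
});
const mgr = vatManagerFactory('v1', { managerType: 'xs-worker' });
```

The tradeoff is that the record makes the set of supported manager types explicit and independently testable, while the single-function status quo keeps the dispatch logic private.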
cf18e52bf: only 2 tests fail, and they're outdated vatWarehouse tests. (360 tests passed, 12 known failures, 2 tests skipped, 4 unhandled rejections)
@warner thanks for the review of PR #2784. I added the outstanding notes to the checklist in the description above, with the exception of the crank commit complications around `evict`, which seem to fit better in #2422.
What is the Problem Being Solved?
We currently maintain a `VatManager` in RAM for all non-terminated vats, both static and dynamic. These VatManager objects are tracked in `ephemeral.vats`, a Map from the vat ID string to a record that includes the manager:

https://github.com/Agoric/agoric-sdk/blob/65105f795872a2928ce7d0bbe0971fa5b4c50897/packages/SwingSet/src/kernel/kernel.js#L125-L129
A new manager is created at startup (for all static+dynamic vats in the DB), and also when a new dynamic vat is created. The manager is destroyed for dynamic vats that terminate. Each time we need to deliver a message or promise resolution notification into a vat, we pull the manager off `ephemeral.vats` and use it for the delivery (look at the implementation of `deliverAndLogToVat` and `deliverToVat`, among others).

We can reduce kernel memory usage by evicting idle managers from memory. We can reduce overall system memory usage by terminating the corresponding workers (e.g. telling the XS worker to exit).
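In outline, that status-quo delivery path looks something like the following sketch. The `ephemeral.vats` Map shape follows the linked `kernel.js` lines; the manager object and delivery format are simplified stand-ins, not the real interfaces.

```javascript
// Sketch of the status-quo lookup: every delivery pulls the manager straight
// out of ephemeral.vats. The manager here is a stand-in for a real VatManager.
const ephemeral = {
  vats: new Map(), // vatID string -> { manager, ... }
};

// pretend a vat was created at startup
ephemeral.vats.set('v1', {
  manager: {
    deliver: async (delivery) => `delivered: ${delivery[0]}`,
  },
});

async function deliverToVat(vatID, delivery) {
  const vat = ephemeral.vats.get(vatID);
  if (!vat) {
    throw Error(`unknown vatID ${vatID}`);
  }
  return vat.manager.deliver(delivery);
}
```

A warehouse would replace the direct `ephemeral.vats.get(vatID)` with an async "provide" call that may first have to page the vat back into RAM.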
At startup, we can reduce memory and CPU usage by not creating managers for vats that are not yet in use (lazy-loading vats on-demand). There are policy / performance heuristic questions to answer: there's a tradeoff between latency and overall performance. If we correctly predict that certain vats are likely to be used right away (e.g. the fundamental economic vats, comms/vattp infrastructure), we might want to load them at startup instead of waiting for someone to send them a message. Likewise, if we can predict that a vat is not likely to be used for a while, we can evict it.
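The lazy-load/evict tradeoff above can be sketched with a deliberately naive policy. This assumes a plain LRU cache with a fixed RAM budget; every name here (`makeVatWarehouse`, `createManager`, `shutdown`, `maxInRAM`) is illustrative, not the real API.

```javascript
// Sketch: provide-or-create with LRU eviction. A real policy would likely
// also use hints (e.g. "keep the fundamental economic vats resident").
function makeVatWarehouse(createManager, maxInRAM) {
  const inRAM = new Map(); // vatID -> manager, in least-recently-used order

  async function provideVatManager(vatID) {
    let mgr = inRAM.get(vatID);
    if (mgr) {
      inRAM.delete(vatID); // re-insert below to mark as most recently used
    } else {
      mgr = await createManager(vatID); // e.g. load snapshot / replay transcript
    }
    inRAM.set(vatID, mgr);
    while (inRAM.size > maxInRAM) {
      // Map iteration is in insertion order, so the first entry is the LRU one
      const [evictID, evicted] = inRAM.entries().next().value;
      inRAM.delete(evictID);
      await evicted.shutdown(); // e.g. tell the XS worker to exit
    }
    return mgr;
  }

  return { provideVatManager, vatsInRAM: () => [...inRAM.keys()] };
}
```

Because only managers inside the budget stay resident, kernel memory stays roughly constant as idle vats accumulate; the cost is paid as reload latency on the next delivery to an evicted vat.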
Description of the Design
We need something to coordinate the creation/destruction of these VatManagers. The first issue is what to call it (see the naming discussion above).

It needs:

- Something that returns/delivers a VatManager for a given vatID. This call probably needs to be async, which means the rest of the kernel (`deliverToVat`, etc.) must be prepared to accept a Promise back from this call. Our name for this create-or-return-cached-object pattern is "provide", so this API should probably be spelled `manager = await provideVatManager(vatID)`.
- It also needs a way to create these managers. This does not need to return a VatManager, but could share machinery with the `provideVatManager` call. It would be used by `initializeSwingset` for all static vats, and when new dynamic vats are created.
- The vat manager manager should have a way to know that the caller of `provideVatManager` is no longer using the object that method returned.

possible future issues:
We need to coordinate changes to the transcript with changes to any snapshots that were stored. We might consider having a special crank type to record a snapshot and truncate the transcript: no user code would run during this crank, but the transcripts (in the kernel DB) would be atomically truncated in the same transaction that changes the snapshot ID.
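A rough shape for such a snapshot crank, assuming a hypothetical transactional key-value store; `transaction`, `get`, and `set` are stand-ins, not the real SwingSet store API, and the replay window is modeled as a simple start/end pair.

```javascript
// Sketch: record the new snapshot ID and advance the local replay-start point
// in one atomic transaction. No user code runs during this crank.
function snapshotCrank(db, vatID, takeSnapshot) {
  const snapshotID = takeSnapshot(vatID); // worker writes a heap snapshot
  db.transaction(() => {
    db.set(`${vatID}.snapshotID`, snapshotID);
    // "truncate": future replays start where the transcript currently ends
    db.set(`${vatID}.transcriptStart`, db.get(`${vatID}.transcriptEnd`));
  });
  return snapshotID;
}

// usage with a toy in-memory "DB" (a real one would make transaction atomic)
const store = new Map();
const db = {
  get: (k) => store.get(k),
  set: (k, v) => store.set(k, v),
  transaction: (thunk) => thunk(),
};
db.set('v1.transcriptEnd', 17);
snapshotCrank(db, 'v1', () => 'snap-abc123');
```

The point of the single transaction is that a crash can never leave a snapshot ID pointing at a replay window that doesn't match it: either both keys update or neither does.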
Depending upon what sort of heuristics and eviction policy we use, we might also want an API to communicate hints about usage frequency to this new manager thing. We might record a property with each vat to say whether it is likely to be frequently used or not, which this manager thing could use to make its eviction decisions. Alternatively, the manager could rely exclusively upon its own observations of `provideVatManager` calls to decide which vats are deserving of RAM and which should be kept offline as much as possible.

Consensus Notes
The presence or absence of a VatManager in RAM should not be part of the consensus state. Some members of a chain may choose to allocate more memory than others, and this does not affect the equivalence of their vats' behavior.
Snapshots are also not part of the consensus state, because we don't want to rely upon the internal serialization details of the JS engine. As a result, the truncated transcript is also not part of consensus state (it gets truncated only when a snapshot is taken). The consensus state does contain the full un-truncated transcript, which is necessary for vat migration as well as off-chain debugging replay of a vat. Snapshots and truncated transcripts are for the benefit of the local node, to make reloading a single vat (or the entire process) faster and cheaper.
cc @FUDCo @erights @dtribble @dckc
Notes on PR #2784
- `kernelSlog.delivery()` call (src)
- `provideVatSlogger` closer to clist translation (src)
- `vat-warehouse.js` from `kernel/vatManager/` to `kernel/` (src)
- `kk.getVatKeeper` with a proper `kernelKeeper.provideVatKeeper(vatID)` (src)
- #3280: split `getSourceAndOptions` to avoid accessing the large source bundle just to get the `enablePipelining` option (src)