dfinity / motoko

Simple high-level language for writing Internet Computer canisters
Apache License 2.0

Limit size of stable variables, keep canister upgradeable #2978

Open nomeata opened 2 years ago

nomeata commented 2 years ago

While the semantics of stable variables are (presumably) quite nice, our current implementation is lacking and actually dangerous: because we serialize all the data reachable from stable variables in pre_upgrade, a canister that accumulates data in stable variables can easily end up in a non-upgradeable state after a while. This is a very serious risk for our users, and there is little they can do to prevent it. The only advice I can give right now is "don't do that", but we can't even quantify what "that" means!

There is a bit of disagreement about whether this is better addressed by improving the implementation of stable variables so that large amounts of data can be stored without these problems, or by providing an alternative way to use stable memory.

But in either case, we should prevent people who use Motoko as it exists now from running into a very unfortunate unupgradeable-canister situation (and I do think that safety here is more important than providing a scalable alternative).

I see two possible safeguards:

  1. We serialize all the stable variables to stable memory at the end of each message.

    This makes stable variables considerably more cycle-expensive, and (because normal messages have a lower instruction limit than upgrade messages) lowers the implicit stable variable size limit even further.

    But it is also very reliable: pre_upgrade wouldn't have to do anything anymore, so it simply cannot happen that a canister becomes non-upgradeable due to a trapping pre_upgrade. I already recommend that Rust developers avoid pre_upgrade, so I'd feel better if we could do the same.

    Canisters that grow their stable variable data without bound will still trap, but they will trap in the message that adds the data. This is fine and expected, and no different from running out of normal memory. In particular, they can still upgrade to a canister that fixes the problem.

  2. Slightly less radical: at the end of each message, calculate the serialized size of the stable variables, and trap if it is above a certain limit (maybe 10 MB?) that we hope and assume will always be possible to serialize in pre_upgrade.

    Benefit: lower cycle costs than above. May be easier to implement (see below). Also, a low limit will be better at nudging developers not to put lots of data into stable variables until we actually support that.

    Downside: we'd need to guess the limit, and it might turn out to be wrong as the system changes. Serialization size may also not be a great predictor of cycle cost.
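To make safeguard 2 concrete, here is a toy simulation (Python for illustration only, not the Motoko runtime; the hook name, the use of pickle as a stand-in for Motoko's serialization, and the 10 MB figure are all assumptions taken from the discussion above):

```python
# Hypothetical sketch of safeguard 2: a post-message hook that measures
# the serialized size of the stable variables and traps when it exceeds
# a fixed limit, rolling the message back.
import pickle  # stand-in for Motoko's stable-variable serialization

STABLE_SIZE_LIMIT = 10 * 1024 * 1024  # the guessed 10 MB limit from above

class Trap(Exception):
    """Models a canister trap: the message's state changes are discarded."""

def end_of_message_check(stable_vars: dict) -> int:
    """Run after every message; traps if the stable data got too large."""
    size = len(pickle.dumps(stable_vars))
    if size > STABLE_SIZE_LIMIT:
        raise Trap(f"stable variables too large: {size} bytes")
    return size
```

The key property is that the trap happens in the message that adds the data, so the canister itself always stays upgradeable.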

This would be a relatively simple change if it were not for one issue: our current stable variable serialization code assumes the heap gets discarded afterwards, and uses that for a few dirty tricks related to shared mutable values (that feature might have blown the complexity budget...). To non-destructively calculate the size, we'd need a separate set data structure to mark seen elements (maybe a bitmap like the GC uses?), or maybe use high bits in the object header in a clever way. To non-destructively serialize, we might need even more.
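The seen-set traversal can be illustrated with a small simulation (Python purely for illustration; the node representation, the 8-byte header, and all names are invented, the point is only that a side table handles sharing and cycles without touching object headers):

```python
# Toy model of the non-destructive size calculation: walk the heap graph
# from the stable roots, tracking visited nodes in a side table (a set of
# object ids, playing the role of a GC-style mark bitmap) so that shared
# mutable values are counted exactly once and cycles terminate.

def graph_size(roots) -> int:
    seen = set()          # side table instead of mutating object headers
    total = 0
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if id(obj) in seen:
            continue      # shared value: already counted
        seen.add(id(obj))
        total += 8        # pretend every node has an 8-byte header
        if isinstance(obj, list):
            stack.extend(obj)          # recurse into children
        elif isinstance(obj, (bytes, str)):
            total += len(obj)          # payload bytes
    return total
```

A shared child is counted once, and a self-referential list does not loop forever, which is exactly what the destructive tricks currently buy us.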

So this is some amount of work, but maybe worth it? The status quo (every canister that uses stable variables with growing data eventually becomes irrevocably unupgradeable) is kinda bad.

crusso commented 2 years ago

Alternatively, what if we introduced (optional) manual stabilization and destabilization instead, imposing a protocol on upgrade? Then we could spread the cost of serialization across several messages before the actual upgrade.

rossberg commented 2 years ago

To expand on what @crusso said (since we briefly discussed it the other day): the idea would be that a canister can provide explicit methods for copying its heap to and from stable memory. These methods are always careful to stay within cycle limits, but might have to be invoked multiple times until all the work is done. They would be "manually" iterated by dfx as part of the upgrade procedure. The Motoko runtime would ensure that all other methods are rejected while this is in progress.
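As a rough illustration of that protocol (a Python toy model, not a proposed Motoko or dfx API; BUDGET, stabilize_step, and every other name here are invented):

```python
# Toy model of manual, budget-limited stabilization: each call to
# stabilize_step() copies at most BUDGET items from the heap to "stable
# memory" and reports whether it is done, so an external driver (dfx, or
# a self-call loop) can iterate it until completion. While stabilizing,
# the runtime rejects ordinary methods.

BUDGET = 2  # items copied per message; stands in for the cycle limit

class Canister:
    def __init__(self, heap):
        self.heap = list(heap)   # data still to be copied
        self.stable = []         # destination "stable memory"
        self.stabilizing = False

    def stabilize_step(self) -> bool:
        """Copy one budget-limited chunk; return True when finished."""
        self.stabilizing = True
        chunk, self.heap = self.heap[:BUDGET], self.heap[BUDGET:]
        self.stable.extend(chunk)
        return not self.heap

    def update(self, x):
        if self.stabilizing:
            raise RuntimeError("canister is stabilizing; call rejected")
        self.heap.append(x)
```

The driver simply calls stabilize_step() until it returns True, then performs the actual upgrade against the fully populated stable memory.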

One question here is how this would interact with outstanding replies. If we want to run these manual methods, we probably can't stop the canister. Do we wait for completion of all replies first?

crusso commented 2 years ago

I don't even think manual iteration by dfx would be necessary – a single call to stabilize() that enters a stabilizing mode before issuing nested self-messages (each getting its own budget) should be enough (cf. https://github.com/dfinity/motoko/issues/2970).

You'd have to be careful about post_upgrade too. Either a manual destabilize() method could be used to force destabilization just after the upgrade, or update methods would need to lazily destabilize variables, possibly across several messages. I guess queries would just need to transiently fail until destabilization is complete.

A bit ugly. And it still doesn't protect against Candid immutable-DAG bombs, but I guess that's a separate issue.
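The lazy-destabilization variant could behave like this toy model (Python, every name hypothetical; the single-step deserialization stands in for work that might really span several messages):

```python
# Toy model of lazy destabilization after upgrade: update methods
# destabilize the variables on first use, while queries transiently
# fail until destabilization has happened.

class UpgradedCanister:
    def __init__(self, stable_data):
        self.stable_data = stable_data  # stabilized stable variables
        self.vars = None                # heap form, not yet rebuilt

    def _destabilize(self):
        if self.vars is None:
            # in the real runtime this might span several messages
            self.vars = list(self.stable_data)
            self.stable_data = None

    def update_append(self, x):
        self._destabilize()             # updates destabilize lazily
        self.vars.append(x)

    def query_len(self) -> int:
        if self.vars is None:
            raise RuntimeError("destabilizing; query transiently fails")
        return len(self.vars)
```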

rossberg commented 2 years ago

Yes, good point about the iteration. Though I'm still somewhat confused about our messaging model. Wouldn't such recursion potentially run into call graph limits? Or did we get rid of those?

And yes, I am certainly assuming that the inverse would have to be done post-upgrade.

nomeata commented 2 years ago

We have no limits there (and never had)

It would require resumable serialization; that'd be a refactoring (we can't easily use the stack).

Anyway, the point of this issue is to find the easiest way to prevent unupgradeable canisters given our current capacity limitations, not (necessarily) to lift those limitations.

nomeata commented 2 years ago

> In order to non-destructively calculate the size we'd need a separate set data structure to mark seen elements (maybe a bitmap like the GC?)

Easiest way might be:

Yes, it's two traversals, but I'm not too worried about cycle costs here – the point is, after all, to nudge people not to have large stable data for now. Maybe I'll prepare a PR later today.

nomeata commented 2 years ago

On second thought, it's not so great either: just measuring the size at the end of a message is only a heuristic and doesn't guarantee anything.

And rewriting the full serialization code to be non-destructive is trickier.