dfinity / interface-spec

IC Interface Specification
https://khsfq-wqaaa-aaaak-qckvq-cai.icp0.io/docs
37 stars 20 forks source link

feat: a new install_code mode #135

Closed roman-kashitsyn closed 1 year ago

roman-kashitsyn commented 1 year ago

This PR describes a new variant of the upgrade mode for install_code that allows canister controllers to skip the pre_upgrade hook during the upgrade.

This upgrade method can serve two purposes.

  1. Act as an escape hatch if the canister entered otherwise irrecoverable state due to a critical bug in the post_upgrade method of the new canister module. The stable memory must already contain the entire state for the "fix" to work.

  2. Canister developers can use it during canister deployment to excercise the pre_upgrade method of the new canister.

Recovering from buggy post_upgrade hooks

Imagine that the post_upgrade method implementation has a bug that bricks the canister. Upgrading the canister in such a state is either impossible (pre_upgrade method traps unconditionally) or can result in a data loss (the post_upgrade method erroneously discarded important data, and executing pre_upgrade would corrupt the original). The stable memory, however, might still contain the correct state that the old module recorded.

If developers ever find themselves in such situation (I did), the upgrade install method with skip_pre_upgrade is the only salvation.

Testing the pre_upgrade hook

Testing pre_upgrade hooks is a constant source of struggle. If this method contains a bug, the canister can become unupgradable.

The upgrade method with skip_pre_upgrade provides a method to excercise the pre_upgrade method of the new canister module without the fear of data loss using the following procedure:

  1. Stop the canister.
  2. Upgrade the canister to the new module.
  3. If upgrade succeeded, upgrade the canister again to the same module. This upgrade will excercise the pre_upgrade hook.
  4. If the second upgrade succeeds, proceed with the next step. Otherwise, install the old canister module in upgrade mode with skip_pre_upgrade.
  5. Start the canister.
netlify[bot] commented 1 year ago

Deploy Preview for ic-interface-spec ready!

Name Link
Latest commit 211c3f2de00bfb5e0e7570d1429a51b4bc8c8189
Latest deploy log https://app.netlify.com/sites/ic-interface-spec/deploys/64c2487d23b06e0008c7fe09
Deploy Preview https://deploy-preview-135--ic-interface-spec.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Dfinity-Bjoern commented 1 year ago

This is currently still blocked by the Candid bug that prevents backward-compatible decoding of extended variants.

crusso commented 1 year ago

Testing the pre_upgrade hook

Testing pre_upgrade hooks is a constant source of struggle. If this method contains a bug, the canister can become unupgradable.

The upgrade method with skip_pre_upgrade provides a method to excercise the pre_upgrade method of the new canister module without the fear of data loss using the following procedure:

  1. Stop the canister.
  2. Upgrade the canister to the new module.
  3. If upgrade succeeded, upgrade the canister again to the same module. This upgrade will excercise the pre_upgrade hook.
  4. If the second upgrade succeeds, proceed with the next step. Otherwise, install the old canister module in upgrade mode with skip_pre_upgrade.
  5. Start the canister.

@roman-kashitsyn that might work, but isn't it simpler to do a trial upgrade to a canister that deliberately traps in post_upgrade? Then you go inspect response to see if you got the deliberate trap, or some other trap, indicating that something went wrong.That doesn't address the issue of how to fix a broken pre_upgrade (which this proposal might) but does let you test it non-destructively.

Cut'n'pasting a recipe I wrote up some time ago:

Testing upgrades

A developer might want to test an upgrade will complete successfully without losing state before performing the actual upgrade. One way to do this is to arrange for the canister to do a dry-run of the upgrade, that is guaranteed to fail with a trap in post_upgrade. After verifying the dry-run could complete successfully, the developer can perform the proper upgrade, this time without trapping, to commit to the change.

One way to code such a dry-run is to ensure that the new canister traps, either explicitly as the last step of canister/actor initialization or in a dedicated postupgrade Motoko system function. For added assurance, the canister could verify the restored state before trapping. By employing library function Debug.trap(msg), the developer can return a diagnostic error message, msg, along with the trap, to, for example, report the condition of the state post upgrade, as visible from the end of actor initialization or, optionally, within system function postupgrade.

Here’s a simple example of Counter class that can verify its own upgrade before committing to it:

https://m7sm4-2iaaa-aaaab-qabra-cai.ic0.app/?tag=2623570777

import Debug "mo:base/Debug";

actor class Counter(upgradeMode: {#verify; #commit}) {

    stable var version = 0;
    stable var state : Int = 0;

    func validate() : Bool {
        state >= 0;
    };
    public func inc() : async Int {
        state += 1;
        return state;
    };
    public query func read() : async {version : Nat; state: Int} {
        {version; state};
    };
    system func postupgrade() {
        version += 1;
        if (upgradeMode == #verify) {
          Debug.trap(debug_show 
            #verify {version; state; valid = validate()});
        };
    }
}
oggy-dfin commented 1 year ago

I think this is an improvement over the status quo so we should go through with this (if we were doing it from the scratch I'd advocate removing this whole state management business from the System API and pushing into a "user space" library, but that ship's long sailed). I don't have a strong opinion on evict vs. upgrade with flags. From a UX perspective I dislike skip_pre_upgrade being an opt bool instead of a bool given that the arguments are already an opt, but from what I understand from this choice decreases the chances of something going wrong with the NNS, so fine, let's keep it.

Before merging this we should reflect the changes in the abstract state transition part (unless we've decided to drop that part from the spec).

mraszyk commented 1 year ago

Before merging this we should reflect the changes in the abstract state transition part (unless we've decided to drop that part from the spec).

Good point! I've updated the formal part of the spec to not block this MR on the decision whether we keep it or not.

Dfinity-Bjoern commented 1 year ago

Note from the interface spec meeting: Do we have an actual reason for the opt bool? @roman-kashitsyn wrote "just in case" but we actually weren't sure whether or not that was a joke.

roman-kashitsyn commented 1 year ago

Note from the interface spec meeting: Do we have an actual reason for the opt bool? @roman-kashitsyn wrote "just in case" but we actually weren't sure whether or not that was a joke.

Just in case you want to deprecate this field later.

crusso commented 1 year ago

Note from the interface spec meeting: Do we have an actual reason for the opt bool? @roman-kashitsyn wrote "just in case" but we actually weren't sure whether or not that was a joke.

Just in case you want to deprecate this field later.

I could be wrong, but I think because it occurs in negative position, as an argument to install_code, you could just promote to reserved or remove the field from the record anyway. But I always get this wrong.

roman-kashitsyn commented 1 year ago

Note from the interface spec meeting: Do we have an actual reason for the opt bool? @roman-kashitsyn wrote "just in case" but we actually weren't sure whether or not that was a joke.

Just in case you want to deprecate this field later.

I could be wrong, but I think because it occurs in negative position, as an argument to install_code, you could just promote to reserved or remove the field from the record anyway. But I always get this wrong.

I'm fine either way. The more I use Candid, the more I prefer using optional fields everywhere...

crusso commented 1 year ago

Whatever we do about the argument, does type change_details in ic.dic needs to evolve too, or is the omission of the upgrade payload deliberate? I imagine change_details occurs in a positive position, which might force the choice of refining the upgrade payload versus adding a new variant (in favour of refining the upgrade payload.).

mraszyk commented 1 year ago

Whatever we do about the argument, does type change_details in ic.dic needs to evolve too, or is the omission of the upgrade payload deliberate? I imagine change_details occurs in a positive position, which might force the choice of refining the upgrade payload versus adding a new variant (in favour of refining the upgrade payload.).

Good point! We can either leave change_details as they're defined now (meaning we don't expose the fact whether pre_upgrade was run, effectively treating it as an implementation detail) or extend the type of the upgrade variant in change_details to opt record {...} which is fine because null (the current implicit type of the upgrade variant in change_details) is a subtype of opt record {...}.

Dfinity-Bjoern commented 1 year ago

In interface spec meeting: moving from bool to reserved would also work, so it seems using bool instead of opt bool seems preferable. We could alternatively use opt null, which may be more natural (then it would be null or opt null).

netlify[bot] commented 1 year ago

Deploy Preview for ic-interface-spec ready!

Name Link
Latest commit 7ae270e1c157dce08795e1dfb10de0f226d51f36
Latest deploy log https://app.netlify.com/sites/ic-interface-spec/deploys/64dcdefb765eea0008af839b
Deploy Preview https://deploy-preview-135--ic-interface-spec.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

mraszyk commented 1 year ago

I could be wrong, but I think because it occurs in negative position, as an argument to install_code, you could just promote to reserved or remove the field from the record anyway. But I always get this wrong.

What we missed though is that we need both forwards- and backwards-compatibility when deprecating those fields and thus we need an opt ....