dfinity / motoko

Simple high-level language for writing Internet Computer canisters
Apache License 2.0

Queue + failing heartbeat + stopping canister = death spiral #3275

Open paulyoung opened 2 years ago

paulyoung commented 2 years ago

I’ve been trying to help come up with a workaround to the problem encountered in this forum post:

https://forum.dfinity.org/t/queue-failing-heartbeat-stopping-canister-death-spiral/13328?u=paulyoung

I think the solution is probably to change heartbeat to consider the canister status, but I thought it might be worth sharing here in case anyone wanted to try and address it for Motoko specifically.

cc @crusso who worked on https://github.com/dfinity/motoko/pull/2677

crusso commented 2 years ago

I think I can reproduce the "death spiral" like so:

https://m7sm4-2iaaa-aaaab-qabra-cai.raw.ic0.app/?tag=2447745723

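actor {

   var beats = 0;      // number of heartbeats observed so far
   var done = false;   // set by exit() to end the recursion

   // One-way call that lets the recursion terminate.
   public func exit() : () {
       done := true;
   };

   // Awaits itself until done is set, so there is always an outstanding call.
   public func rec_await() : async () {
       if (done) return;
       await rec_await();
   };

   // Every heartbeat starts another chain of pending awaits.
   system func heartbeat() : async () {
       beats += 1;
       await rec_await();
   };

   public query func getBeats() : async Nat {
       beats;
   }

}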

Start the canister, wait a while and then try to upgrade and you should see the failure.

No good ideas how to address this though.

chenyan-dfinity commented 2 years ago

Addressing this would require detecting termination? I think the only way out of this is to update the freezing_threshold to a large number, so that it stops early.

nomeata commented 2 years ago

The problem isn't really the heartbeat; rec_await alone is enough to get into that state, no matter how you call it. So is a simple loop await async {}.
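For readers who want to try the heartbeat-free variant, here is a minimal sketch of the `loop await async {}` shape mentioned above. The method name `spin` and the `done` escape hatch are assumptions added for illustration; they are not from the comment itself, and the sketch has not been run against a replica.

```motoko
actor {

    // Hypothetical escape hatch so the loop can be ended before stopping.
    var done = false;

    // One-way call, mirroring exit() in the reproduction above.
    public func exit() : () {
        done := true;
    };

    // Hypothetical method name. No heartbeat is involved: every iteration
    // awaits a trivial async block, so this call stays pending until
    // done is set.
    public func spin() : async () {
        loop {
            if (done) return;
            await async {};
        };
    };

}
```

Calling `spin` once and then trying to stop or upgrade should, on this reading, show the same behavior as the heartbeat reproduction, provided `exit` has not been called first.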

crusso commented 2 years ago

Yeah, fundamentally, the problem is bi-directional messaging and our whole notion of async/await still waiting.

I wonder if we should make the status observable via some primitive so programmers can accommodate it, or make await fail synchronously when the canister is stopping. But I imagine the same design choice affects the design of ic0.call_perform: I assume that doesn't fail synchronously when stopping, but could, right?

nomeata commented 2 years ago

It doesn't, because maybe it is more important to finish a task than to stop completely. That's the point of stopping, really. The system shouldn't force a particular, possibly dangerous behavior on canisters. But canisters or CDKs are free to choose a different behavior, e.g. to stop issuing calls when in stopping mode.
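A rough sketch of what "stop issuing calls" could look like at the canister level, under loud assumptions: since the thread notes that the status is not observable from Motoko, the `draining` flag is set by hand through a hypothetical `prepareToStop` method, and `work` is a stand-in for whatever call the heartbeat normally makes. None of these names come from the thread, and a real canister would restrict who may call `prepareToStop`.

```motoko
actor {

    var beats = 0;

    // Hypothetical flag standing in for an observable "stopping" status.
    var draining = false;

    // Hypothetical admin method (no access control in this sketch).
    // It has to be called while the canister is still running.
    public func prepareToStop() : async () {
        draining := true;
    };

    // Stand-in for the call the heartbeat would normally issue.
    public func work() : async () {};

    system func heartbeat() : async () {
        // Return before any await so this beat issues no further calls.
        if (draining) return;
        beats += 1;
        await work();
    };

    public query func getBeats() : async Nat {
        beats
    };

}
```

The intended sequence would be to call `prepareToStop`, let already pending calls finish, and only then request the stop or upgrade. Whether that is enough for a given canister depends on what else it awaits, which this sketch does not address.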