@nomeata FYA
When spreading GC across multiple messages, what do we do with incoming messages (calls and responses) that arrive in between? Fail them?
I'm not sure though what the code above tells us?
It tells us that we can pretty much do an arbitrary amount of computation if we cut it up.
For incoming messages, I think we either fail them with an error (not a trap) right away, or hijack them first to try to complete the GC before attempting the work and proceeding; if the GC is still not complete, we fail with an error (but at least some GC progress has been made).
Not great, but probably not likely to occur very often. And all our users are coding to handle failures, right?
Also, we could possibly use the heartbeat to do a major gc and maybe even stabilization.
But maybe I'm just overestimating the development cost of doing incremental GC, and it's just around the corner.
> I think we either fail them with an error (not trap) right away or hijack them first to try to complete GC before attempting the work or failing. Not great, but probably not likely to occur very often. And all our users are coding to handle failures, right?
Hm, I'm not sure if we can be comfortable with that assumption. Failure should probably be exceptional and not happen during "regular" operation.
Could we somehow resend incoming messages to self if GC is still active, so that they are deferred?
Resending should be possible: if at the beginning of the message a GC is in progress, do a self-call, do more GC therein, and then continue in the callback (it has to be the callback to be in the right call context).
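Very roughly, and only to illustrate the shape of the idea (the real mechanism would live in the compiler/runtime; `gcPending`, `gcSlice` and `gcHelper` below are made-up stand-ins, not real Motoko APIs):

```motoko
actor {
  // made-up stand-ins for the runtime's GC state and a bounded GC step
  var gcPending : Bool = false;
  func gcSlice() { /* do a bounded amount of GC work; may clear gcPending */ };

  // a self-call whose only job is to run another GC slice in its own message
  public func gcHelper() : async () { gcSlice() };

  public func request() : async () {
    if (gcPending) {
      // defer: the self-call does more GC in a separate message, and we
      // resume here, in the callback, still in the caller's call context
      await gcHelper();
    };
    // ... the actual work of the message ...
  };
}
```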
Queries can't do that, so they either fail, or we have to make sure that a partial GC still has a useful heap (even if mutation then aborts the GC).
Oh, same with doing upgrades! We can't defer them.
Overall, probably more complex than it seems. How far are we from an incremental GC, even if naive and not optimized (our naive copying GC served us well longer than expected)? And if manpower were not the problem, would we even know what to implement?
I think @ulan has some algorithms in mind, but I expect all of these are significantly more work than trampolining off the scheduler on entry/exit before the mutator gets to run.
Still, the corner cases. For example, a stopping canister will now behave very oddly, as these self-calls will then fail.
I guess one could cause any upgrade to fail until the canister is restarted and has had time to complete GC. But yeah, ugly.
Hopefully, https://github.com/dfinity/motoko/pull/3837 provides a sustainable solution.
Solved by the incremental GC.
Got too bored being sick so I did an experiment to see if we could spread the cost of a full GC across several self-calls (as I suspected we could). This means that, should incremental GC be too much of a reach at the moment, given resourcing, we could consider still doing full GCs, just spread across a few messages. Our main worry about a full GC is that it might exhaust the cycle budget, leaving a canister stuck.
Here's the code I'm using: https://m7sm4-2iaaa-aaaab-qabra-cai.raw.ic0.app/?tag=1302663172
It has:

- a `work(mb)` function that does some configurable computational work (requiring the GC): it allocates a large array of `mb` MiB and reverses it in place;
- a `maximizeWork()` function that finds a choice of `mb` that will fit into a single call budget;
- a `go()` function that finds the maximal work amount and iterates the work at that amount, until someone externally calls `done()` to exit the loop.

Seems like we can do quite a few iterations of the outer loop without running out of cycles... The functions do some logging to a text variable that is returned by `go`.
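For concreteness, here is a rough sketch of the shape such code could take. This is not the code behind the link above: the function names follow the description, but details like the placeholder value in `maximizeWork()`, the element count derived from `mb`, and the `await async {}` yield are assumptions of mine.

```motoko
import Array "mo:base/Array";

actor {
  var log = "";
  var running = false;

  // configurable work: allocate roughly `mb` MiB and reverse it in place
  func work(mb : Nat) {
    let n = mb * 1024 * 1024 / 4;  // very rough: about one word per element
    let a = Array.tabulateVar<Nat>(n, func (i : Nat) : Nat { i });
    var i = 0;
    var j : Nat = if (n == 0) 0 else n - 1;
    while (i < j) {
      let t = a[i]; a[i] := a[j]; a[j] := t;
      i += 1; j -= 1;
    };
  };

  // stand-in for the search over `mb` that still fits one call's cycle budget
  func maximizeWork() : Nat { 64 };

  public func done() : async () { running := false };

  // iterate the maximal work amount across messages until done() is called
  public func go() : async Text {
    running := true;
    let mb = maximizeWork();
    var rounds = 0;
    while (running) {
      work(mb);
      rounds += 1;
      log #= "completed round " # debug_show (rounds) # "\n";
      await async {};  // yield: the next round runs in a fresh message
    };
    log
  };
}
```

The point of the `await` inside `go()`'s loop is that each round commits and resumes in a new message, so each round gets its own cycle budget, which is what lets the overall computation exceed a single message's limit.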