@nomeata FYA
When spreading GC across multiple messages, what do we do with incoming messages (calls and responses) that arrive in between? Fail them?
I'm not sure though what the code above tells us?
It tells us that we can pretty much do an arbitrary amount of computation if we cut it up.
For incoming messages, I think we either fail them with an error (not a trap) right away, or hijack them first to try to complete the GC before attempting the work and proceeding; if the GC is still not complete, we fail with an error (but at least some GC progress has been made).
Not great, but probably not likely to occur very often. And all our users are coding to handle failures, right?
Also, we could possibly use the heartbeat to do a major gc and maybe even stabilization.
But maybe I'm just overestimating the development cost of doing incremental GC, and it's just around the corner.
> I think we either fail them with an error (not trap) right away or hijack them first to try to complete GC before attempting the work or failing. Not great, but probably not likely to occur very often. And all our users are coding to handle failures, right?
Hm, I'm not sure if we can be comfortable with that assumption. Failure should probably be exceptional and not happen during "regular" operation.
Could we somehow resend incoming messages to self if GC is still active, so that they are deferred?
Resending should be possible: if at the beginning of the message a GC is in progress, do a self-call, do more GC therein, and then continue in the callback (it has to be the callback to be in the right call context).
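Very roughly, and only to illustrate the shape of the idea (the real mechanism would live in the compiler/runtime; `gcPending`, `gcSlice` and `gcHelper` below are made-up stand-ins, not real Motoko APIs):

```motoko
actor {
  // made-up stand-ins for the runtime's GC state and a bounded GC step
  var gcPending : Bool = false;
  func gcSlice() { /* do a bounded amount of GC work; may clear gcPending */ };

  // a self-call whose only job is to run another GC slice in its own message
  public func gcHelper() : async () { gcSlice() };

  public func request() : async () {
    if (gcPending) {
      // defer: the self-call does more GC in a separate message, and we
      // resume here, in the callback, still in the caller's call context
      await gcHelper();
    };
    // ... the actual work of the message ...
  };
}
```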
Queries can't do that, so they either fail, or we have to make sure that a partial GC still has a useful heap (even if mutation then aborts the GC).
Oh, same with doing upgrades! We can't defer them.
Overall, probably more complex than it seems. How far are we from an incremental GC, even if naive and not optimized (our naive copying GC served us well longer than expected)? And if manpower were not the problem, would we even know what to implement?
I think @ulan has some algorithms in mind, but I expect all of these are significantly more work than trampolining off the scheduler on entry/exit before the mutator gets to run.
Still, the corner cases. For example, a stopping canister will now behave very oddly, as these self-calls will then fail.
I guess one could cause any upgrade to fail until the canister is restarted and has had time to complete GC. But yeah, ugly.
Hopefully, https://github.com/dfinity/motoko/pull/3837 provides a sustainable solution.
Solved by the incremental GC.
Got too bored being sick so I did an experiment to see if we could spread the cost of a full GC across several self-calls (as I suspected we could). This means that, should incremental GC be too much of a reach at the moment, given resourcing, we could consider still doing full GCs, just spread across a few messages. Our main worry about a full GC is that it might exhaust the cycle budget, leaving a canister stuck.
Here's the code I'm using: https://m7sm4-2iaaa-aaaab-qabra-cai.raw.ic0.app/?tag=1302663172
It has:

- a `work(mb)` function that does some configurable computational work (requiring the GC): it allocates a large array of `mb` MiB and reverses it in place;
- a `maximizeWork()` function that finds a choice of `mb` that will fit into a single call budget;
- a `go()` function that finds the maximal work amount and iterates the work at that amount, until someone externally calls `done()` to exit the loop.

Seems like we can do quite a few iterations of the outer loop without running out of cycles... The functions do some logging to a text variable that is returned by `go`.
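For concreteness, here is a rough sketch of the shape such code could take. This is not the code behind the link above: the function names follow the description, but details like the placeholder value in `maximizeWork()`, the element count derived from `mb`, and the `await async {}` yield are assumptions of mine.

```motoko
import Array "mo:base/Array";

actor {
  var log = "";
  var running = false;

  // configurable work: allocate roughly `mb` MiB and reverse it in place
  func work(mb : Nat) {
    let n = mb * 1024 * 1024 / 4;  // very rough: about one word per element
    let a = Array.tabulateVar<Nat>(n, func (i : Nat) : Nat { i });
    var i = 0;
    var j : Nat = if (n == 0) 0 else n - 1;
    while (i < j) {
      let t = a[i]; a[i] := a[j]; a[j] := t;
      i += 1; j -= 1;
    };
  };

  // stand-in for the search over `mb` that still fits one call's cycle budget
  func maximizeWork() : Nat { 64 };

  public func done() : async () { running := false };

  // iterate the maximal work amount across messages until done() is called
  public func go() : async Text {
    running := true;
    let mb = maximizeWork();
    var rounds = 0;
    while (running) {
      work(mb);
      rounds += 1;
      log #= "completed round " # debug_show (rounds) # "\n";
      await async {};  // yield: the next round runs in a fresh message
    };
    log
  };
}
```

The point of the `await` inside `go()`'s loop is that each round commits and resumes in a new message, so each round gets its own cycle budget, which is what lets the overall computation exceed a single message's limit.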