dfinity / motoko

Simple high-level language for writing Internet Computer canisters

Collect garbage when the heap is full #2033

Open osa1 opened 4 years ago

osa1 commented 4 years ago

Currently Motoko programs only do GC in these places:

- After responding to a message (but not if a message is a query? I don't understand this part)

In particular, Motoko doesn't do GC if a single message fills the heap. Here's an example (run with run.sh -d):

import Prim "mo:prim";

actor a {

    class range(x : Nat, y : Int) {
        var i = x;
        public func next() : ?Nat { if (i > y) null else {let j = i; i += 1; ?j} };
    };

    public func allocate() : async () {
        // Allocate 4GiB of arrays, 4KiB each. This shouldn't fill up the heap
        // as the arrays are not used and could be collected.
        for (i in range(0, 1024 * 1024 * 4)) {
            let a = Prim.Array_init<Nat>(1024, 0); // 4 KiB
        }
    };

};

a.allocate(); //OR-CALL ingress allocate 0x4449444C0000

//SKIP run
//SKIP run-low
//SKIP run-ir
//SKIP ic-ref-run

allocate allocates 4GiB worth of 4KiB arrays but doesn't keep any of them alive, so ideally this program should not cause a heap overflow.

(Btw, this program fails with "heap out of bounds", which seems weird. I'd expect it to fail with "heap overflow" or something like that.)

To fix this we need to at least do a GC when the heap is full [1], in alloc_words [2]. Later we may want to do GC more often, in different places, but we have to do a GC when the heap is full to avoid trapping, so I think it's fine to start with just doing a GC in alloc_words when the heap is full for now. Once this is implemented adding more GC points will also be easier.

The main difficulty is to pass roots to the GC at the GC site. As far as I understand we currently don't need this because a property of the current GC sites is that they don't have any roots in locals, stack, or globals, so the only roots are those in the static_roots array.

I think we currently don't store pointers in globals (as far as I can see globals are only used for multi-value returns), so we only need to refactor our use of function locals and the Wasm stack. Because Wasm doesn't provide a way to traverse the stack or function locals, we can't use "stack maps" or anything similar that tells the GC the locations of pointers in the Wasm stack, locals, and globals. I think we have no choice other than maintaining our own stack for roots. The idea is that at a GC point this stack will hold all dynamic objects (other than those pointed to by the static_roots array) that the GC needs to retain.

I'm not sure how to best implement this stack. Some notes:

  1. At a function call, the caller needs to push onto the stack all of its pointers that may be used after the callee returns. (This can be avoided if the callee is known not to allocate, but it's probably a good idea not to make this any more complicated for now and to assume every function can potentially allocate.)

  2. When returning, the function needs to pop its pointers off the stack.

  3. The caller, when the callee returns, needs to reload locals from the root stack, assuming that the function call did a GC. (This would not be necessary with a non-moving GC. A minimal sketch of this discipline follows the list.)

  4. We can still use locals and Wasm stack for pointers, but we need to make sure to do (1), (2), (3). Alternatively we could stop using Wasm stack for pointers and only use the root stack.

  5. The root stack needs to be extensible, which I think means it'll have to be heap allocated and should allow chaining.

  6. Knowing which locals will potentially be used after a function call at a call site requires liveness analysis. Maybe for now we could assume all locals except those that are out of scope are live (I don't know if this will be useful enough in practice, but I suspect it might). For example, in the program above, we shouldn't assume a will be live after an iteration, and we should pop it off the stack before jumping to the loop header.

  7. The liveness analysis will have to insert kill instructions/expressions for variables after their last use in a function. For this I think we'll have to maintain something like stack frames at runtime, and at compile time we need to map locals to their locations in the function's frame. For example, if I have (in a function)

    let a = ...;
    let b = ...;
    let c = ...;
    use(a);
    use(b);
    use(c);

    We want to convert this to:

    let a = ...;
    let b = ...;
    let c = ...;
    use(a);
    kill(a);
    use(b);
    kill(b);
    use(c);
    kill(c); // alternatively just pop the function frame as this is the end of the function

    The IR instruction/expression kill needs to know the location of variables in the current frame.

    Allocating frame locations for variables is a problem similar to register allocation. We could implement something like "linear scan", but where we simply extend the call frame when there isn't an available slot.
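To make points (1)-(3) concrete, here's a minimal sketch of such a root-stack discipline (in Rust, in the spirit of the RTS). All names here are illustrative, not actual Motoko RTS API, and it assumes a moving GC, so pointers must be reloaded after any call that may collect:

/// A growable stack of GC roots; the collector scans (and, being a moving
/// collector, updates) these slots. Illustrative only.
struct RootStack {
    slots: Vec<usize>, // each slot holds a heap pointer
}

impl RootStack {
    fn frame_base(&self) -> usize {
        self.slots.len()
    }
    /// (1) Spill a pointer that is live across a call; returns its slot.
    fn push(&mut self, ptr: usize) -> usize {
        self.slots.push(ptr);
        self.slots.len() - 1
    }
    /// (3) Reload a pointer after a call: the GC may have moved the object.
    fn reload(&self, slot: usize) -> usize {
        self.slots[slot]
    }
    /// (2) On return, pop this frame's pointers off the stack.
    fn pop_frame(&mut self, base: usize) {
        self.slots.truncate(base);
    }
}

fn caller(roots: &mut RootStack, mut x: usize) {
    let base = roots.frame_base();
    let x_slot = roots.push(x); // (1) x is used after the call, so spill it
    callee(roots);              // the callee may allocate and trigger a GC
    x = roots.reload(x_slot);   // (3) x may have moved; reload it
    let _ = x;                  // ... keep using x ...
    roots.pop_frame(base);      // (2) drop this frame's roots before returning
}

fn callee(_roots: &mut RootStack) {
    // may allocate; on heap-full it would run the GC, which walks `slots`
}

The alternative in (4), keeping pointers only on the root stack, would drop the push/reload pairs around calls at the cost of going through the root stack for every pointer access.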

This is all very complicated, but I don't see any other way... Maybe others do? Is there an easier way? Does everyone agree that this (doing GC when the heap is full) is something we need to implement?

Adding GC to Wasm spec makes more sense to me now...

CC @nomeata @crusso @rossberg @ggreif

[1]: The current GC only uses stack space, so it's currently fine to do a GC when we reach the 4GiB limit.

[2]: All heap allocations go through alloc_words, which is implemented as an RTS function.

kritzcreek commented 4 years ago

How would this interact with tail-calls? (When they actually get implemented in the Wasm engines). Would we also have to recognize tail-calls to not blow our custom stack?

osa1 commented 4 years ago

I think we just have to pop the current root stack frame off the stack since we won't be returning to it, unless I'm missing something...

EDIT: Missed the second part of the question. Yeah, if the Wasm call instruction automatically becomes a tail call when possible, we'll have to recognize tail calls to make sure to pop the current frame off the stack.

nomeata commented 4 years ago

After responding to a message (but not if a message is a query? I don't understand this part)

The system discards the canister state after a query (that’s the point of a query – they should be called “non-committing methods” or something), so why bother cleaning up.

so we only need to refactor our use of function locals and the Wasm stack.

Correct, just like everything that follows, including, unfortunately

This is all very complicated, but I don't see any other way...

I have to agree. Maybe Andreas has a good idea.

One other way is to wait for the Wasm GC proposal to be accepted, and then fully rely on the host GC. That would amount to an almost complete rewrite of the backend. I doubt that we can use GC types on DFINITY soon enough, though.

Does everyone agree that this (doing GC when the heap is full) is something we need to implement?

I don't know actually. If we have an upper limit of how much allocation we expect in reasonable programs within a single message execution (which is cycle-bounded!), this limit might be less than 2GB.

Also, GC after each message has the benefit that costs are paid by whoever incurred them. If you defer GC, then it might hit some random other message. Not a big problem as long as we have “canister pays”, but I wouldn’t close the door to “user pays (fairly)”.

I think what is more pressing than the ability to run GC within messages is to have a non-moving GC, and maybe one where you pay for garbage, not for live data, so that running it at the end of each message is cheaper.

rossberg commented 4 years ago

Yes, a shadow stack in memory is unavoidable. But FWIW, even compiling C requires one.

For GC, as you say, there are two options: either spill (live) roots before calls, or always put reference-typed locals on the shadow stack (only). The latter seems much simpler, since you don't need to do anything around calls, and liveness analysis is much less crucial. It might be less efficient in cases where accesses to reference variables dominate the number of calls in a function, but you can easily construct opposite examples as well. So for a first version, I would avoid premature optimisation and go for the simpler solution.

I don't know how urgent all this is at the moment. We should build support for it mid-term, but it's certainly not a top priority for Mercury. For that, I would focus on lower-hanging fruit for GC, if there are any -- or on other things altogether.

As for Wasm GC, don't hold your breath. There has been a frustrating amount of stalling and distraction, and I have no idea anymore when this will actually become a thing. Also, we do not yet have an idea how to persist a Wasm-internal heap on our platform, so there are more technical challenges ahead before we would be able to use it. Hence, we shouldn't plan for using it for Motoko in the foreseeable future.

nomeata commented 4 years ago

For that, I would focus on lower-hanging fruit for GC, if there are any -- or on other things altogether.

Maybe this order:

  1. Build infrastructure to know how much we are paying for GC (and other things).

    My best bet here (i.e. what I’d do): Get the Wasm instrumentation code from dfinity, put it in a shared place, and then extend it so that it actually tracks cycle consumption per function. Maybe even spit out data that can be turned into a flame graph (http://www.brendangregg.com/flamegraphs.html). This does not need to be particularly fast; a constant-factor slowdown is ok. And this would be super useful for our optimization work, and for our users.

  2. Switch to a non-copying GC. We are fairly certain that that is more pressing than switching to an in-message GC, given that our platform will penalize dirtying pages.

  3. Maybe (or maybe depending on a compiler flag) skip GC at the end of a message if the heap is not very full yet.

  4. … dunno yet …

How does that sound? Ömer, do these (or one of these) steps sound interesting to you?

crusso commented 4 years ago

I think a shadow stack is unavoidable, either maintained directly or constructed on calls by spilling, but I also don't think we need in-message GC right now, and it might be risky to do for Mercury.

I guess the other question is whether there are some obvious improvements to make to what we have now.

For example I always thought it was dubious to copy to-space over from-space. Do we need to do that rather than just logically flip the roles? I recall Joachim has some argument why this is good, but I don't remember what it was.

Maybe there is some other low-hanging fruit. If we tracked arrays of scalars somehow, could we avoid scanning the array contents and just blit the array on evacuation?

Could we relax the 2GB restriction by using stable memory as our to-space?

crusso commented 4 years ago

And then there was that idea of avoiding gc when the root-set hasn't changed and there have been no writes to the heap reachable on entry, detected using a write barrier. But maybe that was a bad idea or not worth the effort.

Even simpler, do we currently avoid doing a gc if there was no allocation? I guess that might leave garbage around in case the heap was just modified, but not extended.

crusso commented 4 years ago

Regarding diagnostics, we now have primitives for measuring cycle consumption. Maybe we could just measure what proportion of cycles are spent on mutation vs gc.

nomeata commented 4 years ago

All good ideas. But I would not invest any more work into our copying GC. Switching to a non-copying GC is inevitable, isn’t it?

Regarding diagnostics, we now have primitives for measuring cycle consumption.

Do we? (I don’t think ic0.canister_balance will update as you go: Instead, before the message gets executed, some MAX_CYCLES_PER_MESSAGE are subtracted from the balance, the message is run, and afterwards whatever you didn't use is added again. At least that’s what I think is happening.)

osa1 commented 4 years ago

@nomeata

Also, GC after each message has the benefit that costs are paid by whoever incurred them. If you defer GC, then it might hit some random other message. Not a big problem as long as we have “canister pays”, but I wouldn’t close the door to “user pays (fairly)”.

We can still do GC after each update message; we don't have to do it only when the heap is full.

I think what is more pressing than the ability to run GC within messages is to have a non-moving GC, and maybe one where you pay for garbage, not for live data, so that running it at the end of each message is cheaper.

What kind of non-moving GC do you mean? A simple mark-sweep will still visit the entire heap: the mark pass visits live data, the sweep pass visits the dead. In that sense the current GC is actually better. I guess there could be smarter ideas that visit only garbage, but I'm not aware of any. I think there's no way around visiting live data (either for marking or for copying).
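As a toy illustration of that cost argument (nothing Motoko-specific; all names are made up): the mark phase below touches only reachable objects, while the sweep phase necessarily walks the whole object table.

struct Obj {
    marked: bool,
    children: Vec<usize>, // indices of objects this one references
}

/// Mark phase: visits live data only.
fn mark(heap: &mut [Obj], roots: &[usize]) {
    let mut stack: Vec<usize> = roots.to_vec();
    while let Some(i) = stack.pop() {
        if !heap[i].marked {
            heap[i].marked = true;
            stack.extend(heap[i].children.clone());
        }
    }
}

/// Sweep phase: visits every object, live or dead.
fn sweep(heap: &mut [Obj], free_list: &mut Vec<usize>) {
    for (i, obj) in heap.iter_mut().enumerate() {
        if obj.marked {
            obj.marked = false; // reset the mark for the next collection
        } else {
            free_list.push(i); // dead: reclaim its slot
        }
    }
}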

osa1 commented 4 years ago

@rossberg

either spill (live) roots before calls, or always put reference-typed locals on the shadow stack (only). The latter seems much simpler, since you don't need to do anything around calls, and liveness analysis is much less crucial

I think liveness analysis is completely orthogonal to where/how we store references. For example, if, as you say, we always put references on the shadow stack and never on the Wasm stack or in locals, we'll still have to pop a from the shadow stack after the loop body; otherwise, in later iterations when the heap is full, we won't be collecting the previous, unused arrays bound to a.

Why do you think how/where we store references requires more or less precise liveness analysis?

osa1 commented 4 years ago

(I wish Github had threads ...)

@nomeata

Switch to a non-copying GC. We are fairly certain that that is more pressing than switching to an in-message GC, given that our platform will penalize dirtying pages.

I'm not so sure. The problem is that a non-copying GC will require a memory manager, with free lists for objects of various sizes. This will have its overheads. I've never implemented a memory allocator from scratch myself, so maybe the overheads are not too bad, but there's no escape from fragmentation in a non-copying GC. Btw, here's an example memory allocator in AssemblyScript. As far as I understand they use reference counting for GC.

Secondly, I asked about the costs of dirtying pages in #public-spec a while ago, but there weren't clear answers. So it's not clear what the tradeoffs are here. Perhaps people will decide that dirtying pages is not too important and the costs will be negligible? It seems we currently have no idea.

I think it makes sense to invest time into stuff that will be necessary regardless of what kind of GC we'll need (more on this below). Your (1) is one of these things. The shadow stack stuff for dynamic roots is also one of these: regardless of what kind of GC we implement, the GC will need to know the roots.

How does that sound? Ömer, do these (or one of these) steps sound interesting to you?

Yeah these all sound interesting to me.

@crusso

For example I always thought it was dubious to copy to-space over from-space. Do we need to do that rather than just logically flip the roles? I recall Joachim has some argument why this is good, but I don't remember what it was.

There's a very simple way of fixing this without making the RTS too much more complicated; the general idea is called in-place compaction, and I know how to do it because I maintained GHC's in-place compaction implementation (even fixed a ~19-year-old bug in it!), but I'm unable to implement it simply because I'm unable to change object layouts. My PR for trying that is #1931. The main problem is we have no debugger on the IC, so the only way to debug anything is with prints, but that's not good enough for a code generation or GC bug.

That's why I asked about debugger plans in #motoko last week IIRC. I'm simply unable to make progress without a debugger. Maybe it's something about me, I don't know...

Even simpler, do we currently avoid doing a gc if there was no allocation? I guess that might leave garbage around in case the heap was just modified, but not extended.

We don't have any checks before a GC, so no.

It'd be easy to implement this, but would the code path ever be taken? For example, in Haskell it's almost impossible not to allocate. Maybe it's not like that in Motoko, I don't know.

rossberg commented 4 years ago

@nomeata:

Switching to a non-copying is inevitable, isn’t it?

Why is that?

@osa1:

Why do you think how/where we store references requires more or less precise liveness analysis?

Not strictly necessary, but when you need to save/restore all local variables on every call, it might be more pressing to minimise the set.

nomeata commented 4 years ago

Switching to a non-copying is inevitable, isn’t it?

Why is that?

My impression is that the system will charge a lot for changing pages, and it would be absurd to dirty all pages in a GC run. (I think of running on the IC as running on a SSD: it’s fast, but you want to avoid writing to too many pages).

What kind of non-moving GC do you mean? A simple mark-sweep will still visit the entire heap: the mark pass visits live data, the sweep pass visits the dead. In that sense the current GC is actually better. I guess there could be smarter ideas that visit only garbage, but I'm not aware of any. I think there's no way around visiting live data (either for marking or for copying).

Reading is still much cheaper than actually writing the pages. Can we keep the marks elsewhere? Maybe in a compact bitmap somehow?

The shadow stack stuff for dynamic roots is also one of these: regardless of what kind of GC we implement, the GC will need to know the roots.

Only if we need to do a GC within a message. Given that message execution must be short, I am not sure we do. And unless we know for sure we do, I’d rather not invest into a shadow stack (which will certainly be much more expensive than using the Wasm stack like we do now).

but would the code path ever be taken

Unlikely. You allocate already to copy the serialized argument to memory, even if it turns out to be DIDL\0\0.

osa1 commented 4 years ago

As I briefly mentioned above, in my experience the biggest problem with implementing any of these (a different GC strategy, memory allocator, shadow/root stack, ...) is that we don't have a debugger for the IC. In #1931 I tried a very simple thing: I added one more field to object headers. The idea is that I want to find and update all hard-coded object sizes and offsets in the RTS and code generator, so that when something about an object size or layout changes I'll update one place and the code generator and RTS will still work.

I debugged it for about two weeks and fixed many bugs, but it still doesn't work. There's some assumption (I think in the code generator, not in the RTS) somewhere that doesn't hold after the update, but I can't find it. The only debug tool I have is debug prints, which is not helpful in this case (or at least I don't know how to debug it with just prints).

So I think having a better debugging story is also important here. This is not just for Motoko implementors, Rust CDK users will also need debugging at some point.


@nomeata

Reading is still much cheaper than actually writing the pages. Can we keep the marks elsewhere? Maybe in a compact bitmap somehow?

Yeah, keeping mark bitmaps in a separate memory is possible. That's what we do in GHC's non-moving collector, since we can't change object layouts (that would require recompiling the entire universe) and we don't want to add memory overhead when the non-moving GC is not used.
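As a sketch, such an out-of-line bitmap is just one mark bit per heap word, held in a separate buffer so that setting marks dirties the bitmap's pages rather than the heap's (illustrative names, not the actual RTS):

struct MarkBitmap {
    bits: Vec<u8>, // one bit per 4-byte heap word, stored outside the heap
}

impl MarkBitmap {
    fn new(heap_words: usize) -> Self {
        MarkBitmap { bits: vec![0u8; (heap_words + 7) / 8] }
    }
    /// Set the mark bit for the object starting at `word_index`.
    fn mark(&mut self, word_index: usize) {
        self.bits[word_index / 8] |= 1 << (word_index % 8);
    }
    fn is_marked(&self, word_index: usize) -> bool {
        self.bits[word_index / 8] & (1 << (word_index % 8)) != 0
    }
}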

crusso commented 4 years ago

All good ideas. But I would not invest any more work into our copying GC. Switching to a non-copying GC is inevitable, isn’t it?

Regarding diagnostics, we now have primitives for measuring cycle consumption.

Do we? (I don’t think ic0.canister_balance will update as you go: Instead, before the message gets executed, some MAX_CYCLES_PER_MESSAGE are subtracted from the balance, the message is run, and afterwards whatever you didn't use is added again. At least that’s what I think is happening.)

I'm not sure that's what I was seeing, but let me verify - perhaps I was observing between awaits.

nomeata commented 4 years ago

Reading is still much cheaper than actually writing the pages. Can we keep the marks elsewhere? Maybe in a compact bitmap somehow?

Yeah, keeping mark bitmaps in a separate memory is possible. That's what we do in GHC's non-moving collector, since we can't change object layouts (that would require recompiling the entire universe) and we don't want to add memory overhead when the non-moving GC is not used.

Sounds good. So I guess the priorities are

(1) a debugger (2) a profiler (3) a non-moving GC

rossberg commented 4 years ago

@nomeata:

My impression is that the system will charge a lot for changing pages, and it would be absurd to dirty all pages in a GC run.

Yes, but we don't have to throw out the baby with the bathwater. Perhaps what we should go for is a paged heap, where pages are collected one at a time and evicted when a certain lower threshold is hit.

I’d rather not invest into a shadow stack (which will certainly be much more expensive than using the Wasm stack like we do now).

That is an assumption that many people have, but AFAIK there is no clear evidence for it. At least for C, the overhead of the shadow stack was hardly measurable, and likely dominated by other costs.

rossberg commented 4 years ago

Relevant paper:

https://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf

There might be more recent ones, too, but I need to track them down.

crusso commented 4 years ago

There's a very simple way of fixing this without making the RTS too much more complicated; the general idea is called in-place compaction, and I know how to do it because I maintained GHC's in-place compaction implementation (even fixed a ~19-year-old bug in it!), but I'm unable to implement it simply because I'm unable to change object layouts. My PR for trying that is #1931.

That sounds more complicated than what I'm thinking of. I'm literally just suggesting maintaining a from-space and a to-space pointer and swapping their roles, but maybe that's hard for some reason. I thought that was how simple copying collectors worked. Perhaps the obstacle for us is memory growth: copying everything from to-space to from-space after GC makes it easy to grow memory without moving stuff around.

The main problem is we have no debugger on the IC, so the only way to debug anything is with prints, but that's not good enough for a code generation or GC bug.

Well, I'm sure Gabor would appreciate a hand with the debugger ;-> but I imagine you want to debug the RTS code, and I'm not sure his work will easily help with that.

nomeata commented 4 years ago

I'm literally just suggesting maintaining a from-space and a to-space pointer and swapping their roles, but maybe that's hard for some reason. I thought that was how simple copying collectors worked. Perhaps the obstacle for us is memory growth: copying everything from to-space to from-space after GC makes it easy to grow memory without moving stuff around.

That works if you have a full virtual memory and can divide it in half. If we don't do that, we’d need a page allocator beneath, and/or we’d have to worry about fragmentation. Let’s move that sub-discussion onto https://github.com/dfinity-lab/motoko/issues/1670

crusso commented 4 years ago

All good ideas. But I would not invest any more work into our copying GC. Switching to a non-copying GC is inevitable, isn’t it?

Regarding diagnostics, we now have primitives for measuring cycle consumption.

Do we? (I don’t think ic0.canister_balance will update as you go: Instead, before the message gets executed, some MAX_CYCLES_PER_MESSAGE are subtracted from the balance, the message is run, and afterwards whatever you didn't use is added again. At least that’s what I think is happening.)

I'm not sure that's what I was seeing, but let me verify - perhaps I was observing between awaits.

I stand corrected. It's only if you await that you see the decrease in cycles. Sorry @osa1, not much use for you.

nomeata commented 4 years ago

This was discussed before, but JFR (just for reference) in this GC-brainstorming thread:

Maybe we can assume that it is cheap to modify unused pages and mark them as unused at the end of the message (i.e. the changes would not be persisted). That, and also typical workloads, might suggest that we want a simple copying (and compactifying) GC as the first generation, which cheaply gets rid of garbage that is not added to the persisted state; and then use a non-moving GC for whatever survives a message.

osa1 commented 3 years ago

I worked on a few different ideas last week for a new GC. Here are my notes and updates:

I started by implementing a mark-compact GC in branch osa1/compacting_gc_2 (based on #2210) that requires two data structures for bookkeeping:

- a mark bitmap
- a mark stack

The GC doesn't move objects to a new space. The bitmap has a constant size, so before a GC we allocate it first, then allocate the mark stack, which can grow dynamically. The heap after allocating the bitmap and mark stack looks like:

| static data | heap | bitmap | mark stack |

Bitmap is implemented here and mark stack is implemented here. The implementations are reusable in other kinds of GCs and they are tested using quickcheck (1, 2).

The algorithm I wanted to implement is "threaded compaction" (algorithm 3.3 in The GC Handbook, also briefly described in my comment here). GHC uses the same algorithm for the oldest generation when residency passes a threshold. I fixed a few bugs in that code before and reimplemented some of the parts, so I know the algorithm well. The good thing about this algorithm is that it doesn't require extra space for compaction.

Unfortunately I discovered an issue that makes that algorithm difficult or impossible to use, at least directly: interior pointers to blob payloads in BigInt objects cause problems for this algorithm. The GC Handbook notes "Jonkers does not support interior pointers ..." on page 38. I had never thought about this before; I kind of always assumed no interior pointers, as I never had to deal with them before (GHC doesn't have interior pointers). I realized it when I tried to implement threading for BigInts. As the book also notes, it's possible to support this, but we need extra bits somewhere in the object and I think we don't have any such bits currently, but I'm not sure. I'm still thinking about this.

(Btw, @ggreif, do you know how GHC uses GMP? GHC doesn't have interior pointers, so I wonder how it allocates space for GMP objects and allows resizing blobs/byte arrays for GMP integers.)

Other than this the algorithm is mostly complete in branch osa1/compacting_gc_2. Note that it's not tested at all. I'll try to complete it without the bigint support and test it.

I also experimented with using "visitors" in the RTS, to avoid writing the same pattern for visiting pointer fields of objects again and again. It works nicely for the current GC but I couldn't use it in the compacting GC as I'm not sure how to visit a BigInt when threading an object in the compacting GC. Once that's figured out I think we could implement zero-cost visitors for traversing an object.

Finally, I updated #1931 (rebased it on #2210), which works much better now (more tests pass). We don't really need it (we could always allocate the bookkeeping data in a new space, and use the bitmap for visiting only live or only dead objects), but it'd be good if I could update object headers easily (we could, for example, use one bit in the header for marking, or link all objects together and use malloc/free for allocation and deallocation). An interesting fact about that patch is that it fails differently on my system and on CI. It would be good to know what's causing that difference. I wonder if it could be another bug somewhere.

osa1 commented 3 years ago

Another update: I managed to fix the last bug in #1931 today, so I can now implement a very simple mark-sweep collector that does not move objects at all. It should also work fine with interior pointers. I'll try to do that this week unless something more important comes up at the NNS meeting today.

nomeata commented 3 years ago

Exciting!

osa1 commented 3 years ago

Regarding interior pointers in BigInt objects, I've been trying to understand how GHC can implement big integers (using GMP) without interior pointers while we can't (using tommath). GMP and tommath types are very similar, so I was expecting GHC to have the same problems:

// gmp
typedef struct
{
   int _mp_alloc;
   int _mp_size;
   mp_limb_t *_mp_d;
} __mpz_struct;

// tommath
typedef struct  {
   int used, alloc;
   mp_sign sign;
   mp_digit *dp;
} mp_int;

I think I figured it out. As a reminder, in Motoko a BigInt is an object that holds the __mpz_struct for the number, and the pointer points to the payload of a blob (an interior pointer). This pointer causes problems in GC because interior pointers are problematic for a lot of algorithms (sometimes impossible to support, depending on your object layouts).

GHC does not use this GMP struct at all. An Integer (our BigInt) is just a byte array for the mp_limb_ts of the number. This works because GMP provides an API that works directly on mp_limb_t arrays. For example, here's the addition function:

typedef mp_limb_t *mp_ptr;
typedef const mp_limb_t *mp_srcptr;
typedef int mp_size_t;

__GMP_DECLSPEC mp_limb_t mpn_add (mp_ptr, mp_srcptr, mp_size_t, mp_srcptr, mp_size_t);

(return value is the carry)

Documentation for this low-level interface to GMP: https://gmplib.org/manual/Low_002dlevel-Functions

Crucially, this API requires the caller to make sure the destination operand (the first arg of mpn_add) has enough space, which is easy to do (just the max size of the args + 1 for addition, for example).

In our case, as far as I can see from the documentation, tommath doesn't provide such an API, and it needs to be able to reallocate the mp_digit array, so we are stuck: we have to provide an mp_realloc that allocates a blob and returns a pointer to its payload (an interior pointer), and track that pointer in the GC.
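To sketch what that hook might look like (the hook signature, alloc_blob, and PAYLOAD_OFFSET are all assumptions for illustration, not tommath's documented API or the actual RTS):

use core::ffi::c_void;

const PAYLOAD_OFFSET: usize = 8; // assumed bytes from a Blob header to its payload

/// Assumed RTS allocator: returns a pointer to a fresh Blob's header.
unsafe fn alloc_blob(_bytes: usize) -> *mut u8 {
    unimplemented!("stand-in for the real allocator")
}

/// The realloc hook we'd hand to tommath: allocate a new Blob, copy the old
/// digits over, and give the library an *interior* pointer to the payload.
unsafe extern "C" fn mp_realloc(old: *mut c_void, old_size: usize, new_size: usize) -> *mut c_void {
    let blob = alloc_blob(new_size);
    let payload = blob.add(PAYLOAD_OFFSET);
    core::ptr::copy_nonoverlapping(old as *const u8, payload, old_size.min(new_size));
    payload as *mut c_void // the GC must treat this as a root into the Blob
}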

--

@ggreif do you know if tommath provides a lower-level API (maybe undocumented) for this?

osa1 commented 3 years ago

I think if we could pass mp_int structs with large enough mp_digit arrays, we could make sure mp_realloc will never be called. Then we could implement BigInt objects as

| Header (BigInt) | used | alloc | sign | dp | byte array |

Where dp points to the next word (to the byte array). So we remove the pointer to another heap object.

We also need to avoid using mp_init as it calls mp_calloc, which we also can't implement without returning an interior pointer. For that I think we'll have to copy and modify mp_init_size function: https://github.com/libtom/libtommath/blob/develop/mp_init_size.c

This is very hacky though. I wish tommath supported a low-level API like GMP's.
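As a Rust sketch of that layout (assuming 32-bit words; the names and the zero-length trailing array are illustrative, not the actual RTS definition):

#[allow(non_camel_case_types)]
type mp_digit = u32;

#[repr(C)]
struct BigInt {
    header: u32,           // object tag
    used: i32,             // tommath: digits currently in use
    alloc: i32,            // tommath: digits allocated
    sign: u32,             // tommath: mp_sign
    dp: *mut mp_digit,     // points at `digits` below, i.e. into the same object
    digits: [mp_digit; 0], // the inline "byte array" holding the digits
}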

nomeata commented 3 years ago

I am a bit lost as to why we have a problem at all: the only internal pointers are the ones in the 5th word of BigInt objects. The GC already knows, for each heap object type, which words are pointers and which are not. The only thing we need to add is the knowledge that the 5th word in a BigInt is a pointer that has an offset applied to it. Apply that offset when reading the pointer, and it becomes a normal pointer.

In general, internal pointers are hard (e.g. pointing to an arbitrary point in some array or structure). But we are not facing that general situation, are we?

osa1 commented 3 years ago

Many GC algorithms have problems with internal pointers; it's a well-known issue (there are a lot of discussions in The GC Handbook). Jonkers (the in-place compaction algorithm I'm implementing in branch osa1/compacting_gc_2) is one of them. I think if you read the algorithm it should become clear why it's a problem. It's hard to explain without describing the algorithm, and the algorithm is not trivial.
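For intuition, here is a toy version of Jonkers-style threading/unthreading on a word heap (the pointer encoding is invented for the example). The point is the unthreading loop: every field in a chain is restored to the same address, so a field that originally held address + offset (an interior pointer) would lose its offset.

fn skew(addr: usize) -> usize { addr * 2 + 1 }   // toy pointer encoding: odd
fn unskew(word: usize) -> usize { word / 2 }
fn is_ptr(word: usize) -> bool { word & 1 == 1 } // tags are even, pointers odd

/// Thread one field that points to an object: link it into the chain
/// rooted at the object's header word.
fn thread(heap: &mut [usize], field: usize) {
    let obj = unskew(heap[field]);
    heap[field] = heap[obj]; // field now holds the previous chain head
    heap[obj] = skew(field); // header now points at the field
}

/// Unthread: once the object's new address is known, walk the chain and
/// update every field, then restore the saved header word.
fn unthread(heap: &mut [usize], obj: usize, new_addr: usize) {
    let mut link = heap[obj];
    while is_ptr(link) {
        let field = unskew(link);
        let next = heap[field];
        heap[field] = skew(new_addr); // every field gets exactly `new_addr`:
        link = next;                  // an interior pointer's offset is lost
    }
    heap[obj] = link; // the non-pointer word at the end is the original header
}

fn main() {
    let mut heap = [0usize; 16];
    heap[10] = 42;      // object header (even tag) at address 10
    heap[2] = skew(10); // two fields pointing at the object
    heap[5] = skew(10);
    thread(&mut heap, 2);
    thread(&mut heap, 5);
    unthread(&mut heap, 10, 3); // pretend the object moves to address 3
    assert_eq!(heap[2], skew(3));
    assert_eq!(heap[5], skew(3));
    assert_eq!(heap[10], 42); // header restored
}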

I have an idea of how to support it. I'm currently implementing it without BigInt support. I'll then experiment with supporting BigInts.

I also found the paper "Fast garbage compaction with interior pointers", which describes a way to support interior pointers in Jonkers, but it requires that interior pointers point to some kind of header that is distinguishable from non-pointers and other headers, which is not the case in Motoko currently, so the idea is not directly applicable.

nomeata commented 3 years ago

Right, but I am arguing that we don’t have internal pointers, just pointers with varying offsets (-1 in most cases, 19 or something in this one case), so none of that should apply to us?

nomeata commented 3 years ago

Jonkers (the in-place compaction algorithm I'm implementing in branch osa1/compacting_gc_2) is one of them

Reading up on Jonkers (found a concise overview at https://wiki.haskell.org/Yhc/RTS/GC, hope it applies).

Are you saying that these “internal” pointers (well, pointers with a different offset) are a problem for Jonkers's idea of “threading”, because the code that creates or follows this threading doesn't have the luxury of knowing that the dp field is a pointer with a different offset? Hmm… I guess there might be tricks – the chain of pointers in such a thread are all pointers, so maybe one can spare a low bit to denote “different offset here”.

Anyways, is Jonkers even an attractive GC for us? With all the churn from threading pointers and compactification? The goal is to have as few pages dirtied as possible, so I am not sure if it is a much better fit than a copying GC (or is it the case that usually most objects are neither touched because of threading, nor moved because of compactification?).

osa1 commented 3 years ago

Right, but I am arguing that we don’t have internal pointers, just pointers with varying offsets (-1 in most cases, 19 or something in this one case), so none of that should apply to us?

You're giving it a different definition and then saying we don't have it. An internal pointer, as the term is used everywhere, is a pointer to somewhere other than an object header. In that sense we do have internal pointers in BigInts.

Where do we have pointers with offset 19? I'm not aware of that case, and we don't have such a case handled in the current GC.

Reading up on Jonkers (found a concise overview at https://wiki.haskell.org/Yhc/RTS/GC, hope it applies).

I took a quick look and it seems to describe the algorithm I'm talking about, yes.

Are you saying that these “internal” pointers (well, pointers with a different offset) are a problem for Jonkers's idea of “threading”, because the code that creates or follows this threading doesn't have the luxury of knowing that the dp field is a pointer with a different offset?

Right.

Hmm… I guess there might be tricks – the chain of pointers in such a thread are all pointers, so maybe one can spare a low bit to denote “different offset here”.

Yeah, if we could somehow mark the pointer field in BigInts that would work. That's why I said "but we need extra bits somewhere in the object and I think we don't have any such bits currently, but I'm not sure. I'm still thinking about this" above.

Anyways, is Jonkers even an attractive GC for us?

I think it's great for our purposes. It allows bump allocation, doesn't require a new space to copy the live data into (it does compaction in the same space), and doesn't require any extra bookkeeping (though marking requires bookkeeping for the mark stack/queue). I already implemented most of it, but it currently has bugs.

If our perf tests don't use BigInts we can benchmark it without implementing BigInt support.

nomeata commented 3 years ago

Where do we have pointers with offset 19? I'm not aware of that case, and we don't have such a case handled in the current GC.

Sorry, not 19, I guess I meant 8 (or 9 relative to the skewed pointer) – when you point to the Blob payload instead of the byte before the object header.

we need extra bits somewhere in the object

It might suffice to store the bit in the threading pointer to the “special” field, so when you are following the thread, you know that you have to restore it with the different offset:

Before Threading:

            ┌→
 [root]     │ [Bignum]         [Blob]
 [    ] ────┘ [size  ]         [size]
              [sign  ]      ┌─→[data]
              [      ] ─────┘

After threading:

 [root]       [Bignum]       ┌╴[ BB  ]
 [Blob]←────┐ [size  ]       │ [size]
            │ [sign  ]      ┌┘ [data]
            └╴[  AA  ]←─────┘

So it seems you’d need a bit in the pointer BB so that when updating this chain of pointers, you know that the location pointed to by BB doesn't take a skewed pointer (offset -1), but a payload-of-blob pointer (offset 8). Maybe that’s precisely what you had in mind, of course :-)

Not sure if an extra bit in the heap object helps you; from my understanding when you go through the list of threaded pointers, you don't know what kind of objects they are in.

I think it's great for our purposes … bump allocation … no new space

The primary problem with our GC, on our platform, is that it touches almost all pages. Which pages would Jonkers touch? Due to threading, maybe still most?

If our perf tests don't use BigInts we can benchmark it without implementing BigInt support.

We can’t measure dirtied pages at the moment. I would prefer to have a theoretically convincing argument that the algorithm writes to as few pages as possible.

OTOH, this GC is certainly an improvement over the current GC, so I'm not objecting to that, just saying that it might not yet be a great fit.

crusso commented 3 years ago

That makes sense to me. We essentially have two kinds of pointers for threading purposes and can use a single bit in the pointer representation to distinguish them. Right?

nomeata commented 3 years ago

Yes. One complication is that we need to distinguish pointers from the data at the object head (which terminates the thread, and needs to be moved back after resolving the thread). So we need to be able to distinguish pointers with the bit, pointers without the bit, and all the object tags. Should be possible, though.

Or we use the fact that all pointers to such a payload are mp_int structures. So if we use a separate object type (MPDigits instead of Blob) for the payload, we can know that all pointers on this thread are of this form by looking at the object tag at the end.

Many options; I guess we can trust Ömer to pick a suitable one :-)

nomeata commented 3 years ago

We can continue this at https://github.com/dfinity/motoko/pull/2223 (thanks for opening that PR)

osa1 commented 3 years ago

Re: tagging the pointer BB in the example above: I think we need to tag AA (the value in the field which previously held the internal pointer) instead of BB, so that when we unthread the blob it points to and come across that field with value AA, we know that we should put free + payload offset there instead of just free. I think we can't look at the next field in the thread (for BB) because that field may be the last in the thread. Maybe I misunderstood the idea, though.

So if we use a separate object type (MPDigits instead of Blob) for the payload, we can know that all pointers on this thread are of this form by looking at the object tag at the end.

I like this idea the most. It's simple, and we have enough bits in object headers so adding one more object type is not a problem. I'll try this in my branch.

osa1 commented 3 years ago

So if we use a separate object type (MPDigits instead of Blob) for the payload, we can know that all pointers on this thread are of this form by looking at the object tag at the end.

I like this idea the most. It's simple, and we have enough bits in object headers so adding one more object type is not a problem. I'll try this in my branch.

Unfortunately this won't work unless we make sure that blobs for mp_int digits won't ever be referenced by other objects. I think this may currently hold, but I'm not sure. It seems difficult to ensure that it will always hold.

Here's what happens when this doesn't hold. I have an MPDigits object (a blob with a different tag) which is pointed to by a BigInt. If I also somehow point to it from, say, a MutBox, then when unthreading I also leave an internal pointer in the MutBox's field in addition to the BigInt's. But we only want to leave an internal pointer in BigInt fields, not in other fields.

osa1 commented 3 years ago

Even if we make sure nothing points to an MPDigits object (only to its payload), the idea above is inefficient, because before unthreading we need to get to the end of the thread to see whether the tag is MPDigits. That's one extra traversal of the thread.

GHC actually does this (traverses threads twice), and there's a TODO comment about it from 2007 :-) The problem in GHC's case is that we don't know whether we can update the fields in a thread to free or whether we have to switch to the next block (the GHC heap is a linked list of 16KiB blocks). For example, if free is 0x1000000 and the current block ends at 0x1000010, we have room for two (64-bit) words. If the current object is 4 words large, then we can't just update the fields in the thread to 0x1000000; we need to switch to the next block and update the thread to point to the beginning of that block. But we don't know the current object's size without unthreading it first. So we make two passes: one to get its info table (from which we can compute the object size), and a second pass to unthread the chain. The actual comment is here if you're interested :-)

crusso commented 3 years ago

What about just adding a word to the BigInt object that stores the (exterior) pointer to the Blob as usual, and just copying the interior pointer to the struct when doing the actual arithmetic with the C library? Or just fabricating the struct on demand.

osa1 commented 3 years ago

Hmm, I think that would work. We need to keep in mind that the C library can reallocate the buffer, though (using the mp_realloc that we provide). So after each call to the library (with fabricated structs as arguments), we need to update the exterior Blob pointers from the structs we passed to the library, since they could have been updated (by mp_realloc). I will implement this and see if anything goes wrong.

crusso commented 3 years ago

Actually, there's probably no need for an additional field. You could just adjust the existing field before and after an operation.

crusso commented 3 years ago

Would the C library actually reallocate? Aren't the operations functional? Just curious...

osa1 commented 3 years ago

Would the C library actually reallocate? Aren't the operations functional? Just curious...

They're functional (they don't change their inputs), but they take an output argument. If the output doesn't have enough space, the operation reallocates. Here's the type signature of addition:

pub fn mp_add(a: *const mp_int, b: *const mp_int, c: *mut mp_int) -> mp_err;

Third argument is the output.

Allocation of the buffer for "digits" (the number) is also done by the library, via mp_alloc that we also provide. Here's the full wrapper for the BigInt addition function:

#[no_mangle]
pub unsafe extern "C" fn bigint_add(a: SkewedPtr, b: SkewedPtr) -> SkewedPtr {
    let r = bigint_alloc();
    check(mp_add(
        a.as_bigint().mp_int_ptr(),
        b.as_bigint().mp_int_ptr(),
        r.as_bigint().mp_int_ptr(),
    ));
    r
}

bigint_alloc allocates a BigInt which holds the mp_int struct, and then passes the struct to the library's mp_init function. mp_init allocates the space for the digits and updates the pointer field in the mp_int (which then points to the payload of a Blob). After that we end up with a BigInt object that points to the payload of a Blob.

crusso commented 3 years ago

So I guess our mp_alloc is allocating a blob and returning the interior pointer to the blob payload?

osa1 commented 3 years ago

Exactly.

crusso commented 3 years ago

Ok, so it's always safe to calculate a blob pointer from the payload pointer "returned" by one of these ops. Cool.

Actually, there's probably no need for an additional field. You could just adjust the existing field before and after an operation.

Does that make sense then?

osa1 commented 3 years ago

Do you mean an additional field to point to the blob payload? (in addition to a pointer that points to the blob's header)

crusso commented 3 years ago

Well, originally I suggested adding a field to our BigInt object to record the exterior pointer, but we can just store that in the existing field and adjust it before sending out the struct or receiving an output.
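A sketch of that adjustment, reusing the mp_int layout and mp_add declaration quoted above; PAYLOAD_OFFSET and the helper names are assumptions for illustration, not the actual RTS code. The reverse adjustment is safe even if the library reallocated, because our mp_realloc only ever hands out payload pointers that sit a fixed offset past a Blob header:

#[allow(non_camel_case_types)]
type mp_digit = u32;
#[allow(non_camel_case_types)]
type mp_err = i32;

#[allow(non_camel_case_types)]
#[repr(C)]
struct mp_int {
    used: i32,
    alloc: i32,
    sign: u32,
    dp: *mut mp_digit,
}

extern "C" {
    fn mp_add(a: *const mp_int, b: *const mp_int, c: *mut mp_int) -> mp_err;
}

const PAYLOAD_OFFSET: usize = 8; // assumed bytes from a Blob header to its payload

/// Before an operation: rewrite dp from the Blob (exterior) pointer to the
/// payload (interior) pointer that tommath expects.
unsafe fn to_interior(n: *mut mp_int) {
    (*n).dp = ((*n).dp as usize + PAYLOAD_OFFSET) as *mut mp_digit;
}

/// After an operation: rewrite dp back to the Blob (exterior) pointer, which
/// is recoverable because dp always points PAYLOAD_OFFSET past a Blob header.
unsafe fn to_exterior(n: *mut mp_int) {
    (*n).dp = ((*n).dp as usize - PAYLOAD_OFFSET) as *mut mp_digit;
}

unsafe fn bigint_add_adjusted(a: *mut mp_int, b: *mut mp_int, r: *mut mp_int) -> mp_err {
    to_interior(a);
    to_interior(b);
    to_interior(r);
    let err = mp_add(a, b, r);
    to_exterior(a); // between operations, dp holds exterior pointers,
    to_exterior(b); // so the GC never sees an interior pointer
    to_exterior(r);
    err
}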