conservative stack scanning?

StefanKarpinski commented 9 years ago

Instead of maintaining our own GC stack frames – which is slow and easy to mess up, likely accounting for a large portion of our tough-to-track-down core Julia bugs in the past years – we've occasionally talked about doing conservative stack scanning. That is, scanning the stack and conservatively assuming that anything that happens to look like a pointer to a valid heap-allocated Julia object actually is.

Pros:

much simpler
faster (https://github.com/JuliaLang/julia/pull/11508, https://github.com/JuliaLang/julia/pull/11284)
makes stack allocation much easier (https://github.com/JuliaLang/julia/pull/8134)
immutables with references trivially become as efficient as reference-free immutables
fewer segfaults
Cons:
might retain objects that should be freed
pointers passed into C code could still expect objects to live but not point to the head of the object
performance of stack scanning

The second con seems like it can be avoided with the same kind of rooting we do now, except only needed at the ccall entry point. The chances of the first con happening – i.e. random values in the stack pointing valid heap objects – seems pretty slim. There might be situations like when you take a pointer to an array and then use it – but in that case you probably want the array to remain rooted while someone has a pointer to it. Are there other potential cons that I'm not seeing?

cc: @carnaval, @vtjnash, @JeffBezanson

yuyichao commented 9 years ago

The only thing I know about conservative vs precise stack scan is based on this and I quote.

...... Language implementations with automatic memory management often start out using exact rooting. At some point, all the careful rooting code gets to be so painful to maintain that a conservative scanner is introduced. The embedding API gets dramatically simpler, everyone cheers and breathes a sigh of relief, and work moves on (at a quicker pace!)

Sadly, conservative collection has a fatal flaw: GC things cannot move. If something moved, you would need to update all pointers to that thing — but the conservative scanner can only tell you what might be a pointer, not what definitely is a pointer. Your program might be using an integer whose value happens to look like a pointer to a GC thing that is being moved, and updating that “pointer” will corrupt the value. Or you might have a string whose ASCII or UTF8 encoded values happen to look like a pointer, and updating that pointer will rewrite the string into a devastating personal insult.

I know that we are propably even further from compacting but I'd like to know what the experts on these think...

StefanKarpinski commented 9 years ago

Our large amount of C interop makes it fairly unlikely that we'll ever be able to do moving GC.

JeffBezanson commented 9 years ago

It's not clear that we need to move objects. If we did, it would already require a lot of work, since the current rooting API assumes non-moving. That article seems to assume that moving is required for generational GC, but that isn't strictly true. You only need it for compacting or for a copying implementation of generational gc.

For us it's possible the conservative stack scan would be a bigger win than moving objects, especially given how important C interop is.

carnaval commented 9 years ago

To the pros I would add that we would be able to do many optimizations we do for bitstype on immutable types with no additional work. That's huge.

On the moving debate : we will never be able to do a fully relocating collector, because of C interop as Stefan said. However, it would be perfectly ok to do a so called "mostly copying" scheme, and we would need conservative stack scanning (to pin "C" objects) anyway.

The thing I don't agree on is that the chance of something on the stack looking falsely like a pointer is slim. The problem is that the stack is never cleaned up, so as it grows back you can end up overwriting parts (because of padding, alignment) of what was a pointer before. If you overwrite the LSB part you are on the contrary almost guaranteed to point inside a (different) julia object.

On the implementation side, the part taking care of pooled object is already "done" in my PR (which is where I encountered the partial pointer stuff btw). It's the easy one though because they are fixed size. To do this for "big" objects we need a datastructure to query if a pointer falls into a range of memory.

JeffBezanson commented 9 years ago

We can only properly evaluate false retention once we have a full implementation, and measure memory use in real workloads. Until then I'm not worried.

carnaval commented 9 years ago

As we already discussed, I'm not worried too much about the memory usage of false retention, but about the angry "my finalizer was not run after gc()" crowd. Anyway, you're right, we would need to implement big object conservative scanning first to see it in real life.

JeffBezanson commented 9 years ago

Well, you know how I feel about that. The only real solution is to use finalizers less and rely on them less.

ScottPJones commented 9 years ago

@JeffBezanson If you don't have reliable finalization, how do you handle externally managed memory reliably? Since C interop is so important, you need some way to be able to call C release code before something is GCed. Or am I missing something?

ScottPJones commented 9 years ago

@carnaval I agree that there'd be a high probability of seeing false positives of things looking like valid pointers still on the stack.

carnaval commented 9 years ago

"high probability" of seeing at least one when the programs run, sure. Now the thing which is hard to evaluate is : how long would false positive remain on the stack (after all if it is freed at the next call who cares), and with what density.

ScottPJones commented 9 years ago

Depends a lot on the application... if you have processes whose lifetime is measured in months or years, with the amount of stack fluctuating a lot, I think it might be more of an issue...

yuyichao commented 9 years ago

How reliable would it be to insert some mark on the stack to distinguish managed code and unmanaged code?

@carnaval Is the "almost moving" scheme a compacting collector that pins everything on the stack?

carnaval commented 9 years ago

You can mix conservative and precise stack frames, but the point of this is to get rid of the gcframe entirely, both in the runtime and the generated code. We can also have conservative scanning of C frames and precise scanning of julia frames using the llvm statepoint thing (more work).

The mostly copying stuff is a general technique, you just have a part of the roots you are not sure about. It can be the whole stack or only a part of it, or even some heap objects you wouldn't know the layout of (although we don't need this).

The canonical mostly copying (I know of) is http://www.cs.cornell.edu/home/fms/mcc/bartlett.html if you want to use copying as a marking segregation ("replacing" the marked bit / mark queue). Copying has other uses : compaction, generation segregation, ...

yuyichao commented 9 years ago

@carnaval Thanks for the explaination (and the pointer). I didn't know that the rooting can be fine grain classified like this before.

My feeling before was that compacting and the growing ability in LLVM to identify GC roots are the two things that are incompatible with conservative stack scan. Good to know that both of them are still possible. (Although they are still as hard to implement as they currently are now...)

StefanKarpinski commented 9 years ago

It's important that if we choose to do conservative stack scanning now we get all of the benefits immediately and can do more work to get back partial ability to do moving GC – which we're not even doing right now.

yuyichao commented 9 years ago

I totally agree and I find it very promising the first time @carnaval mentioned it to me.

It's just that when I want to find more detailed study about it, the internet is flooded with moving collector and there's not even many places mentioning non-moving generational collector.....

StefanKarpinski commented 9 years ago

The thing I don't agree on is that the chance of something on the stack looking falsely like a pointer is slim. The problem is that the stack is never cleaned up, so as it grows back you can end up overwriting parts (because of padding, alignment) of what was a pointer before. If you overwrite the LSB part you are on the contrary almost guaranteed to point inside a (different) julia object.

I'm trying to get a better understanding of this concern. Is the issue that you might have a pointer stored in a register or stack frame and then reuse the low byte or something like that and end up with the same register holding something that still looks like a pointer but points at something else entirely?

I should mention that at some point I had talked with @carnaval about wanting the ability to have interior pointers into an object pin the whole object (particularly useful with arrays). That would obviously make it much more likely that some accidentally pointer-like value would pin an object since you don't have to point exactly at the head of the object – pointing anywhere inside of it pins it. But I think the benefits of conservative GC outweigh that, so I'd be willing to forgo interior pointers for conservative stack scanning.

yuyichao commented 9 years ago

In that case, are we sure that the compiler optimization will not optimize a store of the original pointer out if we do not explicitly root it?

StefanKarpinski commented 9 years ago

Right, that's another concern – can we force LLVM to store pointers in a recognizable form on the stack?

yuyichao commented 9 years ago

But then this doesn't feel so much different from the current approach. I would expect a naive implementation to have the similar performance hit with the current rooting. It is probably a different story with LLVM gc support but I guess that's not even on the table at this point....

And we would also need to do the same with our c/c++ code.

JeffBezanson commented 9 years ago

You have to conservatively scan registers too.

simonster commented 9 years ago

jsc (WebKit's JS engine) actually uses a Bartlett-style collector with LLVM, so it can be done. Regarding false retention, their blog post says (under "Integrating LLVM with garbage collection"):

Retention of some dead objects is of course possible, and we can tolerate it so long as it corresponds to only a small amount of space overhead. In practice, we’ve found that so long as we use additional stack sanitization tricks – such as ensuring that our compilers use compacted stack frames and occasionally zeroing regions of the stack that are known to be unused – the amount of dead retention is negligible and not worth worrying about.

But Julia workloads are probably a bit different from JS workloads.

carnaval commented 9 years ago

Yep, you have to scan registers too. Then if you are careful to check for interior pointers (because llvm might transform a begining ptr + offset into a single pointer) I don't see what kind of transformation could hide the pointers from us.

LLVM would have to actively "encrypt" those on the stack, which while a legal transform, I doubt is common.

JeffBezanson commented 9 years ago

because llvm might transform a begining ptr + offset into a single pointer

Excellent point; I hadn't thought of that. That is indeed very troubling.

yuyichao commented 9 years ago

And I guess we gave to check for minus offset as well because of the tag?

carnaval commented 9 years ago

Well any pointer inside the memory block keeps it alive. The only problem is for objects which consists only of their tag where the pointer might look like it's keeping alive the next object in the pool. Those are only singleton instances right ? Then they are kept alive by the type object anyway.

yuyichao commented 9 years ago

Another question that just comes to my mind just now is that if we don't explicitly root local variables anymore, is it possible that the object becomes unreachable in LLVM (and makes LLVM decides to override the pointer) before it gets out of scope in julia code causing the finalizer to be called too early?

This is probably not an issue if we have better support for managing scope-local resources and do not provide any ganrentee when is the finalizer is going to be called #11207.

carnaval commented 9 years ago

I don't think it's an issue, we make no guarantees on when object are finalized except that you can never access the object after it has been finalized. In particular there is no mention of scope anywhere.

StefanKarpinski commented 9 years ago

So basically, if we want to do conservative stack scanning, we need to support interior pointer pinning, so we might as well use it, right? But then we have to deal with the fact that it's relatively easier for a random pointer-like value to pin an object accidentally by pointing anywhere into it. But there's no particular concern about a pointer to an object accidentally being "tweaked" to point into another object.

ArchRobison commented 9 years ago

Another con to consider: We will not be able to move stacks. This would (prematurely in my opinion) preclude some concurrency models.

Go's elegant support for concurrency never requires the user to declare stack sizes for "go routines", which are essentially concurrent coroutines. The Go run-time starts stacks small (8 KB if I remember right) and grows a stack by moving it to a bigger chunk of memory. This is straightforward to do with precise GC information for the stack. Go deals with C interop by always running C code on a thread's original stack, and not allowing C code to call Go code.

StefanKarpinski commented 9 years ago

Go deals with C interop by always running C code on a thread's original stack, and not allowing C code to call Go code.

I really don't think we want to/can do that – being able to pass Julia callbacks to C APIs is a killer feature.

ArchRobison commented 9 years ago

Maybe we can lighten up the Go constraint by just making sure (via escape analysis) that addresses of stack locations are not passed to C code. E.g., not stack-allocate objects whose address might be seen by C code.

ScottPJones commented 9 years ago

I agree that Julia callbacks to C APIs is critical (I was using that from the beginning of my Julia experiences). I hope that Arch's idea can work.

carnaval commented 9 years ago

Just so that we don't forget : @vtjnash stumbled upon the dual stack ("safe stack") support in LLVM while scrolling LangRef. We could hack this to have it use the safe stack only for gc pointers (and spills of gc pointers in registers) so that we have a conservative scanner which is almost not conservative. Don't know how crazy this is.

yuyichao commented 9 years ago

@carnaval Do you have a pointer to the feature or elaborate a little more? I only find a link to this clang doc

and spills of gc pointers in registers

Does this mean that the original pointers will be spilled and we don't need to check for interior pointers anymore? (assuming we still explicitely root them in c code)

vchuravy commented 9 years ago

SafeStack support landed two weeks ago in llvm upstream.

http://reviews.llvm.org/rL239761 http://reviews.llvm.org/rL240473 http://dslab.epfl.ch/proj/cpi/

DemiMarie commented 9 years ago

:-1: from me. The main difficulty is that it makes a fully moving collector impossible. Some advanced GC algorithms require this.

Furthermore, LLVM now has safepoint support. This allows for precise GC via stack maps compiled by LLVM. This should be faster than conservative collection.

rbehrends commented 7 years ago

I've been experimenting with conservative stack scanning in the Julia GC. I've implemented something that so far seems to work ok for single threads and has basic threading support [1]. It's not currently a pull request since it's still a work in progress (some help with figuring out the remaining problems would actually be great) and I see that this ticket is still marked "decision" anyway.

Performance seems to be so far unchanged; as I've only added new code that is called on top of the existing one, that is not surprising (it shouldn't have a huge impact). One goal is (obviously) to eliminate GC stack frames as a result and gain performance benefits from that. I haven't (yet) dug more deeply into that: for the time being, my goal was primarily to have solid evidence that this is feasible in Julia without reengineering the GC.

Would this be the right place to have a discussion about where to go from here? Is there still generally an interest in doing this, if the expected performance benefits materialize?

I'll note that there are two reasons why I've been working on that. One is that we seem to have some performance problems that seem to be related to the current explicit GC frames. The other is that I'm part of a large external project [2], where we would like to integrate Julia with various computer algebra systems, such as GAP [3]; the specific goal is to have Julia actually handle the memory for part or all of those systems. Having conservative stack scanning as an option would greatly facilitate this.

I can talk more about our application if anyone is interested, but it's not really important here (except that we are also interested in the related issue of having support for marking foreign objects in the Julia GC, for which I can open another ticket to discuss [4]).

Note also that some of the technical details are discussed briefly in doc/src/devdocs/stackscan.md.

[1] https://github.com/rbehrends/julia (branch: rb/conservative-stackscan) [2] https://wbhart.blogspot.de/2016/11/new-computer-algebra-system-oscar_20.html [3] https://www.gap-system.org/ [4] We are generally interested in having support for Julia's GC to handle foreign objects directly where possible in order to eliminate the performance cost of additional indirections and allocations.

yuyichao commented 7 years ago

I strongly disagree with having a conservative stack scan unless we've exhausted all possible other implementations.

There's no going back.
The benefit of doing it is minimum and all of the issues mentioned in the original post can be solved with precise rooting and are being worked on.

Having conservative stack scanning as an option would greatly facilitate this.

Why is that?

support for marking foreign objects in the Julia GC

An integrated way to do that is to declare corresponding julia type.

StefanKarpinski commented 7 years ago

@rbehrends: first of all, welcome – it's cool that you've already had some success with trying out conservative stack scanning. Please don't take @yuyichao's thumbs down as anything but a disagreement with the means; it's merely a technical judgement that conservative stack scanning is not the best way to achieve shared performance goals. @yuyichao: you've mentioned on a number of occasions that you don't believe conservative stack scanning is necessary or the best way forward, but I don't really understand what your intended alternative is. So that we can get everyone on the same page and not have people like @rbehrends wasting their efforts, could you possibly write up a short description of your plan for addressing this issue? Perhaps we can redirect some of these efforts towards making that happen then.

yuyichao commented 7 years ago

I don't really understand what your intended alternative is

What issue?

StefanKarpinski commented 7 years ago

Your intended alternative to conservative stack scanning – what is it? A written plan for this issue would be welcome.

yuyichao commented 7 years ago

Your intended alternative to conservative stack scanning.

I mean alternative to conservative stack scanning to fix what?

And for the "pros" in the first post

much simpler

Not necessarily. We also need to make sure finalizer are not run too early since objects can have uses invisible to LLVM.

faster (#11508, #11284)

Minimum after Keno's PR.

makes stack allocation much easier (#8134) immutables with references trivially become as efficient as reference-free immutables

As mentioned in stackoverflow.com/questions/43932741/stack-allocation-of-isbits-types-in-julia this is the easiest part of anything. So any "effort" should go to the typeinf and codegen part.

fewer segfaults

With aggressive testing and static analysis tool I'm confident enough about the current rooting that I think random compiler transformation hiding pointers might cause more segfault than whatever missing root we might still have.

StefanKarpinski commented 7 years ago

What I'm saying is that we need a public document somewhere that lists these issues and provides a plan of how they are all going to be handled. I'm not saying that the information doesn't exist somewhere in the universe, but it is currently scattered across the internet and the heads of a half dozen people. It should be in one place where people outside the "inner circle" can see it, react to it, and most importantly help make progress towards it.

yuyichao commented 7 years ago

There is no public doument related to this since this is not in the plan anywhere AFAICT. In fact, the stack overflow response is the first time in months that I remembered that we still have an issue that we shouldn't do.

The multiple issues linked in this issue all have their own issue/prs for discussion.

faster (#11508, #11284)

Well, exactly those, and the TODO would be to just revive the PRs. I think I have a pretty detailed explaination about what I did in the PRs and can certainly response to questions there in case anyone want to do it.

makes stack allocation much easier (#8134)

Explained in that PR and the linked PR at https://github.com/JuliaLang/julia/pull/12205 (only the last commit in the current commit list in that PR is real) which doesn't need conservative scanning. What needs to be done there in the GC really depends on what codegen want to do so there's no independent description of what's the necessary changes needed in the GC. Similar to other issues, anyone that want to work on that issue should just comment there instead.

A possible implementation of it would just be

diff --git a/src/gc.c b/src/gc.c
index 371919569b..1eb827e50b 100644
--- a/src/gc.c
+++ b/src/gc.c
@@ -1809,8 +1809,16 @@ stack: {
                 else {
                     new_obj = (jl_value_t*)gc_read_stack(&rts[i], offset, lb, ub);
                 }
-                if (!gc_try_setmark(new_obj, &nptr, &tag, &bits))
+                if (gc_ptr_tag(new_obj, 1)) {
+                    new_obj = (jl_value_t*)gc_ptr_clear_tag(new_obj, 1);
+                    nptr = 0;
+                    tag = (uintptr_t)jl_typeof(new_obj);
+                    bits = GC_MARKED;
+                    meta_updated = 1;
+                }
+                else if (!gc_try_setmark(new_obj, &nptr, &tag, &bits)) {
                     continue;
+                }
                 i++;
                 if (i < nr) {
                     // Haven't done with this one yet. Update the content and push it back

immutables with references trivially become as efficient as reference-free immutables

This is basically not true..... I realize this is not actually linked but this is https://github.com/JuliaLang/julia/pull/18632 FWIW, I don't think we actually have a consistent full picture of this matter yet.

fewer segfaults

And not true.

So basically it's only the existance of this issue that might be confusing people. The related issues are not particularly well documented since it usually take someone to actually trying implementing it to see what the issue actually is and what needs to be done which is also why a lot of the linked issues are actually PRs.

Given that @StefanKarpinski agrees to close this elsewhere I'll close this to reduce further confusion.

yuyichao commented 7 years ago

And reading through the issues/prs I linked above again, I realized that the issue of stack allocation in general that I know about is already pretty well documented in https://github.com/JuliaLang/julia/pull/18632, which also have a reference from the easier case at https://github.com/JuliaLang/julia/issues/6080#issuecomment-268859420 . Anyone that want to help with stack allocation should read that / comment there.

As for conservative scanning itself, I don't think we want to do that unless there's no other way to do something we want. https://github.com/JuliaLang/julia/issues/11714#issuecomment-301764461 . As an alternative GC, I don't think it's practical to have that unless someone want to co-maintain and regularly test the non-shared part.

Sacha0 commented 7 years ago

@rbehrends, another welcome and thank you for delving into internals! :)

JeffBezanson commented 7 years ago

@rbehrends Thanks for working on this. Glancing through your patch, it seems to be very high quality. This should be useful for comparing approaches to improving the GC. There are a lot of unknowns here, so having actual implementations will help a lot in moving things forward.

Keno commented 7 years ago

First of all thanks @rbehrends for posting that patch. It does seem like a very good patch (plus it has documentation!). However, I do agree with @yuyichao that conservative stack scanning is not a good way forward. I do think we should be able to do whatever we need with precise rooting, especially after #21888. Frankly, I'm not entirely convinced that it is possible to implement conservative stack scanning 100% correctly in the presence of arbitrary llvm optimizations (e.g. it might only store a derived pointer, or it may reuse some bits of a pointer, etc.).

JeffBezanson commented 7 years ago

It looks like the issues @rbehrends is raising here are

Performance --- should be addressed by #21888, and could be further addressed by stack maps.
Interop --- this seems to be the key one. Is there anything that can be done other than conservative scanning or manual JL_GC_PUSH everywhere?
Marking foreign objects --- has never been discussed AFAIK. Please feel free to open an issue.

JuliaLang / julia

conservative stack scanning? #11714

Pros:

Cons: