JuliaLang / Juleps

Julia Enhancement Proposals
Other
67 stars 24 forks source link

Julep to enhance GC interface for foreign code. #50

Closed rbehrends closed 6 years ago

rbehrends commented 6 years ago

The support for modules written entirely or partly in foreign languages to interface with the Julia GC or to use the GC for allocations that do not neatly fit Julia's type system or use low-level approaches not available in Julia (such as irregular data structure layouts) is currently somewhat limited.

This proposal aims at fleshing out the API for allowing more complex interaction of foreign code with the GC, especially the use of long-lived foreign objects that are inextricably interwoven with Julia objects.

Keno commented 6 years ago

I haven't read this through in detail, but presumably you need a mechanism to record the current state of all big objects at the moment you turn on conservative stack scanning to make sure you don't miss any objects in your binary tree.

Another thing to watch out for is what happens when whole pages get dropped at a time. In particular you need to 1) Make sure not to consider objects in dropped pages 2) Perhaps re-zero a page that's put back into service after being dropped to make sure you don't have stale data that could get fed into the GC.

The compiler writer in me also continues to be concerned about the proposition that conservative stack scanning is sound. There's quite a bit of fire you're playing with there, especially as compilers continue to improve.

yuyichao commented 6 years ago

The compiler writer in me also continues to be concerned about the proposition that conservative stack scanning is sound. There's quite a bit of fire you're playing with there, especially as compilers continue to improve.

That's why I'm only fine with this since the conservative scan is all external.

In particular you need to

There are already code in gc-debug that does this. I didn't check carefully if it's used/if it's used correctly since that's also completely a user problem.

rbehrends commented 6 years ago

We do indeed need to record bigval allocations; this is what the notify_external_alloc and notify_external_free hooks are for.

I originally believed that we could get away without zeroing pages by having sufficiently exhaustive validity checks for object data, but your comment made me revisit that assumption, and you are right. While pages that are currently dropped already aren't considered, the code did not properly account for pages being reused.

I have added the necessary code to fix that oversight. Because that introduced some minor, but non-zero overhead, I made zeroing contingent on a flag (enabled by jl_gc_enable_conservative_scanning()) in order to avoid that overhead if conservative scanning features are not needed.

I am aware of how optimizing compilers can create issues for conservative garbage collectors. This is discussed in some detail in Boehm, H., and D. Chase, "A Proposal for Garbage-Collector-Safe C Compilation", Journal of C Language Translation 4, 2 (December 1992), pp. 126-141, for example. There are also other concerns, such as integers being falsely interpreted as spurious stack roots on 32-bit systems with large heaps.

The assumption that we make here is that conservative stack scanning is rarely a desired choice, but something that is typically forced upon us, often by the use of or need to interoperate with legacy or third-party code or by overriding engineering concerns. It's something that we have to live it, which is why the proposal only has it as an opt-in mechanism on an as-needed basis rather than a default assumption. So, yes, this is not an ideal solution, but rather a (sometimes) necessary one.

For example, one use case that we specifically require it for is interoperability with GAP, which assumes conservative stack scanning throughout, including many packages authored by external developers.

We do assume that compilers will not introduce changes (such as pointer masking) that would break the assumptions underlying conservative stack scanning. The reason is that too much existing code relies on conservative garbage collection (including, but not limited to any software that uses the Boehm GC) and such changes would break a lot of existing software. Therefore there exists a very strong disincentive for the major compiler vendors to introduce such changes (at least not without offering workarounds).

Keno commented 6 years ago

We do indeed need to record bigval allocations; this is what the notify_external_alloc and notify_external_free hooks are for.

That's not quite what I meant. I mean that you need some mechanism of retrieving all such allocations that were created prior to installation of these hooks.

Keno commented 6 years ago

Forgot to mention. You'll also need a mechanism to scan all threads' registers (including x/y/zmm).

Keno commented 6 years ago

We do assume that compilers will not introduce changes (such as pointer masking) that would break the assumptions underlying conservative stack scanning. The reason is that too much existing code relies on conservative garbage collection (including, but not limited to any software that uses the Boehm GC) and such changes would break a lot of existing software. Therefore there exists a very strong disincentive for the major compiler vendors to introduce such changes (at least not without offering workarounds).

I'm gonna stop talking about this, since I doubt I'll be able to change your mind, but I think you're underestimating what compiler writers are willing to do here. In any case, the only point I'm making here is that I continue to encourage you think think about ways to regain precision with your gap integration, even once you get this working.

rbehrends commented 6 years ago

That's not quite what I meant. I mean that you need some mechanism of retrieving all such allocations that were created prior to installation of these hooks.

Oh, I see. Keep in mind that for conservative scanning we only need to scan foreign stack frames and we do it because we do not know their layout. But we can control what types of objects go into them. So, for the common use case that a foreign module needs to be able to find the objects that it creates within its stack frames, you can simply start tracking those after module initialization. Objects passed in from a Julia call should still be handled via JL_GC_PUSH1() and friends (or stored permanently inside other objects or roots).

If, for some reason, you want or need to track all Julia objects, then you do indeed have to set the callbacks before initialization. The alternative would be to be able to traverse all the existing bigval lists (one per thread), which would be somewhat invasive if it's to be done in a thread-safe fashion, and I'm not sure if it's worth it.

That said, if people think that there's a good use case or a need for it, I'd definitely be happy to add such a feature.

Forgot to mention. You'll also need a mechanism to scan all threads' registers (including x/y/zmm).

Threads are tricky, but that's more because finding the end of the stack efficiently is hard, as Julia uses sigaltstack() and signals for its GC safepoint mechanism. I don't have the time tonight to do a full writeup for that (will do that later), but it basically comes down to writing some low-level code (some of it possibly architecture specific). I do not think at this point that much of this can really be alleviated much from the Julia side of things.

fingolfin commented 6 years ago

@Keno there really is no way (nor any need) to "convince" us regarding conservative stack scanning -- not because we prefer it (we are agnostic about this), or we don't want to try to be more precise (again: completely neutral, willing to try whatever works) -- but simply because it's a factual reality for us: the GAP code base is there, and in active use, we can't just drop it. It is also about as old as Python, and GAP 4 (the current major version) was started in 1996. Its C kernel consists of >100k SLOC of C code (plus a bunch of kernel extensions written by 3rd parties that need to stay compatible) -- perhaps not that big in some people's eyes, but certainly big enough so that "let's just rewrite it to use precise GC" is not something we can pull of easily. Esp. as we have to do this from inside academia, with little to no funding for thi.

As such, conservative scanning is not something we "want", or are particular in love with, it's just something that is there, in a decades old code base, deployed on dozens of different architectures and compilers in the past 30 years, with great success, and so far no issues with our GC anywhere. And while of course tomorrow a compiler might be released which breaks it -- and we'll have to deal with that when/if it happens (but since older compiler versions will still be around, we'd still have a couple years for it, luckily). But still, that's easier said than done.

So we are in no rush to rewrite our battle tested conservative GC, or rewrite our whole code base from scratch (which we can't afford money wise, nor in terms of the countless bugs a full rewrite always introduces)... But we do think about this, and have a rough long-term plan to switch to a new GC... Like, e.g., the Julia GC... hint hint.

So, IMHO the better view point is this: Integrating GAP with the Julia GC actually could open up a way to slowly and incrementally migrate the GAP code base to not rely on a conservative scanner. So, you don't need to convince us about the pros of precise scanning vs. conservative, or about the dangers of crazy compiler writers, or any of that; just trust that we know what we are doing, just like we trust that you do, and consider it as you guys helping us to have a chance to one day bask in the glorious light provided by your wondrous precise garbage collector ;-).

In a sense, compilers breaking our conservative scanning would even help us (the people working on integrating the Julia GC with GAP), by making it way easier to market our effort to more GAP devs, many of whom remain sceptical about it (like: "will this upstart Julia project be still around in a few years? Should we really tie ourselves to it?" You gotta understand the graybeards, they saw so many other project come and go... Also: we take a 20% performance hit with Julia GC compared to our existing GC, though if/when we ultimately manage to combine multi threaded Julia with multi threaded GAP properly, this likely will change in favor for Julia's GC -- and we also have some ideas for further optimizations... Doing all that is for a later stage of this project, though... baby steps first ;-)

Anyway, all that said: we'd really love it if this proposal and the corresponding pull request were ultimately accepted (of course only after taking all your helpful feedback into account), so that we can eventually steal^H^H^H^H^H benefit from your great work in multiple ways: a shiny precise GC, a shiny fast and powerful language we can use to incrementally migrate parts of our kernel away from C (thus making it automatically "precise"), etc.

Keno commented 6 years ago

Oh, I see. Keep in mind that for conservative scanning we only need to scan foreign stack frames and we do it because we do not know their layout. But we can control what types of objects go into them. So, for the common use case that a foreign module needs to be able to find the objects that it creates within its stack frames, you can simply start tracking those after module initialization. Objects passed in from a Julia call should still be handled via JL_GC_PUSH1() and friends (or stored permanently inside other objects or roots).

Ok, if you're willing to guarantee that no non-GAP julia bigval object will ever end up needing to be conservatively scanned, then tracking those objects is not necessary. That restriction would be a good thing to document in the interface though.

Keno commented 6 years ago

Threads are tricky, but that's more because finding the end of the stack efficiently is hard, as Julia uses sigaltstack() and signals for its GC safepoint mechanism. I don't have the time tonight to do a full writeup for that (will do that later), but it basically comes down to writing some low-level code (some of it possibly architecture specific). I do not think at this point that much of this can really be alleviated much from the Julia side of things.

Well, it at the very least needs to be integrated with the julia safepoint mechansim such that you get the register contents.

Keno commented 6 years ago

Oh, @vtjnash and I were both concerned about write barriers, but I see you have them in the code. Would be good to document that requirement also.

Keno commented 6 years ago

@vtjnash points out that by keeping the appropriate entry in our finalizer list, it is eligible for early finalization (using the finalize function), which may not be what you want.

vtjnash commented 6 years ago

I think this is sounding good from our end.

One alternative approach could be to avoid finalizers and the jl_gc_cb_external_free_t callback altogether, and instead provide a hook that runs after marking, before sweeping. For example, I am using this type of hook to provide management of an external pool of stacks in one of my PRs: Doing the sweep at https://github.com/JuliaLang/julia/pull/13099/files#diff-6c07ad09c7451fb0b61b494536691bfaR148, and called from https://github.com/JuliaLang/julia/pull/13099/files#diff-31b93f4272eaeede3738531e90317e0cR2555. I don't think I actually have a preference either direction.

rbehrends commented 6 years ago

I have revised the proposal and its implementation in line with the suggestions made in this thread.

Major changes:

Commits have been refactored to separate changes according to their purpose.

The callback functions now avoid using macros. In order to not pollute the namespace unnecessarily, both adding and removing a callback is done using the same function, controlled by a flag.

C finalizers were reimplemented using the different technique suggested above by @vtjnash, avoiding some of the problems with the original version. We now call them sweep functions to make it clearer that they are more limited in nature than Julia finalizers and are meant for C-level resource and memory cleanup (fully featured finalizers can, after all, still be implemented in Julia and call out to C as needed).

Thread-local data was added to the end of Julia thread-local storage rather than in global arrays indexed by the thread id.

One can now also mark an object array in order to avoid mark stack explosion for custom mark functions of large objects.

Various functions were renamed to document their purpose more clearly.

Notes about thread-safety were added.

The text now stresses more clearly that the conservative scanning facilities are meant as a feature of last resort and should not be used indiscriminately.

Bug fixes:

Enabling the conservative scanning facilities is done using a separate function, which can be called both before or after jl_init(), so that the features for identifying objects in dropped and reused pages function correctly. We document that this may incur minor overhead (depending on how much the size of the heap fluctuates, but typically no more than 2% in benchmarks for heaps with regular fluctuations). This overhead is not incurred unless the feature is enabled.

StefanKarpinski commented 6 years ago

Shall we merge this Julep then? If @rbehrends feels that it's ready and @yuyichao, @vtjnash and @Keno agree.

rbehrends commented 6 years ago

From my side, it looks ready. Obviously, I don't know if you folks are happy with the interface, but I think I've at least addressed all concerns mentioned so far.

On the implementation side, everything looks fine to me, too, though I plan on tidying up the code a bit more and to rebase it onto a more recent version over the next few days.

fingolfin commented 6 years ago

Assuming this Julep gets accepted, what are the next steps? We open a PR on https://github.com/JuliaLang/julia/ with our implementation? Which branch should that target, master? (We are kinda hoping that this might get into Julia 1.1, but of course that depends on many factors, so I am not asking for anything here, just expressing a vague hope ;-)

vtjnash commented 6 years ago

We open a PR on https://github.com/JuliaLang/julia/ with our implementation? Which branch should that target, master? We are kinda hoping that this might get into Julia 1.1.

Yes, correct, and yes. The Julep sounds very straightforward – I don't foresee any problems with this getting merged fairly quickly.