ballerina-platform / nballerina

Ballerina compiler that generates native executables.
https://ballerina.io/
Apache License 2.0
139 stars 46 forks source link

Identify GC roots on the stack #165

Open jclark opened 3 years ago

jclark commented 3 years ago

Based on #149 and #164, we need to build the capability to identify at runtime the GC roots on the stack.

Concretely, we want to add a C file to the runtime directory that provides a markRoots function that can be used by our garbage collector:

typedef char *Root;
typedef void (*AddRootFunctionPtr)(Root *root, void *arg);
void markRoots(AddRootFunctionPtr func, void *arg);

Not sure if this is the best API. We should probably look MMTK and figure out something that is a good fit.

KavinduZoysa commented 3 years ago

@jclark, when we are adding the statepoint intrinsic for call/invoke instructions to track the GC roots in StackMap section, what we can do is running the opt command (with --rewrite-statepoints-for-gc flag). But in gollvm, they have written a pass called go-statepoints which is same as rewrite-statepoints-for-gc in llvm.

What can we do for our case?

jclark commented 3 years ago

I don’t understand the question. Why can we not use opt —rewrite-statepoints-for-gc for now? I guess eventually we will have to add our own pass. This is what you need to figure out…

KavinduZoysa commented 3 years ago

When I was testing the statepoints, I also used opt —rewrite-statepoints-for-gc. When I am looking at the gollvm source code I saw they are using a different pass for that. That is why I need to ask whether we use the command-line option or a different pass in our initial implementation. Now I understood.

jclark commented 3 years ago

How are we going to deal with derived pointers?

KavinduZoysa commented 3 years ago

If we define a derived pointer and we use that pointer after the statepoint, RS4GC pass(--rewrite-statepoints-for-gc) identifies that derived pointer and rematerialize. Please check the following example.

Before run RS4GC:

define void @dummy_func() {
  ret void
}

define i8 addrspace(1)* @test1(i8 addrspace(1)* %obj) gc "statepoint-example" {
  %gep = getelementptr i8, i8 addrspace(1)* %obj, i64 20000
  call void @dummy_func() 
  %p = getelementptr i8, i8 addrspace(1)* %gep, i64 -20000
  ret i8 addrspace(1)* %p
}

define dso_local void @main() #0 {
  %1 = alloca i8*, align 8
  %2 = call noalias i8 addrspace(1)* @malloc(i64 8) #3
  %3 = call i8 addrspace(1)* @test1(i8 addrspace(1)* %2)
  ret void
}

declare dso_local noalias i8 addrspace(1)* @malloc(i64) #1

After running RS4GC (changed method is pasted):

define i8 addrspace(1)* @test1(i8 addrspace(1)* %obj) gc "statepoint-example" {
  %gep = getelementptr i8, i8 addrspace(1)* %obj, i64 20000
  %statepoint_token = call token (i64, i32, void ()*, i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_isVoidf(i64 2882400000, i32 0, void ()* @dummy_func, i32 0, i32 0, i32 0, i32 0) [ "gc-live"(i8 addrspace(1)* %obj) ]
  %obj.relocated = call coldcc i8 addrspace(1)* @llvm.experimental.gc.relocate.p1i8(token %statepoint_token, i32 0, i32 0) ; (%obj, %obj)
  %gep.remat = getelementptr i8, i8 addrspace(1)* %obj.relocated, i64 20000
  %p = getelementptr i8, i8 addrspace(1)* %gep.remat, i64 -20000
  ret i8 addrspace(1)* %p
}
jclark commented 3 years ago

Oh I see. That's clever. We need to make sure this works with our tagged pointers. Can you check that it works with the ptrmask intrinsic?

Is there a way to control which functions get treated as safepoints and rewritten via RS4GC? I think we will want to not treat some of the calls to langlib functions as safepoints.

This relates to another problem. How to deal with GC roots in the stack frames from runtime functions written in C?

KavinduZoysa commented 3 years ago

Oh I see. That's clever. We need to make sure this works with our tagged pointers. Can you check that it works with the ptrmask intrinsic?

RS4GC identifies the output of ptrmask intrinsic as a gc-live pointer and it is added to the gc.statepoint intrinsic. Please check this LLVM IR, generated via RS4GC for the below example.

import ballerina/io;
public function main() {
    any[] v = [1234567890123456789];
    io:println(v[0]); 
}

Is there a way to control which functions get treated as safepoints and rewritten via RS4GC? I think we will want to not treat some of the calls to langlib functions as safepoints.

Yes, we can do it by using "gc-leaf-function". Ex: declare i8 addrspace(1)* @_bal_int_to_tagged(i64) "gc-leaf-function" ; _bal_int_to_tagged is not a statepoint

This relates to another problem. How to deal with GC roots in the stack frames from runtime functions written in C?

I also need the think about this further. But the current approach I came up is explained below. The output pointer value of the runtime function is a gc-live pointer. So it is available in gc.statepoint intrinsic. When we are marking the roots, we know the value of that pointer and using that we can find the actual root.

For example let's say we call _bal_mapping_construct and we get a tagged pointer. So this is a gc-live pointer but it does not represent an actual heap allocation. But we can remove the tag and mark it as a root. (i.e. ((uint64_t)p0 ^ (uint64_t)TAG_MAPPING_RW << TAG_SHIFT))

jclark commented 3 years ago

For example let's say we call _bal_mapping_construct and we get a tagged pointer. So this is a gc-live pointer but it does not represent an actual heap allocation. But we can remove the tag and mark it as a root. (i.e. ((uint64_t)p0 ^ (uint64_t)TAG_MAPPING_RW << TAG_SHIFT))

This is the same as with stack frames from our compiler: address space 1 pointers may or may not have a tag. I don't think that's a problem.

The problem I am concerned about is something like _bal_mapping_set. Suppose it needs to grow the array (calling _bal_array_grow). Suppose this triggers a garbage collection. This may cause the heap objects referenced by the key and value parameters to be moved. Now when _bal_mapping_set accesses its key and value parameters after _bal_array_grow, it will be accessing the wrong memory locations.

jclark commented 3 years ago

A simpler case is _Barray_push.

KavinduZoysa commented 3 years ago

https://github.com/KavinduZoysa/test-GCs/tree/find-gc-roots, Please find the progress of the issue in this repo

jclark commented 3 years ago

Two alternative strategies for https://github.com/ballerina-platform/nballerina/issues/165#issuecomment-884145261

  1. the C code does something explicit that makes the GC collector aware of the variables that need updating after a GC e.g. calls a function add_gc_root(context, &ptr)
  2. take the .ll output by clang and massage it (e.g. by adding gc "strategy") to look like the code that we generate from .bal
jclark commented 3 years ago

In any case, I think we will want to distinguish between our C runtime functions that can trigger (or allow) a garbage collection and those that don't. The former are the ones we need to worry about. The latter can be declared as "gc-leaf-function" and don't need any special treatment. Functions that can themselves call _bal_panic need to be in the former category.

jclark commented 3 years ago

I should say that I definitely prefer strategy 2 if we can make it work, since it makes the C code consistent with the code we generate, and should be more reliable.

KavinduZoysa commented 3 years ago

I should say that I definitely prefer strategy 2 if we can make it work, since it makes the C code consistent with the code we generate, and should be more reliable.

I have started to develop a working example. Please find the progress here. https://github.com/KavinduZoysa/test-GCs/tree/find-gc-roots/solution2

KavinduZoysa commented 3 years ago

https://github.com/KavinduZoysa/test-GCs/tree/79d28f9836c0ed0feb9e4bf864ba1217a636b06b

Here I have tested the identifying roots for a runtime function according to the 2nd approach of https://github.com/ballerina-platform/nballerina/issues/165#issuecomment-884854477. These are the steps I followed.

  1. Generate LLVM IR for runtime C source using clang.
  2. Mark gc statepoint on function signature.
  3. Generate stack maps (We can see two stack maps in the final assembly file because we have two LLVM modules now. The first one is the LLVM IR generated from ballerina and the second one is LLVM IR generated for runtime code)
  4. Change the runtime code to combine two stack maps into one defined table.
  5. Read the roots of the runtime function.
jclark commented 3 years ago

A few comments:

KavinduZoysa commented 3 years ago
  • With approach 2, clang is processing address space 1 without knowing that it is non-integral. I am concerned that this may mean that clang does optimizations that will make the stack map not correct (this is why we have non-integral address spaces). If we turn off optimization, then it gives functions a noinline attribute. Maybe we should turn off optimization but remove or otherwise override this attribute.

+1

  • Shouldn't we be combining stack maps at link time?

If we check the object dump of final executable we can se there are two stack maps.

Yes, I am using that to generate the hash table and lookup frame info for the given return address https://github.com/KavinduZoysa/test-GCs/tree/find-gc-roots/solution2/runtime/gc. Also, I am using the same approach to get the rsp https://github.com/KavinduZoysa/test-GCs/blob/find-gc-roots/solution2/runtime/shim.s.

jclark commented 3 years ago

If we check the object dump of final executable we can se there are two stack maps.

Right. So we would need to do extra work at link time to combine them. Can you look at how Julia does this?

KavinduZoysa commented 3 years ago

Right. So we would need to do extra work at link time to combine them. Can you look at how Julia does this?

Sure, I will work on that.

KavinduZoysa commented 3 years ago

Julia's GC strategy is similar to shadow stack and it does not use LLVM's GC framework, which means they do not use LLVM stackmaps to identify GC roots. Therefore we cannot gather information related to LLVM stackmaps from Julia.

The other thing is, how does Julia handle the roots in C functions. Form C side it calls method JL_GC_PUSHx() to track the roots.https://docs.julialang.org/en/v1/manual/embedding/#Memory-Management

jclark commented 3 years ago

Interesting though that Julia is using non-integral pointers in LLVM code, plus something explicit in C code. Does Julia have a moving GC? I guess not.

The shadow stack thing is interesting: I wonder if that might be a good way to get started with GC.

KavinduZoysa commented 3 years ago

Interesting though that Julia is using non-integral pointers in LLVM code, plus something explicit in C code. Does Julia have a moving GC? I guess not.

They are using custom address spaces(10 - GC tracked pointers, 11 - Derived pointers, etc) and non integral address spaces to keep the GC related information safe, even after the program passes through optimizations(This point is explained here). They have a GC root placement LLVM pass at the end of their pass pipeline. In that pass, they are creating struct call gc frame to hold in roots informations. In C side JL_GC_PUSHx() method is doing the same thing (create gc frame) https://groups.google.com/g/julia-users/c/7yW0zqiTwc4.

Yes, it is a non-relocating GC.

The shadow stack thing is interesting: I wonder if that might be a good way to get started with GC.

AFAIU, shadow stack approach and statepoint approach we are using currenlty, are completely different approaches. So in the future, if we need to move statepoint approach again we have to spend some considerable time on that. But in the document, it is mentioned that, shadow stack is an easy way to get started.

jclark commented 3 years ago

Please have a look at #300, to see how this fits in, in particular point 7.

I think we've done enough exploring. It's time to write the code to resolve this issue. I suggest you start a new branch "gc". Can you make statepoints work for both C and LLVM-generated code? If so, we should do that. When you've done this, then go on to #300. We will need some test programs that do a lot of allocation and deallocation. This will determine whether it's all working. I suggest you focus on getting it working reliably first (including #300). After that, we can focus on getting the code quality to where it needs to be. Then we can merge the branch in.

KavinduZoysa commented 3 years ago

I will go to #300.

KavinduZoysa commented 3 years ago

@jclark, when enabling GC for runtime functions, first I selected list.c and I am trying to run the following commands.

    clang-11 -O2 -S -emit-llvm -o list.ll list.c
    sed -i -e '/define /s/ {/ gc "statepoint-example" {/' -e '/target datalayout/ s/"$$/-ni:1"/' list.ll
    opt-11 -S -O2 --rewrite-statepoints-for-gc -o list_opt.ll list.ll
    llc-11 -O2 --filetype=obj -o list.o list_opt.ll

But opt command crashes and it is unable to generate statepoints. I ran opt-11 -S -O2 -o list_opt.ll list.ll and output LLVM IR is attached here.

I went through the llvm IR and I found that compiler crashes at L60-L61. I am working on finding the root cause to fix this issue.

jclark commented 3 years ago

I thought the plan was to use -O0 with -emit-llvm and then edit out the noinline attribute.

Also we should mark functions as leaves where appropriate.

If we still get LLVM crashes, then we should try an alternative approach. At this stage, we do not want to get into debugging LLVM.

KavinduZoysa commented 3 years ago

I thought the plan was to use -O0 with -emit-llvm and then edit out the noinline attribute.

I tried this and LLVM has crashed again. Then I marked noinline for taggedToPtr. Then it works without crashing LLVM. The steps I followed in breif:

  1. Run clang-11 -S -emit-llvm -o list.ll list.c
  2. Remove noinline in all functions except taggedToPtr
  3. Run sed -i -e '/define /s/ {/ gc "statepoint-example" {/' -e '/target datalayout/ s/"$$/-ni:1"/' list.ll
  4. opt-11 -S -O2 --rewrite-statepoints-for-gc -o list_opt.ll list.ll
  5. llc-11 -O2 --filetype=obj -o list.o list_opt.ll

Also we should mark functions as leaves where appropriate.

Yes, I will add

jclark commented 3 years ago

taggedToPtr should really be done with the ptrmask intrinsic. See #201.

So really what we should do is:

So please fix #201 and submit a pull request to main.