GC.gc() unable to free memory in the scope it is called from?

Tortar commented 11 months ago

I opened some days ago this issue here: https://github.com/apache/arrow-julia/issues/492. But after experimenting quite a bit I think this is a GC issue.

MWE (which I runned in REPL mode) similar as the one reported here https://discourse.julialang.org/t/how-to-force-an-object-to-be-freed-by-the-garbage-collector/77985:

function testme()
    X = rand(1_000_000_00)
    Y = sum(X)
    X = nothing
    GC.gc(); GC.gc(); GC.gc()
    return Y
end

function tester()
    Y = testme()
    return Y
end

tester()
# GC.gc() here frees it

the memory occupied by X is not freed, if I put after the call to tester a GC call instead it is.

Version Info:

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 12 virtual cores

nsajko commented 11 months ago

Your examples rely on user packages, so I don't see why would you think this is a Julia bug. Apart from that, it's not even clear what behavior you consider buggy.

Are you aware that languages that use garbage collection (except simple forms like reference counting, like with C++, for example) don't provide guarantees about when some pointer will be freed, or when the finalizer of some object will be ran. The precise moment is an implementation detail. Wikipedia, furthermore, says:

Notably, both Java and Python do not guarantee that finalizers will ever be called, and thus they cannot be relied on for cleanup.

Tortar commented 11 months ago

I changed it to one not relying on user packages. Thanks for the info about the garbage collection in other languages, I had the impression that a GC.gc() call (or multiple ones) would have guaranteed to free the unused memory. This could be nonetheless useful sometimes, but I can't judge how much of a hard task is it to make this an option, probably a lot

Tortar commented 11 months ago

Then given that it is probably like so for good reasons, maybe changing the docstring of GC.gc() could be a good idea? For now it is:

  Perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps
  all objects, which makes the next GC scan much slower, while an incremental collection may only sweep so-called
  young objects.

maybe saying that it is a suggestion is better?

  Suggest to perform garbage collection. The argument full determines the kind of collection: A full collection (default) sweeps
  all objects, which makes the next GC scan much slower, while an incremental collection may only sweep so-called
  young objects.

PallHaraldsson commented 10 months ago

See also my long reply on discourse on using Bumper.jl that WILL guarantee deallocation. GC languages in general can't do it, but in exception cases it could be done, with the help of the compiler, as I explained. I think Julia should actually try to to that, or even guarantee in all possible cases, as Mojo does.

vchuravy commented 10 months ago

Julia uses a precise GC together with a GC shadow stack.

julia> function testme()
           X = @noinline rand(1_000_000_00)
           Y = @noinline sum(X)
           X = nothing
           GC.gc()
           return Y
       end

Let's take a look at the IR generated:

julia> @code_llvm  testme()
;  @ REPL[7]:1 within `testme`
define double @julia_testme_190() #0 {
top:
  %gcframe2 = alloca [3 x {}*], align 16
  %gcframe2.sub = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 0
  %0 = bitcast [3 x {}*]* %gcframe2 to i8*
  call void @llvm.memset.p0i8.i64(i8* align 16 %0, i8 0, i64 24, i1 true)
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #8
  %tls_ppgcstack = getelementptr i8, i8* %thread_ptr, i64 -8
  %1 = bitcast i8* %tls_ppgcstack to {}****
  %tls_pgcstack = load {}***, {}**** %1, align 8
;  @ REPL[7]:2 within `testme`
  %2 = bitcast [3 x {}*]* %gcframe2 to i64*
  store i64 4, i64* %2, align 16
  %3 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 1
  %4 = bitcast {}** %3 to {}***
  %5 = load {}**, {}*** %tls_pgcstack, align 8
  store {}** %5, {}*** %4, align 8
  %6 = bitcast {}*** %tls_pgcstack to {}***
  store {}** %gcframe2.sub, {}*** %6, align 8
  %7 = call nonnull {}* @j_rand_192(i64 signext 100000000)
  %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  store {}* %7, {}** %8, align 16
;  @ REPL[7]:3 within `testme`
  %9 = call double @j_sum_193({}* nonnull %7)
;  @ REPL[7]:5 within `testme`
; ┌ @ gcutils.jl:129 within `gc` @ gcutils.jl:129
   call void inttoptr (i64 140443355088864 to void (i32)*)(i32 1)
   %10 = load {}*, {}** %3, align 8
   %11 = bitcast {}*** %tls_pgcstack to {}**
   store {}* %10, {}** %11, align 8
; └
;  @ REPL[7]:6 within `testme`
  ret double %9
}

  %gcframe2 = alloca [3 x {}*], align 16

Is the creation of the shadow stack. Simplifying a bit every allocated object inside a function get's assinged a slot in this gcframe and GC scans the data structure (which forms a linked list) to find objects that are used.

  %7 = call nonnull {}* @j_rand_192(i64 signext 100000000)
  %8 = getelementptr inbounds [3 x {}*], [3 x {}*]* %gcframe2, i64 0, i64 2
  store {}* %7, {}** %8, align 16

Now this is the call to rand and store of X to the gcframe, so that it is rooted for the next step the call to sum.

  %9 = call double @j_sum_193({}* nonnull %7)

Now immediately after we call GC.gc

 ┌ @ gcutils.jl:129 within `gc` @ gcutils.jl:129
   call void inttoptr (i64 140443355088864 to void (i32)*)(i32 1)

But X is still rooted on the gcstack. Once we exit the function we pop the gcframe from the pgcstack and X can be freed.

There might be an open issue flying around, but if I recall correctly we are currently missing the step where we zero out the GC slot when the lifetime of the object ends, since that is often not necessary to do (e.g. if you exit the function the memory will be freed).

Probably the place to change is around https://github.com/JuliaLang/julia/blob/3c4bccd93b2b7d9954ebd3ab43268a3e13185775/src/llvm-late-gc-lowering.cpp#L2620

PallHaraldsson commented 10 months ago

I think you're saying GC is called. But is GC guaranteed to free large objects, even for full (the default for GC.gc()? I mean is it promised or a known limitation. Should is be promised, at least for full GC? This can be avoided with Bumper.jl, i.e. it eliminates the problem as opt-in. Julia should maybe add Bumber.jl as stdlib, similar to Mojo. I.e. a compile time guarantee. But until there, there is a large and small distinction for objects in the GC, right? Just pointing out in case it explains the (perceived) problem.

vchuravy commented 10 months ago

I see no reason why Bumper is part of this discussion?

Julia's GC is precise and generational.

A full GC will sweep all "garbage" objects.
An incremental GC will sweep all "garbage" & young objects.

Now the challenge is what is "garbage". Garbage is all objects that are detached from the object-graph and are not in one of the root-sets.

Due to Julia being precise the root-sets include "all objects that are used within a function".

So when we see behavior that surprises us like the original post the question we need to answer: Why is this object live? It isn't referenced anywhere so the only remaining bit is that it must be part of root-set.

My hope was to demystify Julia a bit by walking through the reason why X is live until the end of the function, which is a wider lifetime than expected. There is a simple technical reason (e.g. we didn't think it was worth it) and maybe we should revisit that.

So yes the GC is guaranteed to free objects, but it needs to see them as garbage.

PallHaraldsson commented 10 months ago

So when we see behavior that surprises us like the original post the question we need to answer: Why is this object live?

The topic here is why GC.gc() is "unable to free", and as soon as "X = nothing" happens in the code there, X is no longer live, so it's a good question (and should be answered), and yes this should get fixed. But I mean X could be freed right there, with Libc.free implicitly, before the next line, or any other (with allocation) triggering it. The GC without help couldn't. My understanding is that is what Mojo does, why I brought up Bumper. I believe it's similar, just explicit, not implicit (and doesn't use the heap; uses its own stack for allocated objects, I'm not sure Mojo does the same or works for the heap, maybe both).

gbaraldi commented 10 months ago

The GC could free that object, the issue is that the generated code mark that object as dead until the end of the function, so when GC.gc() runs it still thinks the array is live.

elextr commented 10 months ago

@gbaraldi just to check my understanding, you are saying because X is unused after the call to sum() the compiler determines that it can optimise away X = nothing (its not in the IR @vchuravy posted) because there is no use of X after that point.

But of course GC.gc() in a sense "uses" everything, but that information is probably not available to the compiler.

JuliaLang / julia

GC.gc() unable to free memory in the scope it is called from? #51818