JuliaSIMD / Polyester.jl

The cheapest threads you can find!
MIT License
240 stars 13 forks source link

Invalid GC preserve #145

Closed wsmoses closed 4 months ago

wsmoses commented 4 months ago

cc @vchuravy @ChrisRackauckas @avik-pal

Polyester emits invalid GC info -- potentially leading to very serious problems downstream. In enzyme it results in a crash, and I can potentially imagine an infinite memory leak or segmentation fault occuring as a result.

using LuxLib, Polyester

y = randn(Float32, 10, 10)
b = randn(Float32, 10)
act = x -> max(x, 0)

function __apply_bias_activation!!(σ::F, x, bias::Union{Nothing, AbstractArray}) where {F}
    f_fused = σ ∘ +
    if maximum(length, (x, bias)) > 100_000
        bc = Broadcast.instantiate(Broadcast.broadcasted(f_fused, x, bias))
        @batch for I in eachindex(bc)
            @inbounds x[I] = bc[I]
        end
    else
        @. x = f_fused(x, bias)
    end
    return x
    # return LuxLib.__nonuniform_fast_broadcast!(σ ∘ +, x, bias)
end

function loss_function(act, y, b)
    return sum(__apply_bias_activation!!(act, y, b))
end

@code_llvm optimize=false raw=true loss_function(act, y, b)

You will see that there is a gc preserve begin

        %380 = call token (...) @llvm.julia.gc_preserve_begin({} addrspace(10)* %379, [2 x {} addrspace(10)*] %357), !dbg !473

which is never used and has no gc_preserve_end

polyester.ll.txt

chriselrod commented 4 months ago

What does an invalid gc_preserve look like at the AST level? https://github.com/JuliaSIMD/Polyester.jl/blob/edb49726e612cfba3af5937b5e0cea041760a2d3/src/batch.jl#L266-L278 The structure looks simple:

julia> @macroexpand GC.@preserve a b c begin
           foo(a,b,c)
       end
:($(Expr(:gc_preserve, quote
    #= REPL[1]:2 =#
    foo(a, b, c)
end, :a, :b, :c)))

julia> dump(ans)
Expr
  head: Symbol gc_preserve
  args: Array{Any}((4,))
    1: Expr
      head: Symbol block
      args: Array{Any}((2,))
        1: LineNumberNode
          line: Int64 2
          file: Symbol REPL[1]
        2: Expr
          head: Symbol call
          args: Array{Any}((4,))
            1: Symbol foo
            2: Symbol a
            3: Symbol b
            4: Symbol c
    2: Symbol a
    3: Symbol b
    4: Symbol c

All it does is produce Expr(:gc_preserve, block, preserved_symbols...), which looks like what the macro GC.@preserve does. It produces the expression manually to support vararg arguments, and to avoid heap allocations that Julia seemed to occasionally incur from constructing a tuple and preserving that instead. We don't really need/want to create an explicit tuple stack object anyway, if we don't have to.

If the produced AST is valid, then it is a Julia compiler problem. Julia has been known to miscompile GC.@preserve.

Polyester itself does not do anything lower level than building Julia ASTs for macros and @generated functions. It certainly doesn't do LLVM code generation of place tokens. The AST it emits is quite explicit about what the scope of the preserve should be, and thus where the gc_preserve_begin and gc_preserve_end should go.

chriselrod commented 4 months ago

The expression Polyester creates does end with a return, making this issue look like https://github.com/JuliaLang/julia/issues/45901

chriselrod commented 4 months ago

Which version of Julia are you on? I tried 1.10.3, 1.11-beta1, and master. In none did I see a @llvm.julia.gc_preserve_begin token. It's probably not JuliaLang/julia#45901, but I removed the useless return statement anyway in https://github.com/JuliaSIMD/Polyester.jl/pull/146 I did not notice any change in generated IR. There shouldn't be any change, so that's to be expected. But that means this probably isn't a workaround, even though it did work around the previous issue.

chriselrod commented 4 months ago

I just checked using Cthulhu.

146 fixes this issue on Julia 1.10.3.

Polyester's latest release:

L740:                                             ; preds = %L704
  br label %L741

L741:                                             ; preds = %L740, %L702
  %541 = call nonnull {}* @julia.pointer_from_objref({}* inttoptr (i64 140059852555104 to {}*)) #6
  %bitcast_coercion258 = ptrtoint {}* %541 to i64
  %542 = call i64 @"japi1___apply_bias_activation!!_1857u1870"(i64 %bitcast_coercion258, i64 %value_phi164)
  br label %L746

vs #146:

L740:                                             ; preds = %L704
  br label %L741

L741:                                             ; preds = %L740, %L702
  %541 = call nonnull {}* @julia.pointer_from_objref({}* inttoptr (i64 140322931884896 to {}*)) #6
  %bitcast_coercion258 = ptrtoint {}* %541 to i64
  %542 = call i64 @"japi1___apply_bias_activation!!_1627u1640"(i64 %bitcast_coercion258, i64 %value_phi164)
  call void @llvm.julia.gc_preserve_end(token %334)
  br label %L747

Both have

  %334 = call token (...) @llvm.julia.gc_preserve_begin({}* %15, [2 x {}*] %314)
  br label %L350

I am closing this here as we have now worked around the issue, but please open a Julia issue because this was never a Polyester bug. It is a Julia bug.