EnzymeAD / Enzyme.jl

Julia bindings for the Enzyme automatic differentiator
https://enzyme.mit.edu
MIT License
455 stars 63 forks source link

Erroneous `autodiff` results when multi-threading enabled on various CPU arch #903

Closed luraess closed 1 year ago

luraess commented 1 year ago

Using Enzyme v0.11.1 and main with Julia 1.9.1 on AMD, Intel and M2 CPU gives erroneous results when using autodiff_deferred in Reverse mode in a kernel that has an outer-loop decorated with Threads.@threads and using more than 1 thread. Note that this behavior also occurs when using autodiff.

The below MWE shows that "drift" which leads e.g. an iterative procedure to diverge, sum(B̄_t .- B̄) being ≠ 0. (This may also potentially relate to #611).

using Enzyme

function fun_t!(A, B, a)
    Threads.@threads for iz ∈ axes(A, 2)
        for ix ∈ axes(A, 1)
            A[ix, iz] += a * B[ix, iz] * 100.65
        end
    end
    return
end

function fun!(A, B, a)
    for iz ∈ axes(A, 2)
        for ix ∈ axes(A, 1)
            A[ix, iz] += a * B[ix, iz] * 100.65
        end
    end
    return
end

function ∇_fun!(A, Ā, B, B̄, a)
    Enzyme.autodiff_deferred(Enzyme.Reverse, fun!, Duplicated(A, Ā), Duplicated(B, B̄), Const(a))
    return
end

function ∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
    Enzyme.autodiff_deferred(Enzyme.Reverse, fun_t!, Duplicated(A_t, Ā_t), Duplicated(B_t, B̄_t), Const(a))
    return
end

function main()
    A = rand(101, 200)
    B = rand(101, 200)
    a = 6.5

    A_t = deepcopy(A)
    B_t = deepcopy(B)

    Ā   = ones(size(A))
    B̄   = ones(size(B))
    Ā_t = ones(size(A))
    B̄_t = ones(size(B))

    for _ in 1:1000
        ∇_fun!(A, Ā, B, B̄, a)
        Ā .+= A
        ∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
        Ā_t .+= A_t
    end

    @show sum(Ā_t .- Ā)
    @show sum(B̄_t .- B̄)
    return
end
vchuravy commented 1 year ago

We switch into parallel mode as soon as Julia uses more than one thread https://github.com/EnzymeAD/Enzyme.jl/blob/c348d48706aa118a3b887b2ad1179deb99c23596/src/compiler.jl#L9011

luraess commented 1 year ago

That's cool. However we should now figure out why it's not correct.

luraess commented 1 year ago

Interestingly, running my adjoint solver (not the above MWE) on Enzyme#main now returns following failure when using multi-threading loop decorator:

ERROR: Enzyme execution failed.
Mismatched activity for:   store {} addrspace(10)* %5, {} addrspace(10)** %.fca.0.5.gep, align 8, !dbg !148, !noalias !149 const val: {} addrspace(10)* %5
Type tree: {[-1]:Pointer, [-1,0]:Pointer, [-1,0,-1]:Float@double, [-1,8]:Integer, [-1,9]:Integer, [-1,10]:Integer, [-1,11]:Integer, [-1,12]:Integer, [-1,13]:Integer, [-1,14]:Integer, [-1,15]:Integer, [-1,16]:Integer, [-1,17]:Integer, [-1,18]:Integer, [-1,19]:Integer, [-1,20]:Integer, [-1,21]:Integer, [-1,22]:Integer, [-1,23]:Integer, [-1,24]:Integer, [-1,25]:Integer, [-1,26]:Integer, [-1,27]:Integer, [-1,28]:Integer, [-1,29]:Integer, [-1,30]:Integer, [-1,31]:Integer, [-1,32]:Integer, [-1,33]:Integer, [-1,34]:Integer, [-1,35]:Integer, [-1,36]:Integer, [-1,37]:Integer, [-1,38]:Integer, [-1,39]:Integer}
You may be using a constant variable as temporary storage for active memory (https://enzyme.mit.edu/julia/stable/#Activity-of-temporary-storage). If not, please open an issue, and either rewrite this variable to not be conditionally active or use Enzyme.API.runtimeActivity!(true) as a workaround for now

Stacktrace:
 [1] macro expansion
   @ ./threadingconstructs.jl:168
 [2] residual_fluxes!
   @ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:16
 [3] residual_fluxes!
   @ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:0

Stacktrace:
  [1] throwerr(cstr::Cstring)
    @ Enzyme.Compiler ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:2914
  [2] macro expansion
    @ ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:9513 [inlined]
  [3] enzyme_call
    @ ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:9205 [inlined]
  [4] CombinedAdjointThunk
    @ ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:9168 [inlined]
  [5] autodiff
    @ ~/.julia/packages/Enzyme/js9Aa/src/Enzyme.jl:205 [inlined]
  [6] autodiff
    @ ~/.julia/packages/Enzyme/js9Aa/src/Enzyme.jl:228 [inlined]
  [7] autodiff
    @ ~/.julia/packages/Enzyme/js9Aa/src/Enzyme.jl:214 [inlined]
  [8] ∇_residual_fluxes!
    @ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:58 [inlined]
  [9] main()
    @ Main /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:191
 [10] top-level scope
    @ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:227
wsmoses commented 1 year ago

What happens if you turn on the flag like it asks?

wsmoses commented 1 year ago

@luraess regarding the original issue, can you print the output of the very first iteration for the failing and non-failing cases.

We should be able to see a difference there (hopefully) and thus debug one call.

wsmoses commented 1 year ago

bumping if you've been able to minimize

luraess commented 1 year ago

bumping if you've been able to minimize

Yeah using the above MWE, one needs 2 iterations to see the error appear, which returns

sum(Ā_t .- Ā) = 0.0
sum(B̄_t .- B̄) = 9.650966603658162e-8

for

for _ in 1:2
    ∇_fun!(A, Ā, B, B̄, a)
    Ā .+= A
    ∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
    Ā_t .+= A_t
end

EDIT: this behaviour is shown here running on a single thread, implying buggy execution for the Threads.@threads decorated loop even if using a single thread.

wsmoses commented 1 year ago

It looks like the issue here is that in the matrix the values differ by ~ 10e-10, and summing up all of them hits the 10e-8.

These are very small differences, but still can be looked into.

wsmoses commented 1 year ago
using Enzyme

Enzyme.API.printall!(true)

function fun_t!(A, B, a)
    Threads.@threads for iz in 1:1
        @inbounds A[1] = a * @inbounds B[1] * 100.65
    end
    return
end

function fun!(A, B, a)
    @inbounds A[1] = a * @inbounds B[1] * 100.65
    return
end

function ∇_fun!(A, Ā, B, B̄, a)
    Enzyme.autodiff(Enzyme.Reverse, fun!, Duplicated(A, Ā), Duplicated(B, B̄), Const(a))
    return
end

function ∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
    Enzyme.autodiff(Enzyme.Reverse, fun_t!, Duplicated(A_t, Ā_t), Duplicated(B_t, B̄_t), Const(a))
    return
end

B = [0.9044946631433972] # 0.4, 0.0, 0.1] 
A = zeros(size(B)) 
a = 6.5

A_t = deepcopy(A)
B_t = deepcopy(B)

    dB = [655.225]
    dB_t = deepcopy(dB)
    dA = [591.7430209949891]
    dA_t = deepcopy(dA)

    ∇_fun!(A, dA, B, dB, a)
    dA .+= A
    ∇_fun_t!(A_t, dA_t, B_t, dB_t, a)
    dA_t .+= A_t

@show dB_t, dB
@show (dB_t .- dB)
(dB_t, dB) = ([387788.30291044683], [387788.3029104468])
dB_t .- dB = [5.820766091346741e-11]
wsmoses commented 1 year ago

@luraess telling Enzyme to not use fast_math on the derivative computation eliminates the difference (for the minimal case).

Enzyme.API.fast_math!(false)
wsmoses commented 1 year ago

This also resolves the total case above, which just seems to mean that fast math lets the threaded version compile differently from the non-threaded one.

A bit odd, but I'm going to chock it up to a non-Enzyme related issue and close (please reopen if you disagree/we should investigate further).

luraess commented 1 year ago

Interesting.

Testing again on the actual code on Enzyme v0.11.2, it seems that multi-threading is now delivering correct results (and setting Enzyme.API.fast_math!(false) is thus not needed). However the kernels containing Threads.@threads() fail (on ERROR: Enzyme.Compiler.EnzymeRuntimeException(Cstring(0x00000002992b8000))) unless following is set:

Enzyme.API.runtimeActivity!(true)