We switch into parallel mode as soon as Julia uses more than one thread: https://github.com/EnzymeAD/Enzyme.jl/blob/c348d48706aa118a3b887b2ad1179deb99c23596/src/compiler.jl#L9011
That's cool. However, we should now figure out why it's not correct.
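For anyone reproducing this, a quick check (illustrative, not part of the original report) is to confirm how many threads Julia was started with, since that is what selects the parallel path:
# Enzyme takes the parallel path as soon as Julia runs with more than one
# thread, e.g. when launched as `julia -t 4`.
@show Threads.nthreads()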
Interestingly, running my adjoint solver (not the above MWE) on Enzyme#main now returns the following failure when using the multi-threading loop decorator:
ERROR: Enzyme execution failed.
Mismatched activity for: store {} addrspace(10)* %5, {} addrspace(10)** %.fca.0.5.gep, align 8, !dbg !148, !noalias !149 const val: {} addrspace(10)* %5
Type tree: {[-1]:Pointer, [-1,0]:Pointer, [-1,0,-1]:Float@double, [-1,8]:Integer, [-1,9]:Integer, [-1,10]:Integer, [-1,11]:Integer, [-1,12]:Integer, [-1,13]:Integer, [-1,14]:Integer, [-1,15]:Integer, [-1,16]:Integer, [-1,17]:Integer, [-1,18]:Integer, [-1,19]:Integer, [-1,20]:Integer, [-1,21]:Integer, [-1,22]:Integer, [-1,23]:Integer, [-1,24]:Integer, [-1,25]:Integer, [-1,26]:Integer, [-1,27]:Integer, [-1,28]:Integer, [-1,29]:Integer, [-1,30]:Integer, [-1,31]:Integer, [-1,32]:Integer, [-1,33]:Integer, [-1,34]:Integer, [-1,35]:Integer, [-1,36]:Integer, [-1,37]:Integer, [-1,38]:Integer, [-1,39]:Integer}
You may be using a constant variable as temporary storage for active memory (https://enzyme.mit.edu/julia/stable/#Activity-of-temporary-storage). If not, please open an issue, and either rewrite this variable to not be conditionally active or use Enzyme.API.runtimeActivity!(true) as a workaround for now
Stacktrace:
[1] macro expansion
@ ./threadingconstructs.jl:168
[2] residual_fluxes!
@ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:16
[3] residual_fluxes!
@ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:0
Stacktrace:
[1] throwerr(cstr::Cstring)
@ Enzyme.Compiler ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:2914
[2] macro expansion
@ ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:9513 [inlined]
[3] enzyme_call
@ ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:9205 [inlined]
[4] CombinedAdjointThunk
@ ~/.julia/packages/Enzyme/js9Aa/src/compiler.jl:9168 [inlined]
[5] autodiff
@ ~/.julia/packages/Enzyme/js9Aa/src/Enzyme.jl:205 [inlined]
[6] autodiff
@ ~/.julia/packages/Enzyme/js9Aa/src/Enzyme.jl:228 [inlined]
[7] autodiff
@ ~/.julia/packages/Enzyme/js9Aa/src/Enzyme.jl:214 [inlined]
[8] ∇_residual_fluxes!
@ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:58 [inlined]
[9] main()
@ Main /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:191
[10] top-level scope
@ /scratch-1/luraess/shiny-code/Julia-HPC-Scales/geothermal_2D_acc_kp_enzyme_2.jl:227
What happens if you turn on the flag like it asks?
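For concreteness, enabling the flag the error message suggests is a one-line global switch that has to run before the autodiff call, e.g.:
using Enzyme
# Workaround suggested in the error message above: enable runtime activity
# before the reverse-mode autodiff call is compiled.
Enzyme.API.runtimeActivity!(true)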
@luraess regarding the original issue, can you print the output of the very first iteration for the failing and non-failing cases?
We should be able to see a difference there (hopefully) and thus debug one call.
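A rough sketch of that comparison, reusing the ∇_fun!/∇_fun_t! names and adjoint variables from the MWE further down in this thread (illustrative only, not the original debugging code):
# Run one iteration of the serial and the threaded variant and print the
# adjoints, so the first point of divergence becomes visible.
∇_fun!(A, Ā, B, B̄, a)
∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
@show Ā Ā_t B̄ B̄_t
@show maximum(abs.(B̄_t .- B̄))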
Bumping to check whether you've been able to minimize.
Yeah, using the above MWE, one needs 2 iterations for the error to appear, which returns
sum(Ā_t .- Ā) = 0.0
sum(B̄_t .- B̄) = 9.650966603658162e-8
for the following loop:
for _ in 1:2
    ∇_fun!(A, Ā, B, B̄, a)
    Ā .+= A
    ∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
    Ā_t .+= A_t
end
EDIT: this behaviour is observed here running on a single thread, implying buggy execution of the Threads.@threads-decorated loop even when using a single thread.
It looks like the issue here is that the individual values in the matrix differ by ~10e-10, and summing them all up reaches the 10e-8 level. These are very small differences, but they can still be looked into.
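To put those magnitudes in perspective (a back-of-the-envelope illustration, not from the thread): a hundred or so per-element discrepancies of order 1e-10 are already enough to reach the 1e-8 level in the sum.
# Illustration only: summing ~100 per-element differences of ~1e-10
# gives ~1e-8, the order of magnitude seen in sum(B̄_t .- B̄).
@show sum(fill(1e-10, 100))   # ≈ 1.0e-8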
using Enzyme
Enzyme.API.printall!(true)
# Threaded variant: single-trip Threads.@threads loop around the update.
function fun_t!(A, B, a)
    Threads.@threads for iz in 1:1
        @inbounds A[1] = a * @inbounds B[1] * 100.65
    end
    return
end

# Serial variant of the same kernel.
function fun!(A, B, a)
    @inbounds A[1] = a * @inbounds B[1] * 100.65
    return
end

# Reverse-mode gradient of the serial kernel.
function ∇_fun!(A, Ā, B, B̄, a)
    Enzyme.autodiff(Enzyme.Reverse, fun!, Duplicated(A, Ā), Duplicated(B, B̄), Const(a))
    return
end

# Reverse-mode gradient of the threaded kernel.
function ∇_fun_t!(A_t, Ā_t, B_t, B̄_t, a)
    Enzyme.autodiff(Enzyme.Reverse, fun_t!, Duplicated(A_t, Ā_t), Duplicated(B_t, B̄_t), Const(a))
    return
end
B = [0.9044946631433972] # 0.4, 0.0, 0.1]
A = zeros(size(B))
a = 6.5
A_t = deepcopy(A)
B_t = deepcopy(B)
dB = [655.225]
dB_t = deepcopy(dB)
dA = [591.7430209949891]
dA_t = deepcopy(dA)
∇_fun!(A, dA, B, dB, a)
dA .+= A
∇_fun_t!(A_t, dA_t, B_t, dB_t, a)
dA_t .+= A_t
@show dB_t, dB
@show (dB_t .- dB)
(dB_t, dB) = ([387788.30291044683], [387788.3029104468])
dB_t .- dB = [5.820766091346741e-11]
@luraess telling Enzyme not to use fast_math on the derivative computation eliminates the difference (for the minimal case):
Enzyme.API.fast_math!(false)
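A minimal sketch of where that call goes relative to the MWE above; it has to run before the first derivative is compiled:
using Enzyme
# Disable fast-math in the generated derivative code, then rerun the
# comparison from the MWE (with freshly initialized A, B, dA, dB);
# the dB_t .- dB difference is reported to vanish.
Enzyme.API.fast_math!(false)
∇_fun!(A, dA, B, dB, a)
∇_fun_t!(A_t, dA_t, B_t, dB_t, a)
@show dB_t .- dB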
This also resolves the total case above, which seems to mean that fast math lets the threaded version compile differently from the non-threaded one.
A bit odd, but I'm going to chalk it up to a non-Enzyme-related issue and close (please reopen if you disagree / we should investigate further).
Interesting.
Testing again on the actual code with Enzyme v0.11.2, it seems that multi-threading now delivers correct results (so setting Enzyme.API.fast_math!(false) is not needed). However, the kernels containing Threads.@threads fail with
ERROR: Enzyme.Compiler.EnzymeRuntimeException(Cstring(0x00000002992b8000))
unless the following is set:
Enzyme.API.runtimeActivity!(true)
Using Enzyme v0.11.1 and main with Julia 1.9.1 on AMD, Intel and M2 CPUs gives erroneous results when using autodiff_deferred in Reverse mode in a kernel whose outer loop is decorated with Threads.@threads and more than 1 thread is used. Note that this behavior also occurs when using autodiff. The MWE above shows the "drift" which leads e.g. an iterative procedure to diverge, with sum(B̄_t .- B̄) being ≠ 0. (This may also potentially relate to #611.)
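Since the report mentions autodiff_deferred while the MWE above uses autodiff, here is a hedged sketch of the deferred call for the same threaded kernel, assuming Enzyme v0.11's autodiff_deferred(::ReverseMode, f, args...) form and reusing fun_t! from the MWE:
# Hedged sketch (not from the original report): drive the threaded kernel
# through autodiff_deferred instead of autodiff.
function ∇_fun_t_deferred!(A_t, Ā_t, B_t, B̄_t, a)
    Enzyme.autodiff_deferred(Enzyme.Reverse, fun_t!,
                             Duplicated(A_t, Ā_t), Duplicated(B_t, B̄_t), Const(a))
    return
end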