Closed · jlk9 closed this 5 months ago
I call @synchronize() at the end of GradientOnEdgeModified, though.
This has no effect since you are not performing shared memory operations.
That's what I thought; I still mentioned it for completeness. I guess the KA.synchronize(backend) call should synchronize global memory, but looking at the cpu.jl file, it currently does nothing.
A yield() in the position of the @show is sufficient.
I guess the KA.synchronize(backend) call should synchronize global memory, but it currently does nothing looking
All kernel launches are synchronized on the CPU. https://github.com/JuliaGPU/KernelAbstractions.jl/blob/4285051c3ea949aee8a024383f38c8849f602729/src/cpu.jl#L89-L90
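To make the point above concrete, here is a minimal toy sketch (not code from this thread) showing the CPU-backend behavior: the launch itself already blocks, so the explicit synchronize afterwards is effectively a no-op on the CPU.

```julia
using KernelAbstractions

# Toy kernel purely for illustration
@kernel function double!(a)
    i = @index(Global)
    a[i] *= 2
end

backend = CPU()
a = collect(1.0:4.0)

# On the CPU backend the launch blocks before returning, so the
# synchronize call below has nothing left to wait for (see linked cpu.jl)
double!(backend)(a; ndrange = length(a))
KernelAbstractions.synchronize(backend)

@assert a == [2.0, 4.0, 6.0, 8.0]
```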
This is very weird: when I comment out the call to Enzyme, this spooky action at a distance disappears as well.
The same happens when moving the FD calls before the Enzyme calls. So I would still suspect Enzyme is doing something here that for some reason causes it to overlap with the FD code and corrupt some data...
Check if the Mesh is correct after Enzyme.
I'll check that! The one caveat with Enzyme being the culprit: getting rid of the MOKA mesh component and replacing it with a stand-in struct (you can see that in the commented-out portion of bug.jl) also removed the bug, even with Enzyme still there. That doesn't rule out the idea of Enzyme affecting the mesh, though.
Yeah, my one hypothesis is that Enzyme creates a task that is not waited upon... and then modifies the mesh as well... Each alone would be a very weird bug indeed, but together they make me feel ridiculous for proposing it.
Looks like the autodiff call permutes mesh.Edges.dcEdge.
Ok. So setting mesh.Edges.dcEdge to its pre-Enzyme value after calling Enzyme.autodiff makes the FD results correct:

mesh.Edges.dcEdge[:] = old_mesh.Edges.dcEdge[:]

where old_mesh is a deepcopy() of mesh before the autodiff call. Interesting that dcEdge is used in the kernel, but not modified.
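A self-contained sketch of that workaround, with a toy nested struct standing in for the MOKA mesh (the EdgeData/Mesh names here are hypothetical, and the permutation is simulated since the real one came out of the autodiff call):

```julia
# Toy stand-in for the MOKA mesh structure
struct EdgeData
    dcEdge::Vector{Float64}
end
struct Mesh
    Edges::EdgeData
end

mesh = Mesh(EdgeData([1.0, 2.0, 3.0]))
old_mesh = deepcopy(mesh)            # snapshot before Enzyme.autodiff

# ... the Enzyme.autodiff call would go here; per this thread it left
# mesh.Edges.dcEdge permuted, which we simulate:
reverse!(mesh.Edges.dcEdge)

# restore the pre-Enzyme values so the subsequent FD results come out correct
mesh.Edges.dcEdge[:] = old_mesh.Edges.dcEdge[:]
@assert mesh.Edges.dcEdge == [1.0, 2.0, 3.0]
```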
Since this bug seems to be more of an Enzyme issue I've opened one there: https://github.com/EnzymeAD/Enzyme.jl/issues/1569. Enzyme incorrectly permutes the array mesh.Edges.dcEdge, which altered the FD results.
What happens if you do autodiff but all the values are constant? Also, can you isolate the error without as many dependencies?
Per discussion here, I believe Enzyme is correct: https://github.com/EnzymeAD/Enzyme.jl/issues/1569#issuecomment-2189692029
Specifically, it behaves the same as calling the original code.
Enzyme.Const != immutable; it just means not differentiated. A variable which is Const but modified in place in the function being differentiated will also be updated in place the same way during AD.
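A small toy example of that distinction (not thread code): a Const array mutated inside the differentiated function is still mutated during AD, exactly as in a plain call.

```julia
using Enzyme

# Const marks an argument as not-differentiated; it does not freeze it.
function f!(out, x, counter)
    counter[1] += 1        # in-place update of a Const argument
    out[1] = x[1]^2
    return nothing
end

x, dx   = [3.0], [0.0]
out     = [0.0]
dout    = [1.0]
counter = [0.0]

Enzyme.autodiff(Reverse, f!, Const,
                Duplicated(out, dout), Duplicated(x, dx), Const(counter))

# dx[1] is now 6.0 (d/dx of x^2 at x = 3), and counter[1] is no longer
# 0.0: the primal pass ran the in-place update despite the Const marking
@assert dx[1] == 6.0
@assert counter[1] != 0.0
```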
Wait never mind, it looks like my computer just simply did not reproduce your issue.
What version of Enzyme/KA/etc are you using?
I'm at:
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ ls^C
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ ~/git/Enzyme.jl/julia-1.10.2/bin/julia --project
               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.2 (2024-03-01)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |
(MOKA) pkg> st
Project MOKA v0.1.0
Status `~/git/Enzyme.jl/MPAS-Ocean.jl/Project.toml`
⌃ [21141c5a] AMDGPU v0.9.5
[7d9f7c33] Accessors v0.1.36
[79e6a3ab] Adapt v4.0.4
[6e4b80f9] BenchmarkTools v1.5.0
[179af706] CFTime v0.1.3
[052768ef] CUDA v5.4.2
[8bb1440f] DelimitedFiles v1.9.1
[7da242da] Enzyme v0.12.20 `..`
⌃ [28b8d3ca] GR v0.73.5
⌃ [63c18a36] KernelAbstractions v0.9.20
⌃ [da04e1cc] MPI v0.20.8
[3da0fdf6] MPIPreferences v0.1.11
[85f8d34a] NCDatasets v0.14.4
[91a5bcdd] Plots v1.40.4
[295af30f] Revise v3.5.14
[09ab397b] StructArrays v0.6.18
[3a884ed6] UnPack v1.0.2
[ddb6d928] YAML v0.4.11
[7cb0a576] MPICH_jll v4.2.1+1
[ade2ca70] Dates
[56ddb016] Logging
[10745b16] Statistics v1.10.0
Info Packages marked with ⌃ have new versions available and may be upgradable.
(MOKA) pkg> ^C
julia>
wmoses@beast:~/git/Enzyme.jl/MPAS-Ocean.jl (bug-reduce) $ cd ..
wmoses@beast:~/git/Enzyme.jl (uparm) $ git log
commit 758e7b42c45d1667c774fdbef938c349a414fa40 (HEAD -> uparm, origin/uparm)
Author: William S. Moses <gh@wsmoses.com>
Date: Tue Jun 25 14:05:19 2024 -0400
uparm
commit f7ef35364f818539a494e6381100e6ac921ff339 (tag: v0.12.19, origin/main, origin/HEAD)
Author: William Moses <gh@wsmoses.com>
Date: Tue Jun 25 00:43:06 2024 -0400
Update Project.toml
grad and mesh are an array and a struct respectively, so they're duplicated objects. I tried replacing HorzMesh with a self-written struct in the file to remove the MOKA dependency; I'll try again.
@jlk9 can you try using Duplicated instead of Const? If that fixes it, it's a case of "shadow confusion", but even then I am confused about why a @show or yield() changes things.
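For reference, a minimal sketch of what the suggested change looks like, using a hypothetical ToyMesh in place of MOKA's HorzMesh:

```julia
using Enzyme

# Hypothetical stand-in for the mesh: a struct whose array field is
# read by the differentiated function
struct ToyMesh
    dcEdge::Vector{Float64}
end

norm_sq(x, mesh) = sum((x ./ mesh.dcEdge) .^ 2)

x, dx = [1.0, 2.0], [0.0, 0.0]
mesh  = ToyMesh([2.0, 4.0])
dmesh = ToyMesh(zeros(2))   # zero-initialized shadow of the same shape

# Duplicated gives Enzyme a separate shadow object to write derivatives
# into, instead of risking shadow confusion with a Const-marked mutable
Enzyme.autodiff(Reverse, norm_sq, Active,
                Duplicated(x, dx), Duplicated(mesh, dmesh))

# dx holds d/dx of sum((x ./ dcEdge).^2): here [0.5, 0.25]
@assert dx ≈ [0.5, 0.25]
```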
Mentioned in the Enzyme issue too: I reduced the example to eliminate the dependency on MOKA.jl. Now I'm just using a stand-in struct in bugMesh.jl that contains the dcEdges array.
@wsmoses Missed your question about Enzyme and KA versions, sorry! I was responding to some things on my phone. I'm on Enzyme 0.12.15 (what you get with a plain add Enzyme) and KA 0.9.20.
Replied in other issue, but repasting here (https://github.com/EnzymeAD/Enzyme.jl/issues/1569#issuecomment-2191290389)
Okay, I see what's happening here: HorzMesh should likely not be marked inactive, since you are differentiating wrt data within it. This is implicitly causing runtime-activity-style issues, but not throwing an error for them.
Let's continue this on Enzyme.jl.
The only KA-relevant part might be the @show that is needed for it to trigger.
This function computes a numerical gradient, then sums up the results using a KA kernel run on the CPU:
Computing numerical derivatives of this function produces the following output:
where Enzyme is an AD tool, and the finite difference computation uses these values:
This makes sense: finite differences and an AD tool produce very similar results. However, what if we comment out the @show statement in gradient_normSq? Then we get this result:

The AD tool still works fine, but the perturbed norm computations for FD are now wrong. Nothing was changed except removing an @show statement.

The code producing this error can be found in https://github.com/jlk9/MPAS-Ocean.jl/tree/bug-reduce, in the file bug.jl. I tried reducing it further, but getting rid of the mesh objects from the MPAS-Ocean package or the Enzyme call caused the bug to disappear. Those don't directly affect the results of normSqP and normSqM, so maybe it's an issue of timing and CPU kernels being asynchronous? I call @synchronize() at the end of GradientOnEdgeModified, though.
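For readers without the repo handy, here is a hypothetical sketch of the kernel shape described above (the name comes from this thread, but the body is invented): it reads dcEdge without ever writing it, and ends with @synchronize, which only orders shared-memory operations within a workgroup.

```julia
using KernelAbstractions

# Invented body; only the overall shape matches the thread's description
@kernel function GradientOnEdgeModified!(grad, s, dcEdge)
    i = @index(Global)
    grad[i] = s[i] / dcEdge[i]
    @synchronize()   # workgroup barrier; does not flush global memory
end

backend = CPU()
dcEdge = [2.0, 4.0]
s      = [1.0, 1.0]
grad   = zeros(2)
GradientOnEdgeModified!(backend)(grad, s, dcEdge; ndrange = 2)
KernelAbstractions.synchronize(backend)

@assert grad == [0.5, 0.25]
```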