EnzymeAD / Enzyme.jl

Julia bindings for the Enzyme automatic differentiator
https://enzyme.mit.edu
MIT License

StackOverflow when used with Flux #1823

Open BioTurboNick opened 1 month ago

BioTurboNick commented 1 month ago

I don't know where to begin with troubleshooting, making a minimal example, or even a more specific title. This is my first time trying Enzyme.

I changed the Flux.train! call from:

Flux.train!(network, (training_data,), opt_state)

to:

Flux.train!(Duplicated(network, make_zero(network)), (training_data,), opt_state)

But I'm not sure whether that's correct; documentation on this usage is a bit sparse.
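
For context, here is a minimal sketch of the Duplicated-model form of Flux.train! that the FluxEnzymeExt extension dispatches on. The model, loss, and data below are hypothetical stand-ins, not the code from this issue:

using Flux, Enzyme

model = Chain(Dense(4 => 8, relu), Dense(8 => 1))       # stand-in for `network`
data = [(rand(Float32, 4, 16), rand(Float32, 1, 16))]   # stand-in for `training_data`
opt_state = Flux.setup(Adam(1f-3), model)

# Duplicated pairs the model with a shadow of the same structure; make_zero
# allocates that shadow, and Enzyme accumulates gradients into it.
dup = Duplicated(model, Enzyme.make_zero(model))

# train! calls the do-block loss as loss(model, batch...) for each batch in data.
Flux.train!(dup, data, opt_state) do m, x, y
    Flux.mse(m(x), y)
end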

I'm also not sure why only 22 frames are shown.

ERROR: StackOverflowError:
Stacktrace:
  [1] LLVMRunPassManager
    @ C:\Users\nicho\.julia\packages\LLVM\UqMfW\lib\15\libLLVM.jl:3385 [inlined]
  [2] run!
    @ C:\Users\nicho\.julia\packages\LLVM\UqMfW\src\passmanager.jl:39 [inlined]
  [3] #18868
    @ C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler\optimize.jl:2010 [inlined]        
  [4] LLVM.ModulePassManager(::Enzyme.Compiler.var"#18868#18875"{LLVM.Module}; kwargs::@Kwargs{})
    @ LLVM C:\Users\nicho\.julia\packages\LLVM\UqMfW\src\passmanager.jl:33
  [5] ModulePassManager
    @ C:\Users\nicho\.julia\packages\LLVM\UqMfW\src\passmanager.jl:30 [inlined]
  [6] removeDeadArgs!(mod::LLVM.Module, tm::LLVM.TargetMachine)
    @ Enzyme.Compiler C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler\optimize.jl:2008  
  [7] post_optimze!(mod::LLVM.Module, tm::LLVM.TargetMachine, machine::Bool)
    @ Enzyme.Compiler C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler\optimize.jl:2283  
  [8] post_optimze!(mod::LLVM.Module, tm::LLVM.TargetMachine)
    @ Enzyme.Compiler C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler\optimize.jl:2282  
  [9] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool)
    @ Enzyme.Compiler C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler.jl:7260
 [10] _thunk
    @ C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler.jl:7241 [inlined]
 [11] cached_compilation
    @ C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler.jl:7282 [inlined]
 [12] thunkbase(ctx::LLVM.Context, mi::Core.MethodInstance, ::Val{…}, ::Type{…}, ::Type{…}, tt::Type{…}, ::Val{…}, ::Val{…}, ::Val{…}, ::Val{…}, ::Val{…}, ::Type{…}, ::Val{…})
    @ Enzyme.Compiler C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler.jl:7355
 [13] #s2055#19000
    @ C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\compiler.jl:7407 [inlined]
 [14]
    @ Enzyme.Compiler .\none:0
 [15] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
    @ Core .\boot.jl:602
 [16] autodiff(::ReverseMode{…}, ::Const{…}, ::Type{…}, ::Const{…}, ::Duplicated{…}, ::Const{…}, ::Const{…})
    @ Enzyme C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\Enzyme.jl:263
 [17] autodiff
    @ C:\Users\nicho\.julia\packages\Enzyme\TiboG\src\Enzyme.jl:332 [inlined]
 [18] macro expansion
    @ C:\Users\nicho\.julia\packages\Flux\HBF2N\ext\FluxEnzymeExt\FluxEnzymeExt.jl:34 [inlined]  
 [19] macro expansion
    @ C:\Users\nicho\.julia\packages\ProgressLogging\6KXlp\src\ProgressLogging.jl:328 [inlined]  
 [20] train!(loss::Function, model::Duplicated{…}, data::Tuple{…}, opt::@NamedTuple{…}; cb::Nothing)
    @ FluxEnzymeExt C:\Users\nicho\.julia\packages\Flux\HBF2N\ext\FluxEnzymeExt\FluxEnzymeExt.jl:30
 [21] train!(loss::Function, model::Duplicated{DecodeNet{…}}, data::Tuple{Tuple{…}}, opt::@NamedTuple{arch::@NamedTuple{…}})
    @ FluxEnzymeExt C:\Users\nicho\.julia\packages\Flux\HBF2N\ext\FluxEnzymeExt\FluxEnzymeExt.jl:27
 [22] train_network(name::String; learning_rate_schedule::Vector{…}, training_batch_size::Int64, evaluation_batch_size::Int64, iters_per_eval::Int64, seed::Int64, decode::Bool, wandb::Bool)     
    @ Main c:\Users\nicho\Repos\DeepLoco.jl\src\train.jl:213
wsmoses commented 1 month ago

Can you include complete, runnable code to reproduce this, as well as your OS and package versions?

BioTurboNick commented 1 month ago

I'll see if I can boil this down to an MWE. In the meantime:

Julia Version 1.10.5
Commit 6f3fdf7b36 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × Snapdragon(R) X 12-core X1E80100 @ 3.40 GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, bdver1)
Threads: 1 default, 0 interactive, 1 GC (on 12 virtual cores)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS =

Status `~/Repos/DeepLoco.jl/Project.toml`
⌅ [052768ef] CUDA v5.4.3
  [082447d4] ChainRules v1.71.0
  [992eb4ea] CondaPkg v0.2.23
  [b4f34e82] Distances v0.10.11
  [31c24e10] Distributions v0.25.111
⌃ [7da242da] Enzyme v0.12.36
  [587475ba] Flux v0.14.19
  [033835bb] JLD2 v0.5.2
  [f1d291b0] MLUtils v0.4.4
  [91a5bcdd] Plots v1.40.8
  [6099a3de] PythonCall v0.9.23
  [e88e6eb3] Zygote v0.6.70
  [02a925ec] cuDNN v1.3.2
  [37e2e46d] LinearAlgebra
  [9a3f8284] Random
  [10745b16] Statistics v1.10.0

I see Enzyme just recently bumped to 0.13, but it seems Flux doesn't support it yet. Also, I realize this is x86_64 Julia running in emulation on Windows on ARM; I'll try AArch64 Julia via WSL later.
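
A hypothetical way to confirm which Enzyme version the resolver actually installed alongside Flux, and to stay on the 0.12 series while Flux's compat excludes 0.13 (the specific pin below is an assumption, not from this issue):

using Pkg
Pkg.status(["Enzyme", "Flux"])                  # show the versions the resolver actually picked
# Pkg.add(name = "Enzyme", version = "0.12")   # hypothetical: pin the 0.12 series until Flux allows 0.13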