JuliaNLSolvers / Optim.jl

Optimization functions for Julia

fix `reset_search_direction!` failure when training with GPU #1034

Closed wei3li closed 11 months ago

wei3li commented 1 year ago

When training on a GPU, the following error sometimes occurs, causing `optimize` to fail.

ERROR: GPU compilation of kernel #broadcast_kernel#28(CUDA.CuKernelContext, CuDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64) failed
KernelError: passing and using non-bitstype argument

Argument 4 to your kernel function is of type Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, which is not isbits:
  .args is of type Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}} which is not isbits.
    .1 is of type Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}} which is not isbits.
      .x is of type Matrix{Float64} which is not isbits.

Stacktrace:
  [1] check_invocation(job::GPUCompiler.CompilerJob)
    @ GPUCompiler ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/validation.jl:88
  [2] macro expansion
    @ ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:154 [inlined]
  [3] macro expansion
    @ ~/code/Programs/.julia/packages/TimerOutputs/LHjFw/src/TimerOutput.jl:253 [inlined]
  [4] macro expansion
    @ ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:152 [inlined]
  [5] emit_julia(job::GPUCompiler.CompilerJob; validate::Bool)
    @ GPUCompiler ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:83
  [6] emit_julia
    @ ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/utils.jl:77 [inlined]
  [7] cufunction_compile(job::GPUCompiler.CompilerJob, ctx::LLVM.Context)
    @ CUDA ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:359
  [8] #221
    @ ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:354 [inlined]
  [9] JuliaContext(f::CUDA.var"#221#222"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams, GPUCompiler.FunctionSpec{GPUArrays.var"#broadcast_kernel#28", Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}}})
    @ GPUCompiler ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/driver.jl:76
 [10] cufunction_compile(job::GPUCompiler.CompilerJob)
    @ CUDA ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:353
 [11] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/code/Programs/.julia/packages/GPUCompiler/S3TWf/src/cache.jl:90
 [12] cufunction(f::GPUArrays.var"#broadcast_kernel#28", tt::Type{Tuple{CUDA.CuKernelContext, CuDeviceMatrix{Float64, 1}, Base.Broadcast.Broadcasted{CUDA.CuArrayStyle{2}, Tuple{Base.OneTo{Int64}, Base.OneTo{Int64}}, typeof(identity), Tuple{Base.Broadcast.Extruded{Matrix{Float64}, Tuple{Bool, Bool}, Tuple{Int64, Int64}}}}, Int64}}; name::Nothing, always_inline::Bool, kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ CUDA ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:306
 [13] cufunction
    @ ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:300 [inlined]
 [14] macro expansion
    @ ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/compiler/execution.jl:102 [inlined]
 [15] #launch_heuristic#245
    @ ~/code/Programs/.julia/packages/CUDA/ZdCxS/src/gpuarrays.jl:17 [inlined]
 [16] _copyto!
    @ ~/code/Programs/.julia/packages/GPUArrays/XR4WO/src/host/broadcast.jl:65 [inlined]
 [17] materialize!
    @ ~/code/Programs/.julia/packages/GPUArrays/XR4WO/src/host/broadcast.jl:41 [inlined]
 [18] materialize!
    @ ./broadcast.jl:868 [inlined]
 [19] reset_search_direction!(state::Optim.BFGSState{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, d::Optim.ManifoldObjective{OnceDifferentiable{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}}, method::BFGS{LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Nothing, Nothing, Flat})
    @ Optim ~/code/Programs/.julia/packages/Optim/tP8PJ/src/utilities/perform_linesearch.jl:17
 [20] perform_linesearch!(state::Optim.BFGSState{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, method::BFGS{LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Nothing, Nothing, Flat}, d::Optim.ManifoldObjective{OnceDifferentiable{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}})
    @ Optim ~/code/Programs/.julia/packages/Optim/tP8PJ/src/utilities/perform_linesearch.jl:45
 [21] update_state!(d::OnceDifferentiable{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, state::Optim.BFGSState{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, method::BFGS{LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Nothing, Nothing, Flat})
    @ Optim ~/code/Programs/.julia/packages/Optim/tP8PJ/src/multivariate/solvers/first_order/bfgs.jl:139
 [22] optimize(d::OnceDifferentiable{Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, initial_x::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, method::BFGS{LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Nothing, Nothing, Flat}, options::Optim.Options{Float64, Nothing}, state::Optim.BFGSState{CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}, Float64, CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}})
    @ Optim ~/code/Programs/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/optimize.jl:54
 [23] optimize
    @ ~/code/Programs/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/optimize.jl:36 [inlined]
 [24] optimize(f::NLSolversBase.InplaceObjective{Nothing, var"#fg!#8"{typeof(loss)}, Nothing, Nothing, Nothing}, initial_x::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, method::BFGS{LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Nothing, Nothing, Flat}, options::Optim.Options{Float64, Nothing}; inplace::Bool, autodiff::Symbol)
    @ Optim ~/code/Programs/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/interface.jl:142
 [25] optimize(f::NLSolversBase.InplaceObjective{Nothing, var"#fg!#8"{typeof(loss)}, Nothing, Nothing, Nothing}, initial_x::CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, method::BFGS{LineSearches.InitialStatic{Float64}, LineSearches.HagerZhang{Float64, Base.RefValue{Bool}}, Nothing, Nothing, Flat}, options::Optim.Options{Float64, Nothing})
    @ Optim ~/code/Programs/.julia/packages/Optim/tP8PJ/src/multivariate/optimize/interface.jl:141

This is because the code tries to broadcast between host memory and GPU memory. The `Matrix` constructor allocates a matrix in host (CPU) memory, but when training on a GPU, `state.invH` is a `CuArray` that lives in GPU memory.
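As a rough illustration (names and code are a sketch, not Optim's exact internals, and plain `Array`s stand in for `CuArray`s since reproducing the failure needs a GPU), the problematic pattern and a device-agnostic alternative look like this:

```julia
using LinearAlgebra

# Before the fix, reset_search_direction! did roughly:
#
#     state.invH .= Matrix(scale * I, n, n)
#
# `Matrix` always allocates in host (CPU) memory, so when `state.invH` is a
# `CuArray` the broadcast mixes a device array with a host array, and the GPU
# kernel compiler rejects the non-isbits `Matrix{Float64}` argument.
#
# A device-agnostic reset avoids constructing a host-side matrix entirely:
function reset_invH!(invH::AbstractMatrix, scale=one(eltype(invH)))
    T = eltype(invH)
    invH .= zero(T)                  # in-place zero via broadcast (CPU and GPU)
    invH[diagind(invH)] .= T(scale)  # vectorized diagonal fill
    return invH
end

A = fill(2.0, 3, 3)
reset_invH!(A)                       # A is now the 3x3 identity
```

The key point is that every operation stays on whatever device already holds `invH`, instead of forcing a round trip through a freshly allocated host `Matrix`.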

wei3li commented 1 year ago

Hi @pkofod, could you please take a look at this pull request at your convenience?

pkofod commented 1 year ago

Thank you. Yes, it seems strange to require `Matrix` at that point when it was not required when initializing the types. I would request that you add a test of this bugfix. Thanks!

codecov[bot] commented 1 year ago

Codecov Report

Merging #1034 (8ad4fab) into master (1f1258c) will decrease coverage by 0.04%. The diff coverage is 100.00%.

:exclamation: Current head 8ad4fab differs from pull request most recent head 1bfc4d8. Consider uploading reports for the commit 1bfc4d8 to get more accurate results

@@            Coverage Diff             @@
##           master    #1034      +/-   ##
==========================================
- Coverage   85.40%   85.36%   -0.04%     
==========================================
  Files          43       43              
  Lines        3199     3198       -1     
==========================================
- Hits         2732     2730       -2     
- Misses        467      468       +1     
Impacted Files Coverage Δ
src/utilities/perform_linesearch.jl 88.57% <100.00%> (-0.32%) :arrow_down:

... and 1 file with indirect coverage changes


wei3li commented 1 year ago

I would request that you add a test of this bugfix.

Hi @pkofod, this bug only occurs when optimizing with the BFGS method on a GPU. After reviewing the current test cases, I noticed that none of them run on a GPU. I am uncertain whether it is wise to introduce GPU tests solely for this bug fix. Considering that the fix passes all current CPU test cases, would it be better to keep it as is?
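For reference, a guarded GPU regression test could look roughly like the sketch below. This is hypothetical: it assumes CUDA.jl is available, skips itself when no functional GPU is present, and the objective is made up for illustration.

```julia
using Test, Optim
using CUDA

# In-place value-and-gradient for f(x) = sum(x.^2); uses only broadcasts,
# so it works unchanged on both Array and CuArray inputs.
function fg!(F, G, x)
    G !== nothing && (G .= 2 .* x)
    F !== nothing && return sum(abs2, x)
    return nothing
end

if CUDA.functional()
    @testset "BFGS on GPU (reset_search_direction! regression)" begin
        x0 = CUDA.fill(1.0, 5)
        res = Optim.optimize(Optim.only_fg!(fg!), x0, BFGS())
        @test Optim.converged(res)
    end
end
```

Guarding on `CUDA.functional()` means the test is a no-op on CPU-only CI, which is one common way projects keep an optional GPU test without requiring GPU runners.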

pkofod commented 11 months ago

Thanks

ChrisRackauckas commented 9 months ago

This isn't fully generic, so it breaks a lot downstream. Can it be reverted or fixed to allow generic arrays?

pkofod commented 9 months ago

I suppose it’s irrelevant given https://github.com/SciML/NeuralPDE.jl/pull/751#issuecomment-1751409833 ?

pkofod commented 9 months ago

I think the original "GPU compatibility" was made by a SciML contributor, but apparently a test covering the SciML-relevant code was not added.

ChrisRackauckas commented 9 months ago

It's not irrelevant; the downstream fix there was done by upper-bounding Optim in Optimization.jl.

pkofod commented 9 months ago

What works for ComponentArrays? `scale*I + 0*state.invH`, I suppose?
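For plain arrays the `scale*I + 0*state.invH` idea reads as below (a sketch in the spirit of the suggestion, not Optim's actual `_init_identity_matrix`): `0 * invH` preserves the concrete array type of the input, and adding the `UniformScaling` places `scale` on the diagonal without ever allocating a host-side `Matrix` by hand.

```julia
using LinearAlgebra

# Hypothetical helper; `0 * invH` keeps the input's array type, so the
# result matches `invH`'s storage rather than always being a host Matrix.
make_identity(invH, scale=one(eltype(invH))) = scale * I + 0 * invH

A = rand(Float64, 4, 4)
make_identity(A)        # 4x4 identity with the same array type as A
make_identity(A, 2.5)   # 2.5 on the diagonal, zeros elsewhere
```

Whether this path is efficient (it allocates a fresh array rather than filling in place) is a separate question, but it stays generic over array types that support scaling and `UniformScaling` addition.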

pkofod commented 9 months ago

@wei3li could you try master / 1.7.8 when it's tagged on your GPU problem? This should have used the internal functions we use initially to set the invH-matrices :) (_init_identity_matrix)

wei3li commented 9 months ago

@wei3li could you try master / 1.7.8 when it's tagged on your GPU problem? This should have used the internal functions we use initially to set the invH-matrices :) (_init_identity_matrix)

Hi @pkofod, the GPU problem I was experiencing has been resolved in version v1.7.8. Thank you for the update!