JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Enzyme: Support for reductions with GPU broadcasting #2455

Open jgreener64 opened 1 month ago

jgreener64 commented 1 month ago

Describe the bug

Reductions over GPU broadcasts error when differentiated with Enzyme. @wsmoses suggested I open an issue here.

To reproduce

The Minimal Working Example (MWE) for this bug:

using Enzyme, CUDA
f(x, y) = sum(x .+ y)
x = CuArray(rand(5))
y = CuArray(rand(5))
dx = CuArray([1.0, 0.0, 0.0, 0.0, 0.0])
autodiff(Reverse, f, Active, Duplicated(x, dx), Const(y))
ERROR: Enzyme execution failed.
Enzyme compilation failed.

No create nofree of empty function (jl_gc_safe_enter) jl_gc_safe_enter)
 at context:   call fastcc void @julia__launch_configuration_979_4373([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %7, i64 noundef signext 0, { i64, {} addrspace(10)* } addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(32) %45) #715, !dbg !1090 (julia__launch_configuration_979_4373)

Stacktrace:
 [1] launch_configuration
   @ ~/.julia/dev/CUDA/lib/cudadrv/occupancy.jl:56
 [2] #launch_heuristic#1204
   @ ~/.julia/dev/CUDA/src/gpuarrays.jl:22
 [3] launch_heuristic
   @ ~/.julia/dev/CUDA/src/gpuarrays.jl:15
 [4] _copyto!
   @ ~/.julia/packages/GPUArrays/bbZD0/src/host/broadcast.jl:78
 [5] copyto!
   @ ~/.julia/packages/GPUArrays/bbZD0/src/host/broadcast.jl:44
 [6] copy
   @ ~/.julia/packages/GPUArrays/bbZD0/src/host/broadcast.jl:29
 [7] materialize
   @ ./broadcast.jl:903
 [8] f
   @ ./REPL[2]:1

Stacktrace:
  [1] throwerr(cstr::Cstring)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:1797
  [2] launch_configuration
    @ ~/.julia/dev/CUDA/lib/cudadrv/occupancy.jl:56 [inlined]
  [3] #launch_heuristic#1204
    @ ~/.julia/dev/CUDA/src/gpuarrays.jl:22 [inlined]
  [4] launch_heuristic
    @ ~/.julia/dev/CUDA/src/gpuarrays.jl:15 [inlined]
  [5] _copyto!
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/broadcast.jl:78 [inlined]
  [6] copyto!
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/broadcast.jl:44 [inlined]
  [7] copy
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/broadcast.jl:29 [inlined]
  [8] materialize
    @ ./broadcast.jl:903 [inlined]
  [9] f
    @ ./REPL[2]:1 [inlined]
 [10] diffejulia_f_2820wrap
    @ ./REPL[2]:0
 [11] macro expansion
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6819 [inlined]
 [12] enzyme_call
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6419 [inlined]
 [13] CombinedAdjointThunk
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6296 [inlined]
 [14] autodiff
    @ ~/.julia/dev/Enzyme/src/Enzyme.jl:314 [inlined]
 [15] autodiff(::ReverseMode{…}, ::typeof(f), ::Type{…}, ::Duplicated{…}, ::Const{…})
    @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:326
 [16] top-level scope
    @ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.

Forward mode also fails. This is with Julia 1.10.3, Enzyme 0.12.26, GPUCompiler 0.26.7 and CUDA d7077da2b7df32f9d0a2bced56511cdd778ab4ed.
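For comparison, the same reduction differentiates without error on plain CPU arrays, which suggests the failure is specific to the GPU broadcast launch machinery rather than Enzyme's handling of `sum` itself (a minimal check, run in a fresh session separate from the failing MWE):

```julia
using Enzyme

f(x, y) = sum(x .+ y)
x = rand(5)
y = rand(5)
dx = zeros(5)

# Same call as the MWE, but with Array instead of CuArray.
autodiff(Reverse, f, Active, Duplicated(x, dx), Const(y))

# d(sum(x .+ y))/dx[i] == 1 for every i, so dx should now be all ones.
dx
```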

Julia Version 1.10.3
Commit 0b4590a5507 (2024-04-30 10:59 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 36 × Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 18 default, 0 interactive, 9 GC (on 36 virtual cores)
Environment:
  LD_LIBRARY_PATH = /usr/local/gromacs/lib

Details on CUDA:

CUDA runtime 12.5, artifact installation
CUDA driver 12.5
NVIDIA driver 535.183.1, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+535.183.1

Julia packages: 
- CUDA: 5.4.3
- CUDA_Driver_jll: 0.9.1+1
- CUDA_Runtime_jll: 0.14.1+0

Toolchain:
- Julia: 1.10.3
- LLVM: 15.0.7

2 devices:
  0: NVIDIA RTX A6000 (sm_86, 46.970 GiB / 47.988 GiB available)
  1: NVIDIA RTX A6000 (sm_86, 4.046 GiB / 47.988 GiB available)
wsmoses commented 1 month ago

Hm, that error looks like you're not running with the Enzyme CUDA extension package. With it I get an error in mapreduce!

In essence we just ought to properly define the derivative kernel for that, so I'd argue it's more feature development than a bug.


maleadt commented 1 month ago

I don't see how this is a CUDA.jl issue.

wsmoses commented 1 month ago

Sorry, I mentioned this in the earlier issue in Enzyme.jl -- I recommended Joe open an issue here since I think the resolution is extending the Enzyme CUDA ext with a rule that says the derivative of https://github.com/JuliaGPU/CUDA.jl/blob/d7077da2b7df32f9d0a2bced56511cdd778ab4ed/src/mapreduce.jl#L169 is [corresponding derivative fn].
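A sketch of what such a rule could look like, using Enzyme's custom-rule interface. This targets `sum` over a `CuArray` as a simplified stand-in; whether the right hook is `sum` or CUDA.jl's internal mapreduce kernel, and the exact method signature to intercept, are assumptions for illustration, not the actual extension code:

```julia
using Enzyme, CUDA
import Enzyme.EnzymeRules

# Hypothetical reverse rule for summing a CuArray. No tape is needed because
# the pullback of sum does not depend on the primal values.
function EnzymeRules.augmented_primal(config, func::Const{typeof(sum)},
                                      ::Type{<:Active}, x::Duplicated{<:CuArray})
    primal = EnzymeRules.needs_primal(config) ? func.val(x.val) : nothing
    return EnzymeRules.AugmentedReturn(primal, nothing, nothing)
end

function EnzymeRules.reverse(config, func::Const{typeof(sum)},
                             dret::Active, tape, x::Duplicated{<:CuArray})
    # d(sum(x))/dx[i] == 1, so each element's shadow accumulates the
    # output adjoint; this broadcast runs as a single GPU kernel.
    x.dval .+= dret.val
    return (nothing,)
end
```

With a rule like this registered, Enzyme would never differentiate through `launch_configuration`, sidestepping the `jl_gc_safe_enter` failure above.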

maleadt commented 1 month ago

Fair enough! Hope you don't mind me assigning the issue to you then 🙂

wsmoses commented 1 month ago

Oh yeah for sure, kind of assumed that :P