EnzymeAD / Enzyme.jl

Julia bindings for the Enzyme automatic differentiator
https://enzyme.mit.edu
MIT License

`CuArray` broadcasting #1454

Closed by jgreener64 3 months ago

jgreener64 commented 5 months ago

Opening this to track progress on taking gradients through `CuArray` broadcasting. With Enzyme main (a68bf83) and CUDA.jl v5.3.4 on Julia 1.10.2:

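using Enzyme, CUDA
# Sum over a broadcast (elementwise) addition; the broadcast is the part being differentiated.
f(x, y) = sum(x .+ y)
x = CuArray(rand(5))
y = CuArray(rand(5))
# Shadow of x, passed alongside x via Duplicated(x, dx) in the calls below.
dx = CuArray([1.0, 0.0, 0.0, 0.0, 0.0])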
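
For reference, here is a minimal CPU-only sketch of the values a working derivative of this `f` would be expected to give. It uses central finite differences on plain `Array`s instead of Enzyme or CUDA, so it says nothing about the GPU code path itself. The helpers `directional_fd` and `basis` are hypothetical names introduced only for this illustration and are not part of Enzyme or CUDA.jl.

```julia
# CPU-only reference check; needs neither Enzyme nor CUDA.
f(x, y) = sum(x .+ y)

x = rand(5)
y = rand(5)
dx = [1.0, 0.0, 0.0, 0.0, 0.0]

# Hypothetical helper (not an Enzyme API): central finite difference of f
# with respect to x along the direction v.
function directional_fd(f, x, y, v; h = 1e-6)
    return (f(x .+ h .* v, y) - f(x .- h .* v, y)) / (2h)
end

# Hypothetical helper: i-th unit vector of length n.
basis(i, n) = [j == i ? 1.0 : 0.0 for j in 1:n]

# Directional derivative along dx, i.e. the tangent a forward-mode call
# seeded with dx would be expected to return.
fwd_expected = directional_fd(f, x, y, dx)

# Full gradient with respect to x, built one coordinate at a time.
grad_expected = [directional_fd(f, x, y, basis(i, length(x))) for i in eachindex(x)]

println(fwd_expected)
println(grad_expected)
```

Because `f` is linear in `x`, every entry should come out approximately equal to 1, up to floating-point rounding. This gives a simple target to compare against once the `CuArray` versions run.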

For forward mode:

autodiff(Forward, f, Duplicated, Duplicated(x, dx), Const(y))
[3915854] signal (11.2): Segmentation fault
in expression starting at REPL[12]:1
unknown function (ip: 0x7f9c7fcaf1b0)
visitIntrinsicInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:3696
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
GetOrCreateShadowFunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/GradientUtils.cpp:4622
invertPointerM at /workspace/srcdir/Enzyme/enzyme/Enzyme/GradientUtils.cpp:5533
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4914
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4950
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4950
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4950
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
EnzymeCreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/CApi.cpp:591
EnzymeCreateForwardDiff at /home/jgreener/.julia/dev/Enzyme/src/api.jl:168
unknown function (ip: 0x7f9cd004b93a)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
enzyme! at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:3261
unknown function (ip: 0x7f9cd00473e8)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#codegen#518 at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5142
codegen at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:4549 [inlined]
_thunk at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5839
_thunk at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5839 [inlined]
cached_compilation at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5877 [inlined]
#563 at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5943
#JuliaContext#147 at /home/jgreener/.julia/dev/GPUCompiler/src/driver.jl:52
unknown function (ip: 0x7f9cd01e2216)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
JuliaContext at /home/jgreener/.julia/dev/GPUCompiler/src/driver.jl:42
#s2042#562 at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5895 [inlined]
#s2042#562 at ./none:0
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
GeneratedFunctionStub at ./boot.jl:602
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_call_staged at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/method.c:540
ijl_code_for_staged at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/method.c:593
get_staged at ./compiler/utilities.jl:123
retrieve_code_info at ./compiler/utilities.jl:135 [inlined]
InferenceState at ./compiler/inferencestate.jl:430
typeinf_edge at ./compiler/typeinfer.jl:920
abstract_call_method at ./compiler/abstractinterpretation.jl:629
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:95
abstract_call_known at ./compiler/abstractinterpretation.jl:2087
abstract_call at ./compiler/abstractinterpretation.jl:2169
abstract_call at ./compiler/abstractinterpretation.jl:2162
abstract_call at ./compiler/abstractinterpretation.jl:2354
abstract_eval_call at ./compiler/abstractinterpretation.jl:2370
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2380
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2624
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2889
typeinf_local at ./compiler/abstractinterpretation.jl:3098
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3186
_typeinf at ./compiler/typeinfer.jl:247
typeinf at ./compiler/typeinfer.jl:216
typeinf_edge at ./compiler/typeinfer.jl:930
abstract_call_method at ./compiler/abstractinterpretation.jl:629
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:95
abstract_call_known at ./compiler/abstractinterpretation.jl:2087
abstract_call at ./compiler/abstractinterpretation.jl:2169
abstract_apply at ./compiler/abstractinterpretation.jl:1612
abstract_call_known at ./compiler/abstractinterpretation.jl:2004
abstract_call at ./compiler/abstractinterpretation.jl:2169
abstract_call at ./compiler/abstractinterpretation.jl:2162
abstract_call at ./compiler/abstractinterpretation.jl:2354
abstract_eval_call at ./compiler/abstractinterpretation.jl:2370
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2380
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2624
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2913
typeinf_local at ./compiler/abstractinterpretation.jl:3098
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3186
_typeinf at ./compiler/typeinfer.jl:247
typeinf at ./compiler/typeinfer.jl:216
typeinf_ext at ./compiler/typeinfer.jl:1051
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1082
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1078
jfptr_typeinf_ext_toplevel_45276.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_type_infer at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:394
jl_generate_fptr_impl at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jitlayers.cpp:502
jl_compile_method_internal at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2480 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2368
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2886 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:877
eval_body at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:579
eval_body at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:544
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:934
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
eval_user_input at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
repl_backend_loop at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
#start_repl_backend#46 at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
start_repl_backend at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:228
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#run_repl#59 at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
run_repl at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
jfptr_run_repl_91745.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#1013 at ./client.jl:432
jfptr_YY.1013_82712.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:892 [inlined]
invokelatest at ./essentials.jl:889 [inlined]
run_main_repl at ./client.jl:416
exec_options at ./client.jl:333
_start at ./client.jl:552
jfptr__start_82738.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 235658985 (Pool: 235384224; Big: 274761); GC: 112
Segmentation fault (core dumped)

For reverse mode:

autodiff(Reverse, f, Active, Duplicated(x, dx), Const(y))
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
ERROR: Enzyme compilation failed.
Current scope:
; Function Attrs: mustprogress willreturn
define internal fastcc nonnull dereferenceable(16) "enzyme_type"="{[-1]:Pointer, [-1,-1]:Pointer, [-1,8,0]:Pointer, [-1,8,8]:Integer, [-1,8,16]:Pointer}" {} addrspace(10)* @preprocess_julia___910_21700([1 x i32] addrspace(11)* nocapture noundef nonnull readonly align 4 dereferenceable(4) "enzyme_inactive" "enzyme_type"="{[-1]:Pointer, [-1,0]:Integer, [-1,1]:Integer, [-1,2]:Integer, [-1,3]:Integer}" "enzymejl_parmtype"="140596959426144" "enzymejl_parmtype_ref"="1" %0) unnamed_addr #428 !dbg !27759 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = call {}*** @julia.get_pgcstack()
  %3 = bitcast {}*** %2 to {}**
  %4 = getelementptr inbounds {}*, {}** %3, i64 -14
  %5 = getelementptr inbounds {}*, {}** %4, i64 16
  %6 = bitcast {}** %5 to i8**
  %7 = load i8*, i8** %6, align 8
  %8 = call noalias nonnull dereferenceable(16) dereferenceable_or_null(16) {} addrspace(10)* @julia.gc_alloc_obj({}** %4, i64 16, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140595685148560 to {}*) to {} addrspace(10)*)), !enzyme_fromstack !409
  call void @zeroType.264({} addrspace(10)* %8, i8 0, i64 16), !enzyme_zerostack !374
  %9 = bitcast {} addrspace(10)* %8 to [2 x {} addrspace(10)*] addrspace(10)*, !enzyme_caststack !374
  %10 = bitcast {}*** %1 to {}**
  %11 = getelementptr inbounds {}*, {}** %10, i64 -14
  %12 = getelementptr inbounds {}*, {}** %11, i64 16
  %13 = bitcast {}** %12 to i8**
  %14 = load i8*, i8** %13, align 8
  %15 = call noalias nonnull dereferenceable(8) dereferenceable_or_null(8) {} addrspace(10)* @julia.gc_alloc_obj({}** %11, i64 8, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140595682882576 to {}*) to {} addrspace(10)*)), !enzyme_fromstack !409
  call void @zeroType.265({} addrspace(10)* %15, i8 0, i64 8), !enzyme_zerostack !374
  %16 = bitcast {} addrspace(10)* %15 to [1 x {} addrspace(10)*] addrspace(10)*, !enzyme_caststack !374
  %17 = call {}*** @julia.get_pgcstack() #451
  %current_task119 = getelementptr inbounds {}**, {}*** %17, i64 -14
  %current_task1 = bitcast {}*** %current_task119 to {}**
  %ptls_field20 = getelementptr inbounds {}**, {}*** %17, i64 2
  %18 = bitcast {}*** %ptls_field20 to i64***
  %ptls_load2122 = load i64**, i64*** %18, align 8, !tbaa !375
  %19 = getelementptr inbounds i64*, i64** %ptls_load2122, i64 2
  %safepoint = load i64*, i64** %19, align 8, !tbaa !379
  fence syncscope("singlethread") seq_cst
  call void @julia.safepoint(i64* %safepoint) #451, !dbg !27760
  fence syncscope("singlethread") seq_cst
  %20 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(11)* %0, i64 0, i64 0, !dbg !27761
  %unbox = load i32, i32 addrspace(11)* %20, align 4, !dbg !27765, !tbaa !379, !alias.scope !585, !noalias !586
  %21 = call fastcc nonnull {} addrspace(10)* @julia__ntuple_21712() #451, !dbg !27767
  %box = call noalias nonnull dereferenceable(4) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 4, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972381280 to {}*) to {} addrspace(10)*)) #452, !dbg !27764
  %22 = bitcast {} addrspace(10)* %box to i32 addrspace(10)*, !dbg !27764
  store i32 1, i32 addrspace(10)* %22, align 8, !dbg !27764, !tbaa !445, !alias.scope !395, !noalias !27768
  %box4 = call noalias nonnull dereferenceable(4) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 4, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972381008 to {}*) to {} addrspace(10)*)) #452, !dbg !27764
  %23 = bitcast {} addrspace(10)* %box4 to i32 addrspace(10)*, !dbg !27764
  store i32 0, i32 addrspace(10)* %23, align 8, !dbg !27764, !tbaa !445, !alias.scope !395, !noalias !27768
  %box6 = call noalias nonnull dereferenceable(8) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 8, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972383920 to {}*) to {} addrspace(10)*)) #452, !dbg !27764
  %24 = bitcast {} addrspace(10)* %box6 to i8 addrspace(10)*, !dbg !27764
  %newstruct.sroa.0.0..sroa_cast = bitcast {} addrspace(10)* %box6 to i32 addrspace(10)*, !dbg !27764
  store i32 1, i32 addrspace(10)* %newstruct.sroa.0.0..sroa_cast, align 8, !dbg !27764, !tbaa !489, !alias.scope !490, !noalias !27771
  %newstruct.sroa.2.0..sroa_idx = getelementptr inbounds i8, i8 addrspace(10)* %24, i64 4, !dbg !27764
  %newstruct.sroa.2.0..sroa_cast = bitcast i8 addrspace(10)* %newstruct.sroa.2.0..sroa_idx to i32 addrspace(10)*, !dbg !27764
  store i32 %unbox, i32 addrspace(10)* %newstruct.sroa.2.0..sroa_cast, align 4, !dbg !27764, !tbaa !489, !alias.scope !490, !noalias !27771
  %25 = call noalias nonnull "enzyme_inactive" {} addrspace(10)* @ijl_box_int64(i64 noundef signext 0) #453, !dbg !27764
  %26 = call nonnull {} addrspace(10)* ({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)*, {} addrspace(10)*, ...) @julia.call({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)* noundef nonnull @ijl_apply_generic, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972383392 to {}*) to {} addrspace(10)*), {} addrspace(10)* nofree nonnull %box, {} addrspace(10)* nofree nonnull %box4, {} addrspace(10)* nofree nonnull %box6, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604561424448 to {}*) to {} addrspace(10)*), {} addrspace(10)* nonnull %25, {} addrspace(10)* nonnull %21) #454, !dbg !27764
  %newstruct8 = call noalias nonnull dereferenceable(88) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 88, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972389216 to {}*) to {} addrspace(10)*)) #452, !dbg !27772
  %27 = addrspacecast {} addrspace(10)* %newstruct8 to i8 addrspace(11)*, !dbg !27772
  %28 = addrspacecast {} addrspace(10)* %26 to i8 addrspace(11)*, !dbg !27772
  call void @llvm.memcpy.p11i8.p11i8.i64(i8 addrspace(11)* noundef align 8 dereferenceable(88) %27, i8 addrspace(11)* noundef align 1 dereferenceable(88) %28, i64 noundef 88, i1 noundef false) #451, !dbg !27772, !tbaa !392, !alias.scope !395, !noalias !27768
  %newstruct10 = call noalias nonnull dereferenceable(8) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 8, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972388992 to {}*) to {} addrspace(10)*)) #452, !dbg !27775
  %.fca.0.gep18 = getelementptr [2 x {} addrspace(10)*], [2 x {} addrspace(10)*] addrspace(10)* %9, i64 0, i64 0, !dbg !27778
  store {} addrspace(10)* %newstruct10, {} addrspace(10)* addrspace(10)* %.fca.0.gep18, align 8, !dbg !27778, !noalias !27781
  call void ({} addrspace(10)*, ...) @julia.write_barrier({} addrspace(10)* %8, {} addrspace(10)* %newstruct10), !dbg !27778
  %.fca.1.gep = getelementptr [2 x {} addrspace(10)*], [2 x {} addrspace(10)*] addrspace(10)* %9, i64 0, i64 1, !dbg !27778
  store {} addrspace(10)* %newstruct8, {} addrspace(10)* addrspace(10)* %.fca.1.gep, align 8, !dbg !27778, !noalias !27781
  call void ({} addrspace(10)*, ...) @julia.write_barrier({} addrspace(10)* %8, {} addrspace(10)* %newstruct8), !dbg !27778
  %29 = addrspacecast [2 x {} addrspace(10)*] addrspace(10)* %9 to [2 x {} addrspace(10)*] addrspace(11)*, !dbg !27778
  %30 = call fastcc i32 @julia__395_21708([2 x {} addrspace(10)*] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(16) %29) #451, !dbg !27778
  %31 = icmp eq i32 %30, 0, !dbg !27782
  br i1 %31, label %L32, label %L28, !dbg !27785

L28:                                              ; preds = %top
  call fastcc void @julia_throw_api_error_20396(i32 zeroext %30) #455, !dbg !27786
  unreachable, !dbg !27786

L32:                                              ; preds = %top
  %newstruct12 = call noalias nonnull dereferenceable(8) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 8, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972386592 to {}*) to {} addrspace(10)*)) #452, !dbg !27787
  %.fca.0.gep = getelementptr [1 x {} addrspace(10)*], [1 x {} addrspace(10)*] addrspace(10)* %16, i64 0, i64 0, !dbg !27791
  store {} addrspace(10)* %newstruct12, {} addrspace(10)* addrspace(10)* %.fca.0.gep, align 8, !dbg !27791, !noalias !27781
  call void ({} addrspace(10)*, ...) @julia.write_barrier({} addrspace(10)* %15, {} addrspace(10)* %newstruct12), !dbg !27791
  %32 = addrspacecast [1 x {} addrspace(10)*] addrspace(10)* %16 to [1 x {} addrspace(10)*] addrspace(11)*, !dbg !27791
  call fastcc void @julia_check_20440([1 x {} addrspace(10)*] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(8) %32) #451, !dbg !27791
  %33 = addrspacecast {} addrspace(10)* %newstruct12 to i64 addrspace(11)*, !dbg !27793
  %34 = load i64, i64 addrspace(11)* %33, align 8, !dbg !27793, !tbaa !420, !alias.scope !395, !noalias !398
  %.not = icmp eq i64 %34, 0, !dbg !27796
  br i1 %.not, label %L39, label %L41, !dbg !27795

L39:                                              ; preds = %L32
  call void @ijl_throw({} addrspace(12)* noundef addrspacecast ({}* inttoptr (i64 140604626762208 to {}*) to {} addrspace(12)*)) #455, !dbg !27795
  unreachable, !dbg !27795

L41:                                              ; preds = %L32
  %35 = call fastcc nonnull {} addrspace(10)* @julia_UniqueCuContext_20482(i64 zeroext %34) #451, !dbg !27798
  %36 = addrspacecast {} addrspace(10)* %newstruct10 to i64 addrspace(11)*, !dbg !27799
  %37 = load i64, i64 addrspace(11)* %36, align 8, !dbg !27799, !tbaa !420, !alias.scope !395, !noalias !398
  %newstruct15 = call noalias nonnull dereferenceable(16) {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 16, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596959507904 to {}*) to {} addrspace(10)*)) #452, !dbg !27801
  %38 = addrspacecast {} addrspace(10)* %newstruct15 to {} addrspace(10)* addrspace(11)*, !dbg !27801
  %39 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(11)* %38, i64 1, !dbg !27801
  store {} addrspace(10)* null, {} addrspace(10)* addrspace(11)* %39, align 8, !dbg !27801, !tbaa !420, !alias.scope !395, !noalias !27768
  %40 = addrspacecast {} addrspace(10)* %newstruct15 to i64 addrspace(11)*, !dbg !27801
  store i64 %37, i64 addrspace(11)* %40, align 8, !dbg !27801, !tbaa !420, !alias.scope !395, !noalias !27768
  %41 = addrspacecast {} addrspace(10)* %newstruct15 to i8 addrspace(11)*, !dbg !27801
  %42 = getelementptr inbounds i8, i8 addrspace(11)* %41, i64 8, !dbg !27801
  %43 = bitcast i8 addrspace(11)* %42 to {} addrspace(10)* addrspace(11)*, !dbg !27801
  store atomic {} addrspace(10)* %35, {} addrspace(10)* addrspace(11)* %43 release, align 8, !dbg !27801, !tbaa !420, !alias.scope !395, !noalias !27768
  ret {} addrspace(10)* %newstruct15, !dbg !27801
}

Illegal replace ficticious phi for:   %_replacementA14 = phi {} addrspace(10)* , !dbg !390 of   %21 = call fastcc nonnull {} addrspace(10)* @julia__ntuple_21712() #451, !dbg !406
; Function Attrs: mustprogress willreturn
define internal fastcc nonnull dereferenceable(16) "enzyme_type"="{[-1]:Pointer, [-1,-1]:Pointer, [-1,8,0]:Pointer, [-1,8,8]:Integer, [-1,8,16]:Pointer}" void @diffejulia___910_21700([1 x i32] addrspace(11)* nocapture readonly align 4 dereferenceable(4) "enzyme_inactive" "enzyme_type"="{[-1]:Pointer, [-1,0]:Integer, [-1,1]:Integer, [-1,2]:Integer, [-1,3]:Integer}" "enzymejl_parmtype"="140596959426144" "enzymejl_parmtype_ref"="1" %0, { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg) unnamed_addr #428 !dbg !32267 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = call {}*** @julia.get_pgcstack()
  %_replacementA31 = phi {}**
  %_replacementA30 = phi {}**
  %_replacementA29 = phi {}**
  %_replacementA28 = phi i8**
  %_replacementA27 = phi i8*
  %_replacementA26 = phi {} addrspace(10)*
  %_replacementA25 = phi [2 x {} addrspace(10)*] addrspace(10)*
  %_replacementA24 = phi {}**
  %_replacementA23 = phi {}**
  %_replacementA22 = phi {}**
  %_replacementA21 = phi i8**
  %_replacementA20 = phi i8*
  %_replacementA19 = phi {} addrspace(10)*
  %_replacementA18 = phi [1 x {} addrspace(10)*] addrspace(10)*
  %3 = call {}*** @julia.get_pgcstack() #451
  %current_task119 = getelementptr inbounds {}**, {}*** %3, i64 -14
  %current_task1 = bitcast {}*** %current_task119 to {}**
  %ptls_field20_replacementA = phi {}***
  %_replacementA17 = phi i64***
  %ptls_load2122_replacementA = phi i64**
  %_replacementA16 = phi i64**
  %safepoint_replacementA = phi i64*
  %_replacementA15 = phi i32 addrspace(11)* , !dbg !32268
  %unbox_replacementA = phi i32 , !dbg !32272
  %_replacementA14 = phi {} addrspace(10)* , !dbg !32274
  %box_replacementA = phi {} addrspace(10)* , !dbg !32271
  %_replacementA13 = phi i32 addrspace(10)* , !dbg !32271
  %box4_replacementA = phi {} addrspace(10)* , !dbg !32271
  %_replacementA12 = phi i32 addrspace(10)* , !dbg !32271
  %box6_replacementA = phi {} addrspace(10)* , !dbg !32271
  %_replacementA11 = phi i8 addrspace(10)* , !dbg !32271
  %newstruct.sroa.0.0..sroa_cast_replacementA = phi i32 addrspace(10)* , !dbg !32271
  %newstruct.sroa.2.0..sroa_idx_replacementA = phi i8 addrspace(10)* , !dbg !32271
  %newstruct.sroa.2.0..sroa_cast_replacementA = phi i32 addrspace(10)* , !dbg !32271
  %_replacementA10 = phi {} addrspace(10)* , !dbg !32271
  %4 = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 2, !dbg !32271
  %_replacementA9 = phi {} addrspace(10)* , !dbg !32271
  %_replacementA8 = phi i8 addrspace(11)* , !dbg !32275
  %_replacementA7 = phi i8 addrspace(11)* , !dbg !32275
  %newstruct10_replacementA = phi {} addrspace(10)* , !dbg !32278
  %.fca.0.gep18_replacementA = phi {} addrspace(10)* addrspace(10)* , !dbg !32281
  %.fca.1.gep_replacementA = phi {} addrspace(10)* addrspace(10)* , !dbg !32281
  %_replacementA6 = phi [2 x {} addrspace(10)*] addrspace(11)* , !dbg !32281
  %_replacementA5 = phi i32 , !dbg !32281
  %_replacementA = phi i1 , !dbg !32284
  br i1 true, label %L32, label %L28, !dbg !32287

L28:                                              ; preds = %top
  unreachable

L32:                                              ; preds = %top
  %newstruct12_replacementA = phi {} addrspace(10)* , !dbg !32288
  %.fca.0.gep_replacementA = phi {} addrspace(10)* addrspace(10)* , !dbg !32292
  %_replacementA33 = phi [1 x {} addrspace(10)*] addrspace(11)* , !dbg !32292
  %_replacementA32 = phi i64 addrspace(11)* , !dbg !32294
  %5 = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 7, !dbg !32297
  %.not_replacementA = phi i1 , !dbg !32297
  br i1 false, label %L39, label %L41, !dbg !32296

L39:                                              ; preds = %L32
  unreachable

L41:                                              ; preds = %L32
  %tapeArg42 = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 5, !dbg !32299
  %_replacementA43 = phi {} addrspace(10)* , !dbg !32299
  %"'ip_phi3" = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 6, !dbg !32299
  %_replacementA41 = phi i64 addrspace(11)* , !dbg !32300
  %"newstruct15'mi" = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 4, !dbg !32302
  %newstruct15_replacementA = phi {} addrspace(10)* , !dbg !32302
  %_replacementA39 = phi {} addrspace(10)* addrspace(11)* , !dbg !32302
  %_replacementA38 = phi {} addrspace(10)* addrspace(11)* , !dbg !32302
  %_replacementA37 = phi i64 addrspace(11)* , !dbg !32302
  %_replacementA36 = phi i8 addrspace(11)* , !dbg !32302
  %_replacementA35 = phi i8 addrspace(11)* , !dbg !32302
  %_replacementA34 = phi {} addrspace(10)* addrspace(11)* , !dbg !32302
  br label %invertL41, !dbg !32302

allocsForInversion:                               ; No predecessors!

inverttop:                                        ; preds = %invertL32
  %6 = call {} addrspace(10)* ({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)*, {} addrspace(10)*, ...) @julia.call({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)* @ijl_apply_generic, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140596842791248 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140596298768528 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604603211760 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140595658237672 to {}*) to {} addrspace(10)*), {} addrspace(10)* %4, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140596972383392 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %box_replacementA, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %box4_replacementA, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %box6_replacementA, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604561424448 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %_replacementA10, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %_replacementA14, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*)), !dbg !32271
  fence syncscope("singlethread") seq_cst
  fence syncscope("singlethread") seq_cst
  ret void

invertL28:                                        ; No predecessors!

invertL32:                                        ; preds = %invertL41
  br label %inverttop

invertL39:                                        ; No predecessors!

invertL41:                                        ; preds = %L41
  call fastcc void @diffejulia_UniqueCuContext_20482(i64 zeroext %5, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 } %tapeArg42), !dbg !32299
  br label %invertL32
}

LLVM.CallInst(%21 = call fastcc nonnull {} addrspace(10)* @julia__ntuple_21712() #451, !dbg !406)
LLVM.PHIInst(%_replacementA14 = phi {} addrspace(10)* , !dbg !390)

Stacktrace:
 [1] ntuple
   @ ./ntuple.jl:19
 [2] _
   @ ~/.julia/dev/CUDA/lib/cudadrv/pool.jl:18

Stacktrace:
  [1] julia_error(cstr::Cstring, val::Ptr{…}, errtype::Enzyme.API.ErrorType, data::Ptr{…}, data2::Ptr{…}, B::Ptr{…})
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:1754
  [2] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{…}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{…}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{…}, augmented::Ptr{…}, atomicAdd::Bool)
    @ Enzyme.API ~/.julia/dev/Enzyme/src/api.jl:154
  [3] enzyme!(job::GPUCompiler.CompilerJob{…}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::Tuple{…}, returnPrimal::Bool, expectedTapeType::Type, loweredArgs::Set{…}, boxedArgs::Set{…})
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:3249
  [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob{…}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:5142
  [5] codegen
    @ ~/.julia/dev/Enzyme/src/compiler.jl:4549 [inlined]
  [6] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:5839
  [7] _thunk
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5839 [inlined]
  [8] cached_compilation
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5877 [inlined]
  [9] (::Enzyme.Compiler.var"#563#564"{…})(ctx::LLVM.Context)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:5943
 [10] JuliaContext(f::Enzyme.Compiler.var"#563#564"{…}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:52
 [11] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:42
 [12] #s2042#562
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5895 [inlined]
 [13]
    @ Enzyme.Compiler ./none:0
 [14] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
    @ Core ./boot.jl:602
 [15] autodiff
    @ ~/.julia/dev/Enzyme/src/Enzyme.jl:286 [inlined]
 [16] autodiff(::ReverseMode{false, FFIABI, false}, ::typeof(f), ::Type{Active}, ::Duplicated{CuArray{…}}, ::Const{CuArray{…}})
    @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:303
 [17] top-level scope
    @ REPL[11]:1
 [18] top-level scope
    @ ~/.julia/dev/CUDA/src/initialization.jl:209
Some type information was truncated. Use `show(err)` to see complete types.

wsmoses commented 5 months ago

With debug info:

Cannot create a null constant of that type!
UNREACHABLE executed at /home/wmoses/git/Enzyme.jl/julia10/deps/srccache/llvm-julia-15.0.7-10/llvm/lib/IR/Constants.cpp:374!

[3398190] signal (6.-6): Aborted
in expression starting at /home/wmoses/git/Enzyme.jl/cubc.jl:6
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
llvm_unreachable_internal at /home/wmoses/git/Enzyme.jl/julia10/deps/srccache/llvm-julia-15.0.7-10/llvm/lib/Support/ErrorHandling.cpp:212
getNullValue at /home/wmoses/git/Enzyme.jl/julia10/deps/srccache/llvm-julia-15.0.7-10/llvm/lib/IR/Constants.cpp:374
handleAdjointForIntrinsic at /home/wmoses/.julia/scratchspaces/7cc45869-7501-5eee-bdea-0790c847d4ef/src/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4023
visitIntrinsicInst at /home/wmoses/.julia/scratchspaces/7cc45869-7501-5eee-bdea-0790c847d4ef/src/Enzyme/enzyme/Enzyme/AdjointGenerator.h:3696

wsmoses commented 5 months ago

@jgreener64 the forward mode assertion should no longer fail that way. For some reason a size mismatch error now comes up instead, though.

If you have cycles, some minimization through the broadcast impl would definitely be helpful here.
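
One possible starting point for such a minimization, sketched here on plain CPU `Array`s so it runs without a GPU, is to split `x .+ y` into its separate broadcast stages using `Base.Broadcast.broadcasted`, `instantiate` and `materialize`. The helper names `f_full` and `f_axes` are hypothetical, and swapping in `CuArray`s to narrow things down further is an assumption that has not been checked here.

```julia
# A minimal sketch: split `sum(x .+ y)` into its broadcast stages so that
# each stage can be run (or differentiated) in isolation.
using Base.Broadcast: broadcasted, instantiate, materialize

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.5, 0.5, 0.5, 0.5, 0.5]

bc  = broadcasted(+, x, y)   # lazy Broadcasted object, nothing computed yet
bci = instantiate(bc)        # computes and checks the common axes
z   = materialize(bc)        # allocates the result and fills it in

# Hypothetical reduced test functions: the full pipeline, and only the
# axis computation that `instantiate` performs.
f_full(x, y) = sum(x .+ y)
f_axes(x, y) = length(axes(instantiate(broadcasted(+, x, y)))[1])
```

If the stage-by-stage versions behave differently under `autodiff`, that would point at which part of the broadcast machinery to look into.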

jgreener64 commented 5 months ago

The forward mode error is

ERROR: DimensionMismatch: arrays could not be broadcast to a common size; got a dimension with lengths 5 and 5
Stacktrace:
  [1] _bcs1
    @ ./broadcast.jl:555 [inlined]
  [2] _bcs
    @ ./broadcast.jl:549 [inlined]
  [3] broadcast_shape
    @ ./broadcast.jl:543 [inlined]
  [4] combine_axes
    @ ./broadcast.jl:524 [inlined]
  [5] instantiate
    @ ./broadcast.jl:306 [inlined]
  [6] materialize
    @ ./broadcast.jl:903 [inlined]
  [7] f
    @ ./REPL[3]:1 [inlined]
  [8] fwddiffejulia_f_4514wrap
    @ ./REPL[3]:0
  [9] macro expansion
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5916 [inlined]
 [10] enzyme_call
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5566 [inlined]
 [11] ForwardModeThunk
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5446 [inlined]
 [12] autodiff
    @ ~/.julia/dev/Enzyme/src/Enzyme.jl:399 [inlined]
 [13] autodiff(::ForwardMode{FFIABI}, ::typeof(f), ::Type{Duplicated}, ::Duplicated{CuArray{…}}, ::Const{CuArray{…}})
    @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:303
 [14] top-level scope
    @ REPL[9]:1
 [15] top-level scope
    @ ~/.julia/dev/CUDA/src/initialization.jl:209
Some type information was truncated. Use `show(err)` to see complete types.

which certainly seems a strange one. Adding

println("a ", typeof(a), " ", a, " b ", typeof(b), " ", b, " ", a == b)

to the start of Base.Broadcast._bcs1(a, b) (https://github.com/JuliaLang/julia/blob/0b4590a5507d3f3046e5bafc007cacbbfc9b310b/base/broadcast.jl#L555), where a and b are the dimension sizes of the broadcasted arrays, gives the following for the primal function:

f(x, y)
a Base.OneTo{Int64} Base.OneTo(5) b Base.OneTo{Int64} Base.OneTo(5) true
a Base.OneTo{Int64} Base.OneTo(1) b Base.OneTo{Int64} Base.OneTo(1) true
a Base.OneTo{Int64} Base.OneTo(1) b Base.OneTo{Int64} Base.OneTo(1) true
5.617565028176828

and this for the gradient:

autodiff(Forward, f, Duplicated, Duplicated(x, dx), Const(y))
a Base.OneTo{Int64} Base.OneTo(5) b Base.OneTo{Int64} Base.OneTo(5) true
[error as above]

The error is effectively thrown when a != b, but the failure case doesn't seem to print. I wonder if Enzyme does some conversion of the types of the dimension sizes such that a == b no longer holds.

I also tried Infiltrator.jl and Debugger.jl but didn't have much luck.
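
To make the puzzle concrete, here is a simplified stand-in for the two-axis compatibility check described above. `check_axes` is a hypothetical name and this is not the code in Base's `broadcast.jl`, only a sketch of the same idea: equal axes pass, singleton axes get expanded, and anything else throws.

```julia
# Hypothetical, simplified stand-in for the per-dimension check in Base's
# broadcast.jl; `check_axes` is not a Base function.
function check_axes(a::AbstractUnitRange, b::AbstractUnitRange)
    a == b && return a            # identical axes are always compatible
    length(a) == 1 && return b    # `a` is a singleton and gets expanded
    length(b) == 1 && return a    # `b` is a singleton and gets expanded
    throw(DimensionMismatch(
        "arrays could not be broadcast to a common size; " *
        "got a dimension with lengths $(length(a)) and $(length(b))"))
end

same     = check_axes(Base.OneTo(5), Base.OneTo(5))   # equal axes
expanded = check_axes(Base.OneTo(1), Base.OneTo(5))   # singleton expansion
```

In this sketch a message reporting "lengths 5 and 5" can only be produced if `a == b` evaluated to `false` for two ranges of equal length, which is what makes the error above look so odd.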

wsmoses commented 5 months ago

@jgreener64 is that still the case? I thought that should now be fixed on main.

jgreener64 commented 5 months ago

Still the case for me on 21b0762d with CUDA 5.3.4 and Julia 1.10.3.

roflmaostc commented 3 months ago

Is broadcasting with CUDA+Enzyme still a general issue or not?

I've got some multidimensional code which works on the CPU but not with CUDA.

jgreener64 commented 3 months ago

It doesn't work at the minute from what I can tell.

With Enzyme 0.12.25, CUDA 5.4.2 and Julia 1.10.3 the forward mode error above has changed to:

warning found shared memory
ERROR: a Enzyme compilation failed.
Current scope: 
; Function Attrs: mustprogress willreturn
define void @preprocess_julia_partial_mapreduce_grid_48233_inner7(double "enzyme_type"="{[-1]:Float@double}" "enzymejl_parmtype"="140238631732192" "enzymejl_parmtype_ref"="0" %0, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %1, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140233465998160" "enzymejl_parmtype_ref"="0" %3, { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140238766152592" "enzymejl_parmtype_ref"="0" %4) local_unnamed_addr #12 !dbg !484 {
entry:
  %5 = alloca [2 x i64], align 8
  %.fca.0.0.0.extract13 = extractvalue [1 x [1 x [1 x i64]]] %1, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract9 = extractvalue [1 x [1 x [1 x i64]]] %2, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0, !dbg !485
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 2, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 0, !dbg !485
  %.fca.0.0.2.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 2, 0, !dbg !485, !enzyme_inactive !12
  %6 = bitcast [2 x i64]* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 16, i8* noundef nonnull align 8 dereferenceable(16) %6) #13
  %7 = call {}*** @julia.get_pgcstack() #13
  %8 = icmp sgt i64 %.fca.0.0.0.extract9, 0, !dbg !486
  call void @llvm.assume(i1 noundef %8) #13, !dbg !490
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #13, !dbg !491, !range !117
  %10 = add nuw nsw i32 %9, 1, !dbg !497
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #13, !dbg !498, !range !128
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #13, !dbg !504, !range !137
  %13 = add nuw nsw i32 %12, 1, !dbg !510
  call fastcc void @julia_fldmod1_48281([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %5, i32 signext %13, i64 signext %.fca.0.0.0.extract9) #13, !dbg !511
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #13, !dbg !512, !range !148
  %15 = zext i32 %14 to i64, !dbg !518
  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 1, !dbg !525
  %17 = udiv i64 %15, %.fca.0.0.0.extract9, !dbg !526
  %unbox3.i = load i64, i64* %16, align 8, !dbg !527, !tbaa !170, !alias.scope !174, !noalias !177
  %.not = icmp sgt i64 %unbox3.i, %.fca.0.0.0.extract9, !dbg !527
  br i1 %.not, label %julia_partial_mapreduce_grid_48233_inner.exit, label %L49.i, !dbg !529

L49.i:                                            ; preds = %entry
  %18 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 0, !dbg !530
  %19 = fadd double %0, %0, !dbg !532
  %unbox11.i = load i64, i64* %18, align 8, !dbg !535, !tbaa !170, !alias.scope !174, !noalias !177
  %20 = add i64 %unbox11.i, -1, !dbg !535
  %21 = zext i32 %11 to i64, !dbg !537
  %22 = mul i64 %20, %21, !dbg !539
  %23 = zext i32 %10 to i64, !dbg !541
  %24 = add i64 %22, %23, !dbg !543
  %.not3647 = icmp sgt i64 %24, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not3647, label %L203.i, label %L97.i.lr.ph, !dbg !547

L97.i.lr.ph:                                      ; preds = %L49.i
  %25 = call i64 @llvm.smax.i64(i64 %.fca.0.0.2.0.extract, i64 noundef 0) #13, !dbg !485
  %.not38 = icmp eq i64 %25, 1
  %26 = bitcast i8 addrspace(1)* %.fca.0.0.0.extract to double addrspace(1)*
  %27 = mul nuw nsw i64 %17, %21
  %28 = mul i64 %17, %21, !dbg !548
  br label %L97.i, !dbg !547

L97.i:                                            ; preds = %L97.i, %L97.i.lr.ph
  %iv = phi i64 [ %iv.next, %L97.i ], [ 0, %L97.i.lr.ph ]
  %value_phi12.i49 = phi double [ %19, %L97.i.lr.ph ], [ %35, %L97.i ]
  %29 = mul i64 %28, %iv, !dbg !548
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !548
  %30 = add i64 %24, %29, !dbg !548
  %31 = call i64 @llvm.smax.i64(i64 %30, i64 %unbox3.i) #13, !dbg !548
  %.op = add i64 %31, -1, !dbg !553
  %32 = select i1 %.not38, i64 0, i64 %.op, !dbg !553
  %33 = getelementptr inbounds double, double addrspace(1)* %26, i64 %32, !dbg !568
  %34 = load double, double addrspace(1)* %33, align 8, !dbg !568, !tbaa !248
  %35 = fadd double %value_phi12.i49, %34, !dbg !569
  %36 = add i64 %30, %27, !dbg !571
  %.not36 = icmp sgt i64 %36, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not36, label %L203.i.loopexit, label %L97.i, !dbg !547

L203.i.loopexit:                                  ; preds = %L97.i
  br label %L203.i, !dbg !573

L203.i:                                           ; preds = %L203.i.loopexit, %L49.i
  %value_phi12.i.lcssa = phi double [ %19, %L49.i ], [ %35, %L203.i.loopexit ]
  %37 = call i32 @llvm.nvvm.read.ptx.sreg.warpsize() #13, !dbg !573
  %38 = icmp eq i32 %37, 32, !dbg !576
  call void @llvm.assume(i1 noundef %38) #13, !dbg !578
  %39 = call fastcc [2 x i32] @julia_fldmod1_48268(i32 signext %10) #14, !dbg !579
  %.fca.0.extract14 = extractvalue [2 x i32] %39, 0, !dbg !579
  %bitcast_coercion.i = bitcast double %value_phi12.i.lcssa to i64, !dbg !580
  %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !586
  %41 = trunc i64 %40 to i32, !dbg !589
  %42 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %41, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %43 = zext i32 %42 to i64, !dbg !592
  %44 = shl nuw i64 %43, 32, !dbg !594
  %45 = trunc i64 %bitcast_coercion.i to i32, !dbg !589
  %46 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %45, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %47 = zext i32 %46 to i64, !dbg !596
  %48 = or i64 %44, %47, !dbg !601
  %bitcast_coercion35.i = bitcast i64 %48 to double, !dbg !580
  %49 = fadd double %value_phi12.i.lcssa, %bitcast_coercion35.i, !dbg !603
  %bitcast_coercion.i.1 = bitcast double %49 to i64, !dbg !580
  %50 = lshr i64 %bitcast_coercion.i.1, 32, !dbg !586
  %51 = trunc i64 %50 to i32, !dbg !589
  %52 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %51, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %53 = zext i32 %52 to i64, !dbg !592
  %54 = shl nuw i64 %53, 32, !dbg !594
  %55 = trunc i64 %bitcast_coercion.i.1 to i32, !dbg !589
  %56 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %55, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %57 = zext i32 %56 to i64, !dbg !596
  %58 = or i64 %54, %57, !dbg !601
  %bitcast_coercion35.i.1 = bitcast i64 %58 to double, !dbg !580
  %59 = fadd double %49, %bitcast_coercion35.i.1, !dbg !603
  %bitcast_coercion.i.2 = bitcast double %59 to i64, !dbg !580
  %60 = lshr i64 %bitcast_coercion.i.2, 32, !dbg !586
  %61 = trunc i64 %60 to i32, !dbg !589
  %62 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %61, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %63 = zext i32 %62 to i64, !dbg !592
  %64 = shl nuw i64 %63, 32, !dbg !594
  %65 = trunc i64 %bitcast_coercion.i.2 to i32, !dbg !589
  %66 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %65, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %67 = zext i32 %66 to i64, !dbg !596
  %68 = or i64 %64, %67, !dbg !601
  %bitcast_coercion35.i.2 = bitcast i64 %68 to double, !dbg !580
  %69 = fadd double %59, %bitcast_coercion35.i.2, !dbg !603
  %bitcast_coercion.i.3 = bitcast double %69 to i64, !dbg !580
  %70 = lshr i64 %bitcast_coercion.i.3, 32, !dbg !586
  %71 = trunc i64 %70 to i32, !dbg !589
  %72 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %71, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %73 = zext i32 %72 to i64, !dbg !592
  %74 = shl nuw i64 %73, 32, !dbg !594
  %75 = trunc i64 %bitcast_coercion.i.3 to i32, !dbg !589
  %76 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %75, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %77 = zext i32 %76 to i64, !dbg !596
  %78 = or i64 %74, %77, !dbg !601
  %bitcast_coercion35.i.3 = bitcast i64 %78 to double, !dbg !580
  %79 = fadd double %69, %bitcast_coercion35.i.3, !dbg !603
  %bitcast_coercion.i.4 = bitcast double %79 to i64, !dbg !580
  %80 = lshr i64 %bitcast_coercion.i.4, 32, !dbg !586
  %81 = trunc i64 %80 to i32, !dbg !589
  %82 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %81, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %83 = trunc i64 %bitcast_coercion.i.4 to i32, !dbg !589
  %84 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %83, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %.fca.1.extract15 = extractvalue [2 x i32] %39, 1, !dbg !579
  %.not39 = icmp eq i32 %.fca.1.extract15, 1, !dbg !605
  br i1 %.not39, label %L254.i, label %L273.i, !dbg !608

L254.i:                                           ; preds = %L203.i
  %85 = zext i32 %82 to i64, !dbg !592
  %86 = shl nuw i64 %85, 32, !dbg !594
  %87 = zext i32 %84 to i64, !dbg !596
  %88 = or i64 %86, %87, !dbg !601
  %bitcast_coercion35.i.4 = bitcast i64 %88 to double, !dbg !580
  %89 = fadd double %79, %bitcast_coercion35.i.4, !dbg !603
  %90 = add i32 %.fca.0.extract14, -1, !dbg !609
  %91 = sext i32 %90 to i64, !dbg !618
  %92 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %91, !dbg !618
  store double %89, double addrspace(3)* %92, align 8, !dbg !618, !tbaa !320, !noalias !619
  br label %L273.i, !dbg !622

L273.i:                                           ; preds = %L254.i, %L203.i
  call void @llvm.nvvm.barrier0() #13, !dbg !623
  %93 = lshr i32 %11, 5, !dbg !625
  %94 = and i32 %11, 992, !dbg !628
  %95 = icmp ne i32 %94, %11, !dbg !630
  %96 = zext i1 %95 to i32, !dbg !632
  %97 = add nuw nsw i32 %93, %96, !dbg !637
  %.not40.not = icmp ult i32 %9, %97, !dbg !639
  br i1 %.not40.not, label %L302.i, label %L322.i, !dbg !640

L302.i:                                           ; preds = %L273.i
  %98 = add i32 %.fca.1.extract15, -1, !dbg !641
  %99 = sext i32 %98 to i64, !dbg !650
  %100 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %99, !dbg !650
  %101 = load double, double addrspace(3)* %100, align 8, !dbg !650, !tbaa !320
  br label %L322.i, !dbg !650

L322.i:                                           ; preds = %L302.i, %L273.i
  %value_phi46.i = phi double [ %101, %L302.i ], [ %0, %L273.i ]
  %.not41 = icmp eq i32 %.fca.0.extract14, 1, !dbg !651
  br i1 %.not41, label %L331.i.preheader, label %L357.i, !dbg !654

L331.i.preheader:                                 ; preds = %L322.i
  %bitcast_coercion50.i = bitcast double %value_phi46.i to i64, !dbg !655
  %102 = lshr i64 %bitcast_coercion50.i, 32, !dbg !661
  %103 = trunc i64 %102 to i32, !dbg !664
  %104 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %103, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %105 = zext i32 %104 to i64, !dbg !667
  %106 = shl nuw i64 %105, 32, !dbg !669
  %107 = trunc i64 %bitcast_coercion50.i to i32, !dbg !664
  %108 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %107, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %109 = zext i32 %108 to i64, !dbg !671
  %110 = or i64 %106, %109, !dbg !676
  %bitcast_coercion51.i = bitcast i64 %110 to double, !dbg !655
  %111 = fadd double %value_phi46.i, %bitcast_coercion51.i, !dbg !678
  %bitcast_coercion50.i.1 = bitcast double %111 to i64, !dbg !655
  %112 = lshr i64 %bitcast_coercion50.i.1, 32, !dbg !661
  %113 = trunc i64 %112 to i32, !dbg !664
  %114 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %113, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %115 = zext i32 %114 to i64, !dbg !667
  %116 = shl nuw i64 %115, 32, !dbg !669
  %117 = trunc i64 %bitcast_coercion50.i.1 to i32, !dbg !664
  %118 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %117, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %119 = zext i32 %118 to i64, !dbg !671
  %120 = or i64 %116, %119, !dbg !676
  %bitcast_coercion51.i.1 = bitcast i64 %120 to double, !dbg !655
  %121 = fadd double %111, %bitcast_coercion51.i.1, !dbg !678
  %bitcast_coercion50.i.2 = bitcast double %121 to i64, !dbg !655
  %122 = lshr i64 %bitcast_coercion50.i.2, 32, !dbg !661
  %123 = trunc i64 %122 to i32, !dbg !664
  %124 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %123, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %125 = zext i32 %124 to i64, !dbg !667
  %126 = shl nuw i64 %125, 32, !dbg !669
  %127 = trunc i64 %bitcast_coercion50.i.2 to i32, !dbg !664
  %128 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %127, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %129 = zext i32 %128 to i64, !dbg !671
  %130 = or i64 %126, %129, !dbg !676
  %bitcast_coercion51.i.2 = bitcast i64 %130 to double, !dbg !655
  %131 = fadd double %121, %bitcast_coercion51.i.2, !dbg !678
  %bitcast_coercion50.i.3 = bitcast double %131 to i64, !dbg !655
  %132 = lshr i64 %bitcast_coercion50.i.3, 32, !dbg !661
  %133 = trunc i64 %132 to i32, !dbg !664
  %134 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %133, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %135 = zext i32 %134 to i64, !dbg !667
  %136 = shl nuw i64 %135, 32, !dbg !669
  %137 = trunc i64 %bitcast_coercion50.i.3 to i32, !dbg !664
  %138 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %137, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %139 = zext i32 %138 to i64, !dbg !671
  %140 = or i64 %136, %139, !dbg !676
  %bitcast_coercion51.i.3 = bitcast i64 %140 to double, !dbg !655
  %141 = fadd double %131, %bitcast_coercion51.i.3, !dbg !678
  %bitcast_coercion50.i.4 = bitcast double %141 to i64, !dbg !655
  %142 = lshr i64 %bitcast_coercion50.i.4, 32, !dbg !661
  %143 = trunc i64 %142 to i32, !dbg !664
  %144 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %143, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %145 = zext i32 %144 to i64, !dbg !667
  %146 = shl nuw i64 %145, 32, !dbg !669
  %147 = trunc i64 %bitcast_coercion50.i.4 to i32, !dbg !664
  %148 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %147, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %149 = zext i32 %148 to i64, !dbg !671
  %150 = or i64 %146, %149, !dbg !676
  %bitcast_coercion51.i.4 = bitcast i64 %150 to double, !dbg !655
  %151 = fadd double %141, %bitcast_coercion51.i.4, !dbg !678
  br label %L357.i, !dbg !680

L357.i:                                           ; preds = %L331.i.preheader, %L322.i
  %value_phi52.i = phi double [ %value_phi46.i, %L322.i ], [ %151, %L331.i.preheader ]
  %.not42 = icmp eq i32 %9, 0, !dbg !680
  br i1 %.not42, label %L362.i, label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !683

L362.i:                                           ; preds = %L357.i
  %152 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #13, !dbg !684
  %153 = mul i64 %20, %152, !dbg !695
  %154 = add i64 %unbox3.i, -1, !dbg !699
  %155 = add i64 %154, %153, !dbg !700
  %156 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*, !dbg !708
  %157 = getelementptr inbounds double, double addrspace(1)* %156, i64 %155, !dbg !708
  store double %value_phi52.i, double addrspace(1)* %157, align 8, !dbg !708, !tbaa !248, !noalias !619
  br label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !709

julia_partial_mapreduce_grid_48233_inner.exit:    ; preds = %L362.i, %L357.i, %entry
  call void @llvm.lifetime.end.p0i8(i64 noundef 16, i8* noundef nonnull %6) #13, !dbg !710
  ret void, !dbg !485
}

; Function Attrs: mustprogress willreturn
define void @preprocess_julia_partial_mapreduce_grid_48233_inner7(double "enzyme_type"="{[-1]:Float@double}" "enzymejl_parmtype"="140238631732192" "enzymejl_parmtype_ref"="0" %0, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %1, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140233465998160" "enzymejl_parmtype_ref"="0" %3, { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140238766152592" "enzymejl_parmtype_ref"="0" %4) local_unnamed_addr #12 !dbg !484 {
entry:
  %5 = alloca [2 x i64], align 8
  %.fca.0.0.0.extract13 = extractvalue [1 x [1 x [1 x i64]]] %1, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract9 = extractvalue [1 x [1 x [1 x i64]]] %2, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0, !dbg !485
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 2, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 0, !dbg !485
  %.fca.0.0.2.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 2, 0, !dbg !485, !enzyme_inactive !12
  %6 = bitcast [2 x i64]* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 16, i8* noundef nonnull align 8 dereferenceable(16) %6) #13
  %7 = call {}*** @julia.get_pgcstack() #13
  %8 = icmp sgt i64 %.fca.0.0.0.extract9, 0, !dbg !486
  call void @llvm.assume(i1 noundef %8) #13, !dbg !490
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #13, !dbg !491, !range !117
  %10 = add nuw nsw i32 %9, 1, !dbg !497
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #13, !dbg !498, !range !128
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #13, !dbg !504, !range !137
  %13 = add nuw nsw i32 %12, 1, !dbg !510
  call fastcc void @julia_fldmod1_48281([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %5, i32 signext %13, i64 signext %.fca.0.0.0.extract9) #13, !dbg !511
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #13, !dbg !512, !range !148
  %15 = zext i32 %14 to i64, !dbg !518
  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 1, !dbg !525
  %17 = udiv i64 %15, %.fca.0.0.0.extract9, !dbg !526
  %unbox3.i = load i64, i64* %16, align 8, !dbg !527, !tbaa !170, !alias.scope !174, !noalias !177
  %.not = icmp sgt i64 %unbox3.i, %.fca.0.0.0.extract9, !dbg !527
  br i1 %.not, label %julia_partial_mapreduce_grid_48233_inner.exit, label %L49.i, !dbg !529

L49.i:                                            ; preds = %entry
  %18 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 0, !dbg !530
  %19 = fadd double %0, %0, !dbg !532
  %unbox11.i = load i64, i64* %18, align 8, !dbg !535, !tbaa !170, !alias.scope !174, !noalias !177
  %20 = add i64 %unbox11.i, -1, !dbg !535
  %21 = zext i32 %11 to i64, !dbg !537
  %22 = mul i64 %20, %21, !dbg !539
  %23 = zext i32 %10 to i64, !dbg !541
  %24 = add i64 %22, %23, !dbg !543
  %.not3647 = icmp sgt i64 %24, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not3647, label %L203.i, label %L97.i.lr.ph, !dbg !547

L97.i.lr.ph:                                      ; preds = %L49.i
  %25 = call i64 @llvm.smax.i64(i64 %.fca.0.0.2.0.extract, i64 noundef 0) #13, !dbg !485
  %.not38 = icmp eq i64 %25, 1
  %26 = bitcast i8 addrspace(1)* %.fca.0.0.0.extract to double addrspace(1)*
  %27 = mul nuw nsw i64 %17, %21
  %28 = mul i64 %17, %21, !dbg !548
  br label %L97.i, !dbg !547

L97.i:                                            ; preds = %L97.i, %L97.i.lr.ph
  %iv = phi i64 [ %iv.next, %L97.i ], [ 0, %L97.i.lr.ph ]
  %value_phi12.i49 = phi double [ %19, %L97.i.lr.ph ], [ %35, %L97.i ]
  %29 = mul i64 %28, %iv, !dbg !548
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !548
  %30 = add i64 %24, %29, !dbg !548
  %31 = call i64 @llvm.smax.i64(i64 %30, i64 %unbox3.i) #13, !dbg !548
  %.op = add i64 %31, -1, !dbg !553
  %32 = select i1 %.not38, i64 0, i64 %.op, !dbg !553
  %33 = getelementptr inbounds double, double addrspace(1)* %26, i64 %32, !dbg !568
  %34 = load double, double addrspace(1)* %33, align 8, !dbg !568, !tbaa !248
  %35 = fadd double %value_phi12.i49, %34, !dbg !569
  %36 = add i64 %30, %27, !dbg !571
  %.not36 = icmp sgt i64 %36, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not36, label %L203.i.loopexit, label %L97.i, !dbg !547

L203.i.loopexit:                                  ; preds = %L97.i
  br label %L203.i, !dbg !573

L203.i:                                           ; preds = %L203.i.loopexit, %L49.i
  %value_phi12.i.lcssa = phi double [ %19, %L49.i ], [ %35, %L203.i.loopexit ]
  %37 = call i32 @llvm.nvvm.read.ptx.sreg.warpsize() #13, !dbg !573
  %38 = icmp eq i32 %37, 32, !dbg !576
  call void @llvm.assume(i1 noundef %38) #13, !dbg !578
  %39 = call fastcc [2 x i32] @julia_fldmod1_48268(i32 signext %10) #14, !dbg !579
  %.fca.0.extract14 = extractvalue [2 x i32] %39, 0, !dbg !579
  %bitcast_coercion.i = bitcast double %value_phi12.i.lcssa to i64, !dbg !580
  %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !586
  %41 = trunc i64 %40 to i32, !dbg !589
  %42 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %41, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %43 = zext i32 %42 to i64, !dbg !592
  %44 = shl nuw i64 %43, 32, !dbg !594
  %45 = trunc i64 %bitcast_coercion.i to i32, !dbg !589
  %46 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %45, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %47 = zext i32 %46 to i64, !dbg !596
  %48 = or i64 %44, %47, !dbg !601
  %bitcast_coercion35.i = bitcast i64 %48 to double, !dbg !580
  %49 = fadd double %value_phi12.i.lcssa, %bitcast_coercion35.i, !dbg !603
  %bitcast_coercion.i.1 = bitcast double %49 to i64, !dbg !580
  %50 = lshr i64 %bitcast_coercion.i.1, 32, !dbg !586
  %51 = trunc i64 %50 to i32, !dbg !589
  %52 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %51, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %53 = zext i32 %52 to i64, !dbg !592
  %54 = shl nuw i64 %53, 32, !dbg !594
  %55 = trunc i64 %bitcast_coercion.i.1 to i32, !dbg !589
  %56 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %55, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %57 = zext i32 %56 to i64, !dbg !596
  %58 = or i64 %54, %57, !dbg !601
  %bitcast_coercion35.i.1 = bitcast i64 %58 to double, !dbg !580
  %59 = fadd double %49, %bitcast_coercion35.i.1, !dbg !603
  %bitcast_coercion.i.2 = bitcast double %59 to i64, !dbg !580
  %60 = lshr i64 %bitcast_coercion.i.2, 32, !dbg !586
  %61 = trunc i64 %60 to i32, !dbg !589
  %62 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %61, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %63 = zext i32 %62 to i64, !dbg !592
  %64 = shl nuw i64 %63, 32, !dbg !594
  %65 = trunc i64 %bitcast_coercion.i.2 to i32, !dbg !589
  %66 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %65, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %67 = zext i32 %66 to i64, !dbg !596
  %68 = or i64 %64, %67, !dbg !601
  %bitcast_coercion35.i.2 = bitcast i64 %68 to double, !dbg !580
  %69 = fadd double %59, %bitcast_coercion35.i.2, !dbg !603
  %bitcast_coercion.i.3 = bitcast double %69 to i64, !dbg !580
  %70 = lshr i64 %bitcast_coercion.i.3, 32, !dbg !586
  %71 = trunc i64 %70 to i32, !dbg !589
  %72 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %71, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %73 = zext i32 %72 to i64, !dbg !592
  %74 = shl nuw i64 %73, 32, !dbg !594
  %75 = trunc i64 %bitcast_coercion.i.3 to i32, !dbg !589
  %76 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %75, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %77 = zext i32 %76 to i64, !dbg !596
  %78 = or i64 %74, %77, !dbg !601
  %bitcast_coercion35.i.3 = bitcast i64 %78 to double, !dbg !580
  %79 = fadd double %69, %bitcast_coercion35.i.3, !dbg !603
  %bitcast_coercion.i.4 = bitcast double %79 to i64, !dbg !580
  %80 = lshr i64 %bitcast_coercion.i.4, 32, !dbg !586
  %81 = trunc i64 %80 to i32, !dbg !589
  %82 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %81, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %83 = trunc i64 %bitcast_coercion.i.4 to i32, !dbg !589
  %84 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %83, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %.fca.1.extract15 = extractvalue [2 x i32] %39, 1, !dbg !579
  %.not39 = icmp eq i32 %.fca.1.extract15, 1, !dbg !605
  br i1 %.not39, label %L254.i, label %L273.i, !dbg !608

L254.i:                                           ; preds = %L203.i
  %85 = zext i32 %82 to i64, !dbg !592
  %86 = shl nuw i64 %85, 32, !dbg !594
  %87 = zext i32 %84 to i64, !dbg !596
  %88 = or i64 %86, %87, !dbg !601
  %bitcast_coercion35.i.4 = bitcast i64 %88 to double, !dbg !580
  %89 = fadd double %79, %bitcast_coercion35.i.4, !dbg !603
  %90 = add i32 %.fca.0.extract14, -1, !dbg !609
  %91 = sext i32 %90 to i64, !dbg !618
  %92 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %91, !dbg !618
  store double %89, double addrspace(3)* %92, align 8, !dbg !618, !tbaa !320, !noalias !619
  br label %L273.i, !dbg !622

L273.i:                                           ; preds = %L254.i, %L203.i
  call void @llvm.nvvm.barrier0() #13, !dbg !623
  %93 = lshr i32 %11, 5, !dbg !625
  %94 = and i32 %11, 992, !dbg !628
  %95 = icmp ne i32 %94, %11, !dbg !630
  %96 = zext i1 %95 to i32, !dbg !632
  %97 = add nuw nsw i32 %93, %96, !dbg !637
  %.not40.not = icmp ult i32 %9, %97, !dbg !639
  br i1 %.not40.not, label %L302.i, label %L322.i, !dbg !640

L302.i:                                           ; preds = %L273.i
  %98 = add i32 %.fca.1.extract15, -1, !dbg !641
  %99 = sext i32 %98 to i64, !dbg !650
  %100 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %99, !dbg !650
  %101 = load double, double addrspace(3)* %100, align 8, !dbg !650, !tbaa !320
  br label %L322.i, !dbg !650

L322.i:                                           ; preds = %L302.i, %L273.i
  %value_phi46.i = phi double [ %101, %L302.i ], [ %0, %L273.i ]
  %.not41 = icmp eq i32 %.fca.0.extract14, 1, !dbg !651
  br i1 %.not41, label %L331.i.preheader, label %L357.i, !dbg !654

L331.i.preheader:                                 ; preds = %L322.i
  %bitcast_coercion50.i = bitcast double %value_phi46.i to i64, !dbg !655
  %102 = lshr i64 %bitcast_coercion50.i, 32, !dbg !661
  %103 = trunc i64 %102 to i32, !dbg !664
  %104 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %103, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %105 = zext i32 %104 to i64, !dbg !667
  %106 = shl nuw i64 %105, 32, !dbg !669
  %107 = trunc i64 %bitcast_coercion50.i to i32, !dbg !664
  %108 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %107, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %109 = zext i32 %108 to i64, !dbg !671
  %110 = or i64 %106, %109, !dbg !676
  %bitcast_coercion51.i = bitcast i64 %110 to double, !dbg !655
  %111 = fadd double %value_phi46.i, %bitcast_coercion51.i, !dbg !678
  %bitcast_coercion50.i.1 = bitcast double %111 to i64, !dbg !655
  %112 = lshr i64 %bitcast_coercion50.i.1, 32, !dbg !661
  %113 = trunc i64 %112 to i32, !dbg !664
  %114 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %113, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %115 = zext i32 %114 to i64, !dbg !667
  %116 = shl nuw i64 %115, 32, !dbg !669
  %117 = trunc i64 %bitcast_coercion50.i.1 to i32, !dbg !664
  %118 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %117, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %119 = zext i32 %118 to i64, !dbg !671
  %120 = or i64 %116, %119, !dbg !676
  %bitcast_coercion51.i.1 = bitcast i64 %120 to double, !dbg !655
  %121 = fadd double %111, %bitcast_coercion51.i.1, !dbg !678
  %bitcast_coercion50.i.2 = bitcast double %121 to i64, !dbg !655
  %122 = lshr i64 %bitcast_coercion50.i.2, 32, !dbg !661
  %123 = trunc i64 %122 to i32, !dbg !664
  %124 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %123, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %125 = zext i32 %124 to i64, !dbg !667
  %126 = shl nuw i64 %125, 32, !dbg !669
  %127 = trunc i64 %bitcast_coercion50.i.2 to i32, !dbg !664
  %128 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %127, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %129 = zext i32 %128 to i64, !dbg !671
  %130 = or i64 %126, %129, !dbg !676
  %bitcast_coercion51.i.2 = bitcast i64 %130 to double, !dbg !655
  %131 = fadd double %121, %bitcast_coercion51.i.2, !dbg !678
  %bitcast_coercion50.i.3 = bitcast double %131 to i64, !dbg !655
  %132 = lshr i64 %bitcast_coercion50.i.3, 32, !dbg !661
  %133 = trunc i64 %132 to i32, !dbg !664
  %134 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %133, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %135 = zext i32 %134 to i64, !dbg !667
  %136 = shl nuw i64 %135, 32, !dbg !669
  %137 = trunc i64 %bitcast_coercion50.i.3 to i32, !dbg !664
  %138 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %137, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %139 = zext i32 %138 to i64, !dbg !671
  %140 = or i64 %136, %139, !dbg !676
  %bitcast_coercion51.i.3 = bitcast i64 %140 to double, !dbg !655
  %141 = fadd double %131, %bitcast_coercion51.i.3, !dbg !678
  %bitcast_coercion50.i.4 = bitcast double %141 to i64, !dbg !655
  %142 = lshr i64 %bitcast_coercion50.i.4, 32, !dbg !661
  %143 = trunc i64 %142 to i32, !dbg !664
  %144 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %143, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %145 = zext i32 %144 to i64, !dbg !667
  %146 = shl nuw i64 %145, 32, !dbg !669
  %147 = trunc i64 %bitcast_coercion50.i.4 to i32, !dbg !664
  %148 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %147, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %149 = zext i32 %148 to i64, !dbg !671
  %150 = or i64 %146, %149, !dbg !676
  %bitcast_coercion51.i.4 = bitcast i64 %150 to double, !dbg !655
  %151 = fadd double %141, %bitcast_coercion51.i.4, !dbg !678
  br label %L357.i, !dbg !680

L357.i:                                           ; preds = %L331.i.preheader, %L322.i
  %value_phi52.i = phi double [ %value_phi46.i, %L322.i ], [ %151, %L331.i.preheader ]
  %.not42 = icmp eq i32 %9, 0, !dbg !680
  br i1 %.not42, label %L362.i, label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !683

L362.i:                                           ; preds = %L357.i
  %152 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #13, !dbg !684
  %153 = mul i64 %20, %152, !dbg !695
  %154 = add i64 %unbox3.i, -1, !dbg !699
  %155 = add i64 %154, %153, !dbg !700
  %156 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*, !dbg !708
  %157 = getelementptr inbounds double, double addrspace(1)* %156, i64 %155, !dbg !708
  store double %value_phi52.i, double addrspace(1)* %157, align 8, !dbg !708, !tbaa !248, !noalias !619
  br label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !709

julia_partial_mapreduce_grid_48233_inner.exit:    ; preds = %L362.i, %L357.i, %entry
  call void @llvm.lifetime.end.p0i8(i64 noundef 16, i8* noundef nonnull %6) #13, !dbg !710
  ret void, !dbg !485
}

 constantarg[double %0] = 0 type: {[-1]:Float@double} - vals: {}
 constantarg[[1 x [1 x [1 x i64]]] %1] = 1 type: {[-1]:Integer} - vals: {}
 constantarg[[1 x [1 x [1 x i64]]] %2] = 1 type: {[-1]:Integer} - vals: {}
 constantarg[{ i8 addrspace(1)*, i64, [2 x i64], i64 } %3] = 0 type: {[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer} - vals: {}
 constantarg[{ [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4] = 0 type: {[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer} - vals: {}
 constantinst[  %5 = alloca [2 x i64], align 8] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  %.fca.0.0.0.extract13 = extractvalue [1 x [1 x [1 x i64]]] %1, 0, 0, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.0.0.extract9 = extractvalue [1 x [1 x [1 x i64]]] %2, 0, 0, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0, !dbg !13] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 2, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.0.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 0, !dbg !13] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %.fca.0.0.2.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 2, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %6 = bitcast [2 x i64]* %5 to i8*] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  call void @llvm.lifetime.start.p0i8(i64 noundef 16, i8* noundef nonnull align 8 dereferenceable(16) %6) #13] = 1 val:1 type: {}
 constantinst[  %7 = call {}*** @julia.get_pgcstack() #13] = 1 val:1 type: {}
 constantinst[  %8 = icmp sgt i64 %.fca.0.0.0.extract9, 0, !dbg !14] = 1 val:1 type: {[-1]:Integer}
 constantinst[  call void @llvm.assume(i1 noundef %8) #13, !dbg !22] = 1 val:1 type: {}
 constantinst[  %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #13, !dbg !25, !range !38] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %10 = add nuw nsw i32 %9, 1, !dbg !39] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #13, !dbg !41, !range !49] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #13, !dbg !50, !range !58] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %13 = add nuw nsw i32 %12, 1, !dbg !59] = 1 val:1 type: {[-1]:Integer}
 constantinst[  call fastcc void @julia_fldmod1_48281([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %5, i32 signext %13, i64 signext %.fca.0.0.0.extract9) #13, !dbg !60] = 1 val:1 type: {}
 constantinst[  %14 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #13, !dbg !61, !range !69] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %15 = zext i32 %14 to i64, !dbg !70] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 1, !dbg !87] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  %17 = udiv i64 %15, %.fca.0.0.0.extract9, !dbg !90] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %unbox3.i = load i64, i64* %16, align 8, !dbg !92, !tbaa !95, !alias.scope !99, !noalias !102] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not = icmp sgt i64 %unbox3.i, %.fca.0.0.0.extract9, !dbg !92] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not, label %julia_partial_mapreduce_grid_48233_inner.exit, label %L49.i, !dbg !107] = 1 val:1 type: {}
 constantinst[  %18 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 0, !dbg !108] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  %19 = fadd double %0, %0, !dbg !110] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %unbox11.i = load i64, i64* %18, align 8, !dbg !117, !tbaa !95, !alias.scope !99, !noalias !102] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %20 = add i64 %unbox11.i, -1, !dbg !117] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %21 = zext i32 %11 to i64, !dbg !120] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %22 = mul i64 %20, %21, !dbg !124] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %23 = zext i32 %10 to i64, !dbg !126] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %24 = add i64 %22, %23, !dbg !128] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not3647 = icmp sgt i64 %24, %.fca.0.0.0.extract13, !dbg !130] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not3647, label %L203.i, label %L97.i.lr.ph, !dbg !132] = 1 val:1 type: {}
 constantinst[  %25 = call i64 @llvm.smax.i64(i64 %.fca.0.0.2.0.extract, i64 noundef 0) #13, !dbg !13] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not38 = icmp eq i64 %25, 1] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %26 = bitcast i8 addrspace(1)* %.fca.0.0.0.extract to double addrspace(1)*] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %27 = mul nuw nsw i64 %17, %21] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %28 = mul i64 %17, %21, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br label %L97.i, !dbg !132] = 1 val:1 type: {}
 constantinst[  %iv = phi i64 [ %iv.next, %L97.i ], [ 0, %L97.i.lr.ph ]] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %value_phi12.i49 = phi double [ %19, %L97.i.lr.ph ], [ %35, %L97.i ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %29 = mul i64 %28, %iv, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %iv.next = add nuw nsw i64 %iv, 1, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %30 = add i64 %24, %29, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %31 = call i64 @llvm.smax.i64(i64 %30, i64 %unbox3.i) #13, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.op = add i64 %31, -1, !dbg !144] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %32 = select i1 %.not38, i64 0, i64 %.op, !dbg !144] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %33 = getelementptr inbounds double, double addrspace(1)* %26, i64 %32, !dbg !173] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %34 = load double, double addrspace(1)* %33, align 8, !dbg !173, !tbaa !174] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %35 = fadd double %value_phi12.i49, %34, !dbg !177] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %36 = add i64 %30, %27, !dbg !179] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not36 = icmp sgt i64 %36, %.fca.0.0.0.extract13, !dbg !130] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not36, label %L203.i.loopexit, label %L97.i, !dbg !132] = 1 val:1 type: {}
 constantinst[  br label %L203.i, !dbg !181] = 1 val:1 type: {}
 constantinst[  %value_phi12.i.lcssa = phi double [ %19, %L49.i ], [ %35, %L203.i.loopexit ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %37 = call i32 @llvm.nvvm.read.ptx.sreg.warpsize() #13, !dbg !181] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %38 = icmp eq i32 %37, 32, !dbg !186] = 1 val:1 type: {[-1]:Integer}
 constantinst[  call void @llvm.assume(i1 noundef %38) #13, !dbg !189] = 1 val:1 type: {}
 constantinst[  %39 = call fastcc [2 x i32] @julia_fldmod1_48268(i32 signext %10) #14, !dbg !190] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.extract14 = extractvalue [2 x i32] %39, 0, !dbg !190] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %bitcast_coercion.i = bitcast double %value_phi12.i.lcssa to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %41 = trunc i64 %40 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %42 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %41, i32 noundef 1, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %43 = zext i32 %42 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %44 = shl nuw i64 %43, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %45 = trunc i64 %bitcast_coercion.i to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %46 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %45, i32 noundef 1, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %47 = zext i32 %46 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %48 = or i64 %44, %47, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i = bitcast i64 %48 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %49 = fadd double %value_phi12.i.lcssa, %bitcast_coercion35.i, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.1 = bitcast double %49 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %50 = lshr i64 %bitcast_coercion.i.1, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %51 = trunc i64 %50 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %52 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %51, i32 noundef 2, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %53 = zext i32 %52 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %54 = shl nuw i64 %53, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %55 = trunc i64 %bitcast_coercion.i.1 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %56 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %55, i32 noundef 2, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %57 = zext i32 %56 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %58 = or i64 %54, %57, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.1 = bitcast i64 %58 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %59 = fadd double %49, %bitcast_coercion35.i.1, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.2 = bitcast double %59 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %60 = lshr i64 %bitcast_coercion.i.2, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %61 = trunc i64 %60 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %62 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %61, i32 noundef 4, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %63 = zext i32 %62 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %64 = shl nuw i64 %63, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %65 = trunc i64 %bitcast_coercion.i.2 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %66 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %65, i32 noundef 4, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %67 = zext i32 %66 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %68 = or i64 %64, %67, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.2 = bitcast i64 %68 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %69 = fadd double %59, %bitcast_coercion35.i.2, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.3 = bitcast double %69 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %70 = lshr i64 %bitcast_coercion.i.3, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %71 = trunc i64 %70 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %72 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %71, i32 noundef 8, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %73 = zext i32 %72 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %74 = shl nuw i64 %73, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %75 = trunc i64 %bitcast_coercion.i.3 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %76 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %75, i32 noundef 8, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %77 = zext i32 %76 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %78 = or i64 %74, %77, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.3 = bitcast i64 %78 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %79 = fadd double %69, %bitcast_coercion35.i.3, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.4 = bitcast double %79 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %80 = lshr i64 %bitcast_coercion.i.4, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %81 = trunc i64 %80 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %82 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %81, i32 noundef 16, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %83 = trunc i64 %bitcast_coercion.i.4 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %84 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %83, i32 noundef 16, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %.fca.1.extract15 = extractvalue [2 x i32] %39, 1, !dbg !190] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not39 = icmp eq i32 %.fca.1.extract15, 1, !dbg !227] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not39, label %L254.i, label %L273.i, !dbg !230] = 1 val:1 type: {}
 constantinst[  %85 = zext i32 %82 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %86 = shl nuw i64 %85, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %87 = zext i32 %84 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %88 = or i64 %86, %87, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.4 = bitcast i64 %88 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %89 = fadd double %79, %bitcast_coercion35.i.4, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %90 = add i32 %.fca.0.extract14, -1, !dbg !231] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %91 = sext i32 %90 to i64, !dbg !245] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %92 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %91, !dbg !245] = 1 val:0 type: {[-1]:Pointer, [-1,0]:Float@double}
 constantinst[  store double %89, double addrspace(3)* %92, align 8, !dbg !245, !tbaa !246, !noalias !248] = 0 val:1 type: {}
 constantinst[  br label %L273.i, !dbg !251] = 1 val:1 type: {}
 constantinst[  call void @llvm.nvvm.barrier0() #13, !dbg !252] = 1 val:1 type: {}
 constantinst[  %93 = lshr i32 %11, 5, !dbg !256] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %94 = and i32 %11, 992, !dbg !260] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %95 = icmp ne i32 %94, %11, !dbg !262] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %96 = zext i1 %95 to i32, !dbg !265] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %97 = add nuw nsw i32 %93, %96, !dbg !272] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not40.not = icmp ult i32 %9, %97, !dbg !274] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not40.not, label %L302.i, label %L322.i, !dbg !275] = 1 val:1 type: {}
 constantinst[  %98 = add i32 %.fca.1.extract15, -1, !dbg !276] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %99 = sext i32 %98 to i64, !dbg !285] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %100 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %99, !dbg !285] = 1 val:0 type: {[-1]:Pointer, [-1,0]:Float@double}
 constantinst[  %101 = load double, double addrspace(3)* %100, align 8, !dbg !285, !tbaa !246] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  br label %L322.i, !dbg !285] = 1 val:1 type: {}
 constantinst[  %value_phi46.i = phi double [ %101, %L302.i ], [ %0, %L273.i ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %.not41 = icmp eq i32 %.fca.0.extract14, 1, !dbg !286] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not41, label %L331.i.preheader, label %L357.i, !dbg !289] = 1 val:1 type: {}
 constantinst[  %bitcast_coercion50.i = bitcast double %value_phi46.i to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %102 = lshr i64 %bitcast_coercion50.i, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %103 = trunc i64 %102 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %104 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %103, i32 noundef 1, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %105 = zext i32 %104 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %106 = shl nuw i64 %105, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %107 = trunc i64 %bitcast_coercion50.i to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %108 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %107, i32 noundef 1, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %109 = zext i32 %108 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %110 = or i64 %106, %109, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i = bitcast i64 %110 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %111 = fadd double %value_phi46.i, %bitcast_coercion51.i, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.1 = bitcast double %111 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %112 = lshr i64 %bitcast_coercion50.i.1, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %113 = trunc i64 %112 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %114 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %113, i32 noundef 2, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %115 = zext i32 %114 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %116 = shl nuw i64 %115, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %117 = trunc i64 %bitcast_coercion50.i.1 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %118 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %117, i32 noundef 2, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %119 = zext i32 %118 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %120 = or i64 %116, %119, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.1 = bitcast i64 %120 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %121 = fadd double %111, %bitcast_coercion51.i.1, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.2 = bitcast double %121 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %122 = lshr i64 %bitcast_coercion50.i.2, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %123 = trunc i64 %122 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %124 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %123, i32 noundef 4, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %125 = zext i32 %124 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %126 = shl nuw i64 %125, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %127 = trunc i64 %bitcast_coercion50.i.2 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %128 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %127, i32 noundef 4, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %129 = zext i32 %128 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %130 = or i64 %126, %129, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.2 = bitcast i64 %130 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %131 = fadd double %121, %bitcast_coercion51.i.2, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.3 = bitcast double %131 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %132 = lshr i64 %bitcast_coercion50.i.3, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %133 = trunc i64 %132 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %134 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %133, i32 noundef 8, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %135 = zext i32 %134 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %136 = shl nuw i64 %135, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %137 = trunc i64 %bitcast_coercion50.i.3 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %138 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %137, i32 noundef 8, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %139 = zext i32 %138 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %140 = or i64 %136, %139, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.3 = bitcast i64 %140 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %141 = fadd double %131, %bitcast_coercion51.i.3, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.4 = bitcast double %141 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %142 = lshr i64 %bitcast_coercion50.i.4, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %143 = trunc i64 %142 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %144 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %143, i32 noundef 16, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %145 = zext i32 %144 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %146 = shl nuw i64 %145, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %147 = trunc i64 %bitcast_coercion50.i.4 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %148 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %147, i32 noundef 16, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %149 = zext i32 %148 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %150 = or i64 %146, %149, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.4 = bitcast i64 %150 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %151 = fadd double %141, %bitcast_coercion51.i.4, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  br label %L357.i, !dbg !315] = 1 val:1 type: {}
 constantinst[  %value_phi52.i = phi double [ %value_phi46.i, %L322.i ], [ %151, %L331.i.preheader ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %.not42 = icmp eq i32 %9, 0, !dbg !315] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not42, label %L362.i, label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !318] = 1 val:1 type: {}
 constantinst[  %152 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #13, !dbg !319] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %153 = mul i64 %20, %152, !dbg !337] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %154 = add i64 %unbox3.i, -1, !dbg !342] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %155 = add i64 %154, %153, !dbg !343] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %156 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*, !dbg !351] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %157 = getelementptr inbounds double, double addrspace(1)* %156, i64 %155, !dbg !351] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  store double %value_phi52.i, double addrspace(1)* %157, align 8, !dbg !351, !tbaa !174, !noalias !248] = 0 val:1 type: {}
 constantinst[  br label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !352] = 1 val:1 type: {}
 constantinst[  call void @llvm.lifetime.end.p0i8(i64 noundef 16, i8* noundef nonnull %6) #13, !dbg !353] = 1 val:1 type: {}
 constantinst[  ret void, !dbg !13] = 1 val:1 type: {}
cannot handle unknown binary operator:   %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !202

Stacktrace:
 [1] >>>
   @ ./int.jl:530
 [2] >>>
   @ ./int.jl:538
 [3] shfl_recurse
   @ ~/.julia/dev/CUDA/src/device/intrinsics/warp.jl:86
 [4] shfl_recurse
   @ ~/.julia/dev/CUDA/src/device/intrinsics/warp.jl:97
 [5] shfl_down_sync (repeats 2 times)
   @ ~/.julia/dev/CUDA/src/device/intrinsics/warp.jl:78
 [6] reduce_warp
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:12
 [7] reduce_block
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:28
 [8] partial_mapreduce_grid
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:126
 [9] partial_mapreduce_grid
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:0
 was thrown during kernel execution on thread (1, 1, 1) in block (1, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

ERROR: Enzyme execution failed.
Enzyme: Not yet implemented forward for jl_eqtable_get
Stacktrace:
 [1] get
   @ ./iddict.jl:102
 [2] macro expansion
   @ ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:208
 [3] #_mapreduce#42
   @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:71

Stacktrace:
  [1] throwerr(cstr::Cstring)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:1696
  [2] get
    @ ./iddict.jl:102 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:208 [inlined]
  [4] #_mapreduce#42
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:71
  [5] _mapreduce
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:33 [inlined]
  [6] mapreduce
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:28 [inlined]
  [7] _sum
    @ ./reducedim.jl:1015 [inlined]
  [8] _sum
    @ ./reducedim.jl:1014 [inlined]
  [9] sum
    @ ./reducedim.jl:1010 [inlined]
 [10] f
    @ ./REPL[2]:1 [inlined]
 [11] fwddiffejulia_f_2022wrap
    @ ./REPL[2]:0
 [12] macro expansion
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6673 [inlined]
 [13] enzyme_call
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6273 [inlined]
 [14] ForwardModeThunk
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6153 [inlined]
 [15] autodiff
    @ ~/.julia/dev/Enzyme/src/Enzyme.jl:427 [inlined]
 [16] autodiff(::ForwardMode{FFIABI}, ::typeof(f), ::Type{Duplicated}, ::Duplicated{CuArray{…}}, ::Const{CuArray{…}})
    @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:326
 [17] top-level scope
    @ REPL[7]:1
Some type information was truncated. Use `show(err)` to see complete types.

Reverse mode will be added in https://github.com/JuliaGPU/CUDA.jl/pull/2422 as I understand it.

roflmaostc commented 3 months ago

Ah cool to see some progress!

wsmoses commented 3 months ago

From my local tests, that PR should work to enable broadcasting, CUDA kernel calls, etc., but it needs some brief CUDA.jl CI love.

Note that while the PR adds support for general heterogeneous code (previously, device code and host code were each supported only separately), it will still need optimization after landing to reach full performance.

wsmoses commented 3 months ago

@jgreener64 FYI your latest issue isn't a problem with broadcasting, but with the reduction in `sum`.

wsmoses commented 3 months ago

Reverse-mode support for broadcasting has now landed in CUDA.jl. Note that, as mentioned, this does not include reductions (like `sum`). Please open an issue for that on CUDA.jl and cc me.

jgreener64 commented 3 months ago

Issue opened at https://github.com/JuliaGPU/CUDA.jl/issues/2455.