`CuArray` broadcasting

Opening this to track progress on taking gradients through `CuArray` broadcasting. With Enzyme main (a68bf83), CUDA v5.3.4 and Julia 1.10.2:

using Enzyme, CUDA
f(x, y) = sum(x .+ y)
x = CuArray(rand(5))
y = CuArray(rand(5))
dx = CuArray([1.0, 0.0, 0.0, 0.0, 0.0])
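
For reference, here is a rough CPU-only sketch of the values I would expect for this `f`, using plain `Array`s and finite differences rather than Enzyme or CUDA. The names `forward_ref`, `reverse_ref` and `fd_gradient` are made up for this illustration and are not part of either package. Since `f` is linear in `x`, the derivative along `dx` should be `sum(dx) = 1.0` and the gradient with respect to `x` should be all ones.

```julia
# CPU-only reference for the reproducer above; needs neither Enzyme nor CUDA.
f(x, y) = sum(x .+ y)

x = rand(5)
y = rand(5)
dx = [1.0, 0.0, 0.0, 0.0, 0.0]   # same seed direction as in the report

# Forward-mode reference: directional derivative of f along dx,
# approximated with a one-sided finite difference.
h = 1e-6
forward_ref = (f(x .+ h .* dx, y) - f(x, y)) / h

# Reverse-mode reference: gradient of f with respect to x,
# approximated with central differences, one coordinate at a time.
# `fd_gradient` is a hypothetical helper written for this sketch.
function fd_gradient(f, x, y; h = 1e-6)
    g = similar(x)
    for i in eachindex(x)
        e = zeros(length(x))
        e[i] = h
        g[i] = (f(x .+ e, y) - f(x .- e, y)) / (2h)
    end
    return g
end

reverse_ref = fd_gradient(f, x, y)

@show forward_ref   # should be close to 1.0
@show reverse_ref   # should be close to ones(5)
```

As far as I understand, Enzyme accumulates reverse-mode gradients into the shadow array, so with the non-zero `dx` above a working reverse pass would presumably leave `dx` equal to its initial value plus the gradient, not the gradient alone.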

Forward mode segfaults:

autodiff(Forward, f, Duplicated, Duplicated(x, dx), Const(y))

[3915854] signal (11.2): Segmentation fault
in expression starting at REPL[12]:1
unknown function (ip: 0x7f9c7fcaf1b0)
visitIntrinsicInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:3696
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
GetOrCreateShadowFunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/GradientUtils.cpp:4622
invertPointerM at /workspace/srcdir/Enzyme/enzyme/Enzyme/GradientUtils.cpp:5533
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4914
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4950
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4950
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
recursivelyHandleSubfunction at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:4950
visitCallInst at /workspace/srcdir/Enzyme/enzyme/Enzyme/AdjointGenerator.h:6492
visit at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/llvm/IR/InstVisitor.h:111 [inlined]
CreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/EnzymeLogic.cpp:4970
EnzymeCreateForwardDiff at /workspace/srcdir/Enzyme/enzyme/Enzyme/CApi.cpp:591
EnzymeCreateForwardDiff at /home/jgreener/.julia/dev/Enzyme/src/api.jl:168
unknown function (ip: 0x7f9cd004b93a)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
enzyme! at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:3261
unknown function (ip: 0x7f9cd00473e8)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#codegen#518 at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5142
codegen at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:4549 [inlined]
_thunk at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5839
_thunk at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5839 [inlined]
cached_compilation at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5877 [inlined]
#563 at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5943
#JuliaContext#147 at /home/jgreener/.julia/dev/GPUCompiler/src/driver.jl:52
unknown function (ip: 0x7f9cd01e2216)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
JuliaContext at /home/jgreener/.julia/dev/GPUCompiler/src/driver.jl:42
#s2042#562 at /home/jgreener/.julia/dev/Enzyme/src/compiler.jl:5895 [inlined]
#s2042#562 at ./none:0
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
GeneratedFunctionStub at ./boot.jl:602
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_call_staged at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/method.c:540
ijl_code_for_staged at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/method.c:593
get_staged at ./compiler/utilities.jl:123
retrieve_code_info at ./compiler/utilities.jl:135 [inlined]
InferenceState at ./compiler/inferencestate.jl:430
typeinf_edge at ./compiler/typeinfer.jl:920
abstract_call_method at ./compiler/abstractinterpretation.jl:629
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:95
abstract_call_known at ./compiler/abstractinterpretation.jl:2087
abstract_call at ./compiler/abstractinterpretation.jl:2169
abstract_call at ./compiler/abstractinterpretation.jl:2162
abstract_call at ./compiler/abstractinterpretation.jl:2354
abstract_eval_call at ./compiler/abstractinterpretation.jl:2370
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2380
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2624
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2889
typeinf_local at ./compiler/abstractinterpretation.jl:3098
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3186
_typeinf at ./compiler/typeinfer.jl:247
typeinf at ./compiler/typeinfer.jl:216
typeinf_edge at ./compiler/typeinfer.jl:930
abstract_call_method at ./compiler/abstractinterpretation.jl:629
abstract_call_gf_by_type at ./compiler/abstractinterpretation.jl:95
abstract_call_known at ./compiler/abstractinterpretation.jl:2087
abstract_call at ./compiler/abstractinterpretation.jl:2169
abstract_apply at ./compiler/abstractinterpretation.jl:1612
abstract_call_known at ./compiler/abstractinterpretation.jl:2004
abstract_call at ./compiler/abstractinterpretation.jl:2169
abstract_call at ./compiler/abstractinterpretation.jl:2162
abstract_call at ./compiler/abstractinterpretation.jl:2354
abstract_eval_call at ./compiler/abstractinterpretation.jl:2370
abstract_eval_statement_expr at ./compiler/abstractinterpretation.jl:2380
abstract_eval_statement at ./compiler/abstractinterpretation.jl:2624
abstract_eval_basic_statement at ./compiler/abstractinterpretation.jl:2913
typeinf_local at ./compiler/abstractinterpretation.jl:3098
typeinf_nocycle at ./compiler/abstractinterpretation.jl:3186
_typeinf at ./compiler/typeinfer.jl:247
typeinf at ./compiler/typeinfer.jl:216
typeinf_ext at ./compiler/typeinfer.jl:1051
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1082
typeinf_ext_toplevel at ./compiler/typeinfer.jl:1078
jfptr_typeinf_ext_toplevel_45276.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_type_infer at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:394
jl_generate_fptr_impl at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jitlayers.cpp:502
jl_compile_method_internal at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2480 [inlined]
jl_compile_method_internal at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2368
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2886 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:617
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:934
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:877
eval_body at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:579
eval_body at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:544
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/interpreter.c:775
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:934
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/toplevel.c:985
eval at ./boot.jl:385 [inlined]
eval_user_input at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:150
repl_backend_loop at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:246
#start_repl_backend#46 at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:231
start_repl_backend at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:228
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#run_repl#59 at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:389
run_repl at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/usr/share/julia/stdlib/v1.10/REPL/src/REPL.jl:375
jfptr_run_repl_91745.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
#1013 at ./client.jl:432
jfptr_YY.1013_82712.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/builtins.c:812
#invokelatest#2 at ./essentials.jl:892 [inlined]
invokelatest at ./essentials.jl:889 [inlined]
run_main_repl at ./client.jl:416
exec_options at ./client.jl:333
_start at ./client.jl:552
jfptr__start_82738.1 at /home/jgreener/soft/julia/julia-1.10.2/lib/julia/sys.so (unknown line)
_jl_invoke at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:2894 [inlined]
ijl_apply_generic at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/gf.c:3076
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/julia.h:1982 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:582
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-release-1-dot-10/src/jlapi.c:731
main at julia (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 235658985 (Pool: 235384224; Big: 274761); GC: 112
Segmentation fault (core dumped)

Reverse mode fails with a compilation error:

autodiff(Reverse, f, Active, Duplicated(x, dx), Const(y))

┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
┌ Warning: active variables passed by value to jl_new_task are not yet supported
└ @ Enzyme.Compiler ~/.julia/dev/GPUCompiler/src/utils.jl:59
ERROR: Enzyme compilation failed.
Current scope:
; Function Attrs: mustprogress willreturn
define internal fastcc nonnull dereferenceable(16) "enzyme_type"="{[-1]:Pointer, [-1,-1]:Pointer, [-1,8,0]:Pointer, [-1,8,8]:Integer, [-1,8,16]:Pointer}" {} addrspace(10)* @preprocess_julia___910_21700([1 x i32] addrspace(11)* nocapture noundef nonnull readonly align 4 dereferenceable(4) "enzyme_inactive" "enzyme_type"="{[-1]:Pointer, [-1,0]:Integer, [-1,1]:Integer, [-1,2]:Integer, [-1,3]:Integer}" "enzymejl_parmtype"="140596959426144" "enzymejl_parmtype_ref"="1" %0) unnamed_addr #428 !dbg !27759 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = call {}*** @julia.get_pgcstack()
  %3 = bitcast {}*** %2 to {}**
  %4 = getelementptr inbounds {}*, {}** %3, i64 -14
  %5 = getelementptr inbounds {}*, {}** %4, i64 16
  %6 = bitcast {}** %5 to i8**
  %7 = load i8*, i8** %6, align 8
  %8 = call noalias nonnull dereferenceable(16) dereferenceable_or_null(16) {} addrspace(10)* @julia.gc_alloc_obj({}** %4, i64 16, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140595685148560 to {}*) to {} addrspace(10)*)), !enzyme_fromstack !409
  call void @zeroType.264({} addrspace(10)* %8, i8 0, i64 16), !enzyme_zerostack !374
  %9 = bitcast {} addrspace(10)* %8 to [2 x {} addrspace(10)*] addrspace(10)*, !enzyme_caststack !374
  %10 = bitcast {}*** %1 to {}**
  %11 = getelementptr inbounds {}*, {}** %10, i64 -14
  %12 = getelementptr inbounds {}*, {}** %11, i64 16
  %13 = bitcast {}** %12 to i8**
  %14 = load i8*, i8** %13, align 8
  %15 = call noalias nonnull dereferenceable(8) dereferenceable_or_null(8) {} addrspace(10)* @julia.gc_alloc_obj({}** %11, i64 8, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140595682882576 to {}*) to {} addrspace(10)*)), !enzyme_fromstack !409
  call void @zeroType.265({} addrspace(10)* %15, i8 0, i64 8), !enzyme_zerostack !374
  %16 = bitcast {} addrspace(10)* %15 to [1 x {} addrspace(10)*] addrspace(10)*, !enzyme_caststack !374
  %17 = call {}*** @julia.get_pgcstack() #451
  %current_task119 = getelementptr inbounds {}**, {}*** %17, i64 -14
  %current_task1 = bitcast {}*** %current_task119 to {}**
  %ptls_field20 = getelementptr inbounds {}**, {}*** %17, i64 2
  %18 = bitcast {}*** %ptls_field20 to i64***
  %ptls_load2122 = load i64**, i64*** %18, align 8, !tbaa !375
  %19 = getelementptr inbounds i64*, i64** %ptls_load2122, i64 2
  %safepoint = load i64*, i64** %19, align 8, !tbaa !379
  fence syncscope("singlethread") seq_cst
  call void @julia.safepoint(i64* %safepoint) #451, !dbg !27760
  fence syncscope("singlethread") seq_cst
  %20 = getelementptr inbounds [1 x i32], [1 x i32] addrspace(11)* %0, i64 0, i64 0, !dbg !27761
  %unbox = load i32, i32 addrspace(11)* %20, align 4, !dbg !27765, !tbaa !379, !alias.scope !585, !noalias !586
  %21 = call fastcc nonnull {} addrspace(10)* @julia__ntuple_21712() #451, !dbg !27767
  %box = call noalias nonnull dereferenceable(4) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 4, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972381280 to {}*) to {} addrspace(10)*)) #452, !dbg !27764
  %22 = bitcast {} addrspace(10)* %box to i32 addrspace(10)*, !dbg !27764
  store i32 1, i32 addrspace(10)* %22, align 8, !dbg !27764, !tbaa !445, !alias.scope !395, !noalias !27768
  %box4 = call noalias nonnull dereferenceable(4) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 4, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972381008 to {}*) to {} addrspace(10)*)) #452, !dbg !27764
  %23 = bitcast {} addrspace(10)* %box4 to i32 addrspace(10)*, !dbg !27764
  store i32 0, i32 addrspace(10)* %23, align 8, !dbg !27764, !tbaa !445, !alias.scope !395, !noalias !27768
  %box6 = call noalias nonnull dereferenceable(8) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 8, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972383920 to {}*) to {} addrspace(10)*)) #452, !dbg !27764
  %24 = bitcast {} addrspace(10)* %box6 to i8 addrspace(10)*, !dbg !27764
  %newstruct.sroa.0.0..sroa_cast = bitcast {} addrspace(10)* %box6 to i32 addrspace(10)*, !dbg !27764
  store i32 1, i32 addrspace(10)* %newstruct.sroa.0.0..sroa_cast, align 8, !dbg !27764, !tbaa !489, !alias.scope !490, !noalias !27771
  %newstruct.sroa.2.0..sroa_idx = getelementptr inbounds i8, i8 addrspace(10)* %24, i64 4, !dbg !27764
  %newstruct.sroa.2.0..sroa_cast = bitcast i8 addrspace(10)* %newstruct.sroa.2.0..sroa_idx to i32 addrspace(10)*, !dbg !27764
  store i32 %unbox, i32 addrspace(10)* %newstruct.sroa.2.0..sroa_cast, align 4, !dbg !27764, !tbaa !489, !alias.scope !490, !noalias !27771
  %25 = call noalias nonnull "enzyme_inactive" {} addrspace(10)* @ijl_box_int64(i64 noundef signext 0) #453, !dbg !27764
  %26 = call nonnull {} addrspace(10)* ({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)*, {} addrspace(10)*, ...) @julia.call({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)* noundef nonnull @ijl_apply_generic, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972383392 to {}*) to {} addrspace(10)*), {} addrspace(10)* nofree nonnull %box, {} addrspace(10)* nofree nonnull %box4, {} addrspace(10)* nofree nonnull %box6, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604561424448 to {}*) to {} addrspace(10)*), {} addrspace(10)* nonnull %25, {} addrspace(10)* nonnull %21) #454, !dbg !27764
  %newstruct8 = call noalias nonnull dereferenceable(88) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 88, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972389216 to {}*) to {} addrspace(10)*)) #452, !dbg !27772
  %27 = addrspacecast {} addrspace(10)* %newstruct8 to i8 addrspace(11)*, !dbg !27772
  %28 = addrspacecast {} addrspace(10)* %26 to i8 addrspace(11)*, !dbg !27772
  call void @llvm.memcpy.p11i8.p11i8.i64(i8 addrspace(11)* noundef align 8 dereferenceable(88) %27, i8 addrspace(11)* noundef align 1 dereferenceable(88) %28, i64 noundef 88, i1 noundef false) #451, !dbg !27772, !tbaa !392, !alias.scope !395, !noalias !27768
  %newstruct10 = call noalias nonnull dereferenceable(8) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 8, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972388992 to {}*) to {} addrspace(10)*)) #452, !dbg !27775
  %.fca.0.gep18 = getelementptr [2 x {} addrspace(10)*], [2 x {} addrspace(10)*] addrspace(10)* %9, i64 0, i64 0, !dbg !27778
  store {} addrspace(10)* %newstruct10, {} addrspace(10)* addrspace(10)* %.fca.0.gep18, align 8, !dbg !27778, !noalias !27781
  call void ({} addrspace(10)*, ...) @julia.write_barrier({} addrspace(10)* %8, {} addrspace(10)* %newstruct10), !dbg !27778
  %.fca.1.gep = getelementptr [2 x {} addrspace(10)*], [2 x {} addrspace(10)*] addrspace(10)* %9, i64 0, i64 1, !dbg !27778
  store {} addrspace(10)* %newstruct8, {} addrspace(10)* addrspace(10)* %.fca.1.gep, align 8, !dbg !27778, !noalias !27781
  call void ({} addrspace(10)*, ...) @julia.write_barrier({} addrspace(10)* %8, {} addrspace(10)* %newstruct8), !dbg !27778
  %29 = addrspacecast [2 x {} addrspace(10)*] addrspace(10)* %9 to [2 x {} addrspace(10)*] addrspace(11)*, !dbg !27778
  %30 = call fastcc i32 @julia__395_21708([2 x {} addrspace(10)*] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(16) %29) #451, !dbg !27778
  %31 = icmp eq i32 %30, 0, !dbg !27782
  br i1 %31, label %L32, label %L28, !dbg !27785

L28:                                              ; preds = %top
  call fastcc void @julia_throw_api_error_20396(i32 zeroext %30) #455, !dbg !27786
  unreachable, !dbg !27786

L32:                                              ; preds = %top
  %newstruct12 = call noalias nonnull dereferenceable(8) "enzyme_inactive" {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 8, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596972386592 to {}*) to {} addrspace(10)*)) #452, !dbg !27787
  %.fca.0.gep = getelementptr [1 x {} addrspace(10)*], [1 x {} addrspace(10)*] addrspace(10)* %16, i64 0, i64 0, !dbg !27791
  store {} addrspace(10)* %newstruct12, {} addrspace(10)* addrspace(10)* %.fca.0.gep, align 8, !dbg !27791, !noalias !27781
  call void ({} addrspace(10)*, ...) @julia.write_barrier({} addrspace(10)* %15, {} addrspace(10)* %newstruct12), !dbg !27791
  %32 = addrspacecast [1 x {} addrspace(10)*] addrspace(10)* %16 to [1 x {} addrspace(10)*] addrspace(11)*, !dbg !27791
  call fastcc void @julia_check_20440([1 x {} addrspace(10)*] addrspace(11)* nocapture nofree noundef nonnull readonly align 8 dereferenceable(8) %32) #451, !dbg !27791
  %33 = addrspacecast {} addrspace(10)* %newstruct12 to i64 addrspace(11)*, !dbg !27793
  %34 = load i64, i64 addrspace(11)* %33, align 8, !dbg !27793, !tbaa !420, !alias.scope !395, !noalias !398
  %.not = icmp eq i64 %34, 0, !dbg !27796
  br i1 %.not, label %L39, label %L41, !dbg !27795

L39:                                              ; preds = %L32
  call void @ijl_throw({} addrspace(12)* noundef addrspacecast ({}* inttoptr (i64 140604626762208 to {}*) to {} addrspace(12)*)) #455, !dbg !27795
  unreachable, !dbg !27795

L41:                                              ; preds = %L32
  %35 = call fastcc nonnull {} addrspace(10)* @julia_UniqueCuContext_20482(i64 zeroext %34) #451, !dbg !27798
  %36 = addrspacecast {} addrspace(10)* %newstruct10 to i64 addrspace(11)*, !dbg !27799
  %37 = load i64, i64 addrspace(11)* %36, align 8, !dbg !27799, !tbaa !420, !alias.scope !395, !noalias !398
  %newstruct15 = call noalias nonnull dereferenceable(16) {} addrspace(10)* @julia.gc_alloc_obj({}** nonnull %current_task1, i64 noundef 16, {} addrspace(10)* noundef addrspacecast ({}* inttoptr (i64 140596959507904 to {}*) to {} addrspace(10)*)) #452, !dbg !27801
  %38 = addrspacecast {} addrspace(10)* %newstruct15 to {} addrspace(10)* addrspace(11)*, !dbg !27801
  %39 = getelementptr inbounds {} addrspace(10)*, {} addrspace(10)* addrspace(11)* %38, i64 1, !dbg !27801
  store {} addrspace(10)* null, {} addrspace(10)* addrspace(11)* %39, align 8, !dbg !27801, !tbaa !420, !alias.scope !395, !noalias !27768
  %40 = addrspacecast {} addrspace(10)* %newstruct15 to i64 addrspace(11)*, !dbg !27801
  store i64 %37, i64 addrspace(11)* %40, align 8, !dbg !27801, !tbaa !420, !alias.scope !395, !noalias !27768
  %41 = addrspacecast {} addrspace(10)* %newstruct15 to i8 addrspace(11)*, !dbg !27801
  %42 = getelementptr inbounds i8, i8 addrspace(11)* %41, i64 8, !dbg !27801
  %43 = bitcast i8 addrspace(11)* %42 to {} addrspace(10)* addrspace(11)*, !dbg !27801
  store atomic {} addrspace(10)* %35, {} addrspace(10)* addrspace(11)* %43 release, align 8, !dbg !27801, !tbaa !420, !alias.scope !395, !noalias !27768
  ret {} addrspace(10)* %newstruct15, !dbg !27801
}

Illegal replace ficticious phi for:   %_replacementA14 = phi {} addrspace(10)* , !dbg !390 of   %21 = call fastcc nonnull {} addrspace(10)* @julia__ntuple_21712() #451, !dbg !406
; Function Attrs: mustprogress willreturn
define internal fastcc nonnull dereferenceable(16) "enzyme_type"="{[-1]:Pointer, [-1,-1]:Pointer, [-1,8,0]:Pointer, [-1,8,8]:Integer, [-1,8,16]:Pointer}" void @diffejulia___910_21700([1 x i32] addrspace(11)* nocapture readonly align 4 dereferenceable(4) "enzyme_inactive" "enzyme_type"="{[-1]:Pointer, [-1,0]:Integer, [-1,1]:Integer, [-1,2]:Integer, [-1,3]:Integer}" "enzymejl_parmtype"="140596959426144" "enzymejl_parmtype_ref"="1" %0, { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg) unnamed_addr #428 !dbg !32267 {
top:
  %1 = call {}*** @julia.get_pgcstack()
  %2 = call {}*** @julia.get_pgcstack()
  %_replacementA31 = phi {}**
  %_replacementA30 = phi {}**
  %_replacementA29 = phi {}**
  %_replacementA28 = phi i8**
  %_replacementA27 = phi i8*
  %_replacementA26 = phi {} addrspace(10)*
  %_replacementA25 = phi [2 x {} addrspace(10)*] addrspace(10)*
  %_replacementA24 = phi {}**
  %_replacementA23 = phi {}**
  %_replacementA22 = phi {}**
  %_replacementA21 = phi i8**
  %_replacementA20 = phi i8*
  %_replacementA19 = phi {} addrspace(10)*
  %_replacementA18 = phi [1 x {} addrspace(10)*] addrspace(10)*
  %3 = call {}*** @julia.get_pgcstack() #451
  %current_task119 = getelementptr inbounds {}**, {}*** %3, i64 -14
  %current_task1 = bitcast {}*** %current_task119 to {}**
  %ptls_field20_replacementA = phi {}***
  %_replacementA17 = phi i64***
  %ptls_load2122_replacementA = phi i64**
  %_replacementA16 = phi i64**
  %safepoint_replacementA = phi i64*
  %_replacementA15 = phi i32 addrspace(11)* , !dbg !32268
  %unbox_replacementA = phi i32 , !dbg !32272
  %_replacementA14 = phi {} addrspace(10)* , !dbg !32274
  %box_replacementA = phi {} addrspace(10)* , !dbg !32271
  %_replacementA13 = phi i32 addrspace(10)* , !dbg !32271
  %box4_replacementA = phi {} addrspace(10)* , !dbg !32271
  %_replacementA12 = phi i32 addrspace(10)* , !dbg !32271
  %box6_replacementA = phi {} addrspace(10)* , !dbg !32271
  %_replacementA11 = phi i8 addrspace(10)* , !dbg !32271
  %newstruct.sroa.0.0..sroa_cast_replacementA = phi i32 addrspace(10)* , !dbg !32271
  %newstruct.sroa.2.0..sroa_idx_replacementA = phi i8 addrspace(10)* , !dbg !32271
  %newstruct.sroa.2.0..sroa_cast_replacementA = phi i32 addrspace(10)* , !dbg !32271
  %_replacementA10 = phi {} addrspace(10)* , !dbg !32271
  %4 = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 2, !dbg !32271
  %_replacementA9 = phi {} addrspace(10)* , !dbg !32271
  %_replacementA8 = phi i8 addrspace(11)* , !dbg !32275
  %_replacementA7 = phi i8 addrspace(11)* , !dbg !32275
  %newstruct10_replacementA = phi {} addrspace(10)* , !dbg !32278
  %.fca.0.gep18_replacementA = phi {} addrspace(10)* addrspace(10)* , !dbg !32281
  %.fca.1.gep_replacementA = phi {} addrspace(10)* addrspace(10)* , !dbg !32281
  %_replacementA6 = phi [2 x {} addrspace(10)*] addrspace(11)* , !dbg !32281
  %_replacementA5 = phi i32 , !dbg !32281
  %_replacementA = phi i1 , !dbg !32284
  br i1 true, label %L32, label %L28, !dbg !32287

L28:                                              ; preds = %top
  unreachable

L32:                                              ; preds = %top
  %newstruct12_replacementA = phi {} addrspace(10)* , !dbg !32288
  %.fca.0.gep_replacementA = phi {} addrspace(10)* addrspace(10)* , !dbg !32292
  %_replacementA33 = phi [1 x {} addrspace(10)*] addrspace(11)* , !dbg !32292
  %_replacementA32 = phi i64 addrspace(11)* , !dbg !32294
  %5 = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 7, !dbg !32297
  %.not_replacementA = phi i1 , !dbg !32297
  br i1 false, label %L39, label %L41, !dbg !32296

L39:                                              ; preds = %L32
  unreachable

L41:                                              ; preds = %L32
  %tapeArg42 = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 5, !dbg !32299
  %_replacementA43 = phi {} addrspace(10)* , !dbg !32299
  %"'ip_phi3" = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 6, !dbg !32299
  %_replacementA41 = phi i64 addrspace(11)* , !dbg !32300
  %"newstruct15'mi" = extractvalue { {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, {} addrspace(10)*, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 }, {} addrspace(10)*, i64 } %tapeArg, 4, !dbg !32302
  %newstruct15_replacementA = phi {} addrspace(10)* , !dbg !32302
  %_replacementA39 = phi {} addrspace(10)* addrspace(11)* , !dbg !32302
  %_replacementA38 = phi {} addrspace(10)* addrspace(11)* , !dbg !32302
  %_replacementA37 = phi i64 addrspace(11)* , !dbg !32302
  %_replacementA36 = phi i8 addrspace(11)* , !dbg !32302
  %_replacementA35 = phi i8 addrspace(11)* , !dbg !32302
  %_replacementA34 = phi {} addrspace(10)* addrspace(11)* , !dbg !32302
  br label %invertL41, !dbg !32302

allocsForInversion:                               ; No predecessors!

inverttop:                                        ; preds = %invertL32
  %6 = call {} addrspace(10)* ({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)*, {} addrspace(10)*, ...) @julia.call({} addrspace(10)* ({} addrspace(10)*, {} addrspace(10)**, i32)* @ijl_apply_generic, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140596842791248 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140596298768528 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604603211760 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140595658237672 to {}*) to {} addrspace(10)*), {} addrspace(10)* %4, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140596972383392 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %box_replacementA, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %box4_replacementA, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %box6_replacementA, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604561424448 to {}*) to {} addrspace(10)*), {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %_replacementA10, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*), {} addrspace(10)* %_replacementA14, {} addrspace(10)* addrspacecast ({}* inttoptr (i64 140604782354440 to {}*) to {} addrspace(10)*)), !dbg !32271
  fence syncscope("singlethread") seq_cst
  fence syncscope("singlethread") seq_cst
  ret void

invertL28:                                        ; No predecessors!

invertL32:                                        ; preds = %invertL41
  br label %inverttop

invertL39:                                        ; No predecessors!

invertL41:                                        ; preds = %L41
  call fastcc void @diffejulia_UniqueCuContext_20482(i64 zeroext %5, { {} addrspace(10)*, {} addrspace(10)*, i1, {} addrspace(10)* addrspace(10)*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, {} addrspace(10)* addrspace(10)*, i1*, i1, i1, i1*, {} addrspace(10)*, i1 } %tapeArg42), !dbg !32299
  br label %invertL32
}

LLVM.CallInst(%21 = call fastcc nonnull {} addrspace(10)* @julia__ntuple_21712() #451, !dbg !406)
LLVM.PHIInst(%_replacementA14 = phi {} addrspace(10)* , !dbg !390)

Stacktrace:
 [1] ntuple
   @ ./ntuple.jl:19
 [2] _
   @ ~/.julia/dev/CUDA/lib/cudadrv/pool.jl:18

Stacktrace:
  [1] julia_error(cstr::Cstring, val::Ptr{…}, errtype::Enzyme.API.ErrorType, data::Ptr{…}, data2::Ptr{…}, B::Ptr{…})
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:1754
  [2] EnzymeCreatePrimalAndGradient(logic::Enzyme.Logic, todiff::LLVM.Function, retType::Enzyme.API.CDIFFE_TYPE, constant_args::Vector{…}, TA::Enzyme.TypeAnalysis, returnValue::Bool, dretUsed::Bool, mode::Enzyme.API.CDerivativeMode, width::Int64, additionalArg::Ptr{…}, forceAnonymousTape::Bool, typeInfo::Enzyme.FnTypeInfo, uncacheable_args::Vector{…}, augmented::Ptr{…}, atomicAdd::Bool)
    @ Enzyme.API ~/.julia/dev/Enzyme/src/api.jl:154
  [3] enzyme!(job::GPUCompiler.CompilerJob{…}, mod::LLVM.Module, primalf::LLVM.Function, TT::Type, mode::Enzyme.API.CDerivativeMode, width::Int64, parallel::Bool, actualRetType::Type, wrap::Bool, modifiedBetween::Tuple{…}, returnPrimal::Bool, expectedTapeType::Type, loweredArgs::Set{…}, boxedArgs::Set{…})
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:3249
  [4] codegen(output::Symbol, job::GPUCompiler.CompilerJob{…}; libraries::Bool, deferred_codegen::Bool, optimize::Bool, toplevel::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:5142
  [5] codegen
    @ ~/.julia/dev/Enzyme/src/compiler.jl:4549 [inlined]
  [6] _thunk(job::GPUCompiler.CompilerJob{Enzyme.Compiler.EnzymeTarget, Enzyme.Compiler.EnzymeCompilerParams}, postopt::Bool)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:5839
  [7] _thunk
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5839 [inlined]
  [8] cached_compilation
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5877 [inlined]
  [9] (::Enzyme.Compiler.var"#563#564"{…})(ctx::LLVM.Context)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:5943
 [10] JuliaContext(f::Enzyme.Compiler.var"#563#564"{…}; kwargs::@Kwargs{})
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:52
 [11] JuliaContext(f::Function)
    @ GPUCompiler ~/.julia/dev/GPUCompiler/src/driver.jl:42
 [12] #s2042#562
    @ ~/.julia/dev/Enzyme/src/compiler.jl:5895 [inlined]
 [13]
    @ Enzyme.Compiler ./none:0
 [14] (::Core.GeneratedFunctionStub)(::UInt64, ::LineNumberNode, ::Any, ::Vararg{Any})
    @ Core ./boot.jl:602
 [15] autodiff
    @ ~/.julia/dev/Enzyme/src/Enzyme.jl:286 [inlined]
 [16] autodiff(::ReverseMode{false, FFIABI, false}, ::typeof(f), ::Type{Active}, ::Duplicated{CuArray{…}}, ::Const{CuArray{…}})
    @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:303
 [17] top-level scope
    @ REPL[11]:1
 [18] top-level scope
    @ ~/.julia/dev/CUDA/src/initialization.jl:209
Some type information was truncated. Use `show(err)` to see complete types.

As far as I can tell, it doesn't work at the moment.

With Enzyme 0.12.25, CUDA 5.4.2, and Julia 1.10.3, the forward-mode error above has changed to:

warning found shared memory
ERROR: a Enzyme compilation failed.
Current scope: 
; Function Attrs: mustprogress willreturn
define void @preprocess_julia_partial_mapreduce_grid_48233_inner7(double "enzyme_type"="{[-1]:Float@double}" "enzymejl_parmtype"="140238631732192" "enzymejl_parmtype_ref"="0" %0, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %1, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140233465998160" "enzymejl_parmtype_ref"="0" %3, { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140238766152592" "enzymejl_parmtype_ref"="0" %4) local_unnamed_addr #12 !dbg !484 {
entry:
  %5 = alloca [2 x i64], align 8
  %.fca.0.0.0.extract13 = extractvalue [1 x [1 x [1 x i64]]] %1, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract9 = extractvalue [1 x [1 x [1 x i64]]] %2, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0, !dbg !485
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 2, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 0, !dbg !485
  %.fca.0.0.2.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 2, 0, !dbg !485, !enzyme_inactive !12
  %6 = bitcast [2 x i64]* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 16, i8* noundef nonnull align 8 dereferenceable(16) %6) #13
  %7 = call {}*** @julia.get_pgcstack() #13
  %8 = icmp sgt i64 %.fca.0.0.0.extract9, 0, !dbg !486
  call void @llvm.assume(i1 noundef %8) #13, !dbg !490
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #13, !dbg !491, !range !117
  %10 = add nuw nsw i32 %9, 1, !dbg !497
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #13, !dbg !498, !range !128
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #13, !dbg !504, !range !137
  %13 = add nuw nsw i32 %12, 1, !dbg !510
  call fastcc void @julia_fldmod1_48281([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %5, i32 signext %13, i64 signext %.fca.0.0.0.extract9) #13, !dbg !511
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #13, !dbg !512, !range !148
  %15 = zext i32 %14 to i64, !dbg !518
  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 1, !dbg !525
  %17 = udiv i64 %15, %.fca.0.0.0.extract9, !dbg !526
  %unbox3.i = load i64, i64* %16, align 8, !dbg !527, !tbaa !170, !alias.scope !174, !noalias !177
  %.not = icmp sgt i64 %unbox3.i, %.fca.0.0.0.extract9, !dbg !527
  br i1 %.not, label %julia_partial_mapreduce_grid_48233_inner.exit, label %L49.i, !dbg !529

L49.i:                                            ; preds = %entry
  %18 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 0, !dbg !530
  %19 = fadd double %0, %0, !dbg !532
  %unbox11.i = load i64, i64* %18, align 8, !dbg !535, !tbaa !170, !alias.scope !174, !noalias !177
  %20 = add i64 %unbox11.i, -1, !dbg !535
  %21 = zext i32 %11 to i64, !dbg !537
  %22 = mul i64 %20, %21, !dbg !539
  %23 = zext i32 %10 to i64, !dbg !541
  %24 = add i64 %22, %23, !dbg !543
  %.not3647 = icmp sgt i64 %24, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not3647, label %L203.i, label %L97.i.lr.ph, !dbg !547

L97.i.lr.ph:                                      ; preds = %L49.i
  %25 = call i64 @llvm.smax.i64(i64 %.fca.0.0.2.0.extract, i64 noundef 0) #13, !dbg !485
  %.not38 = icmp eq i64 %25, 1
  %26 = bitcast i8 addrspace(1)* %.fca.0.0.0.extract to double addrspace(1)*
  %27 = mul nuw nsw i64 %17, %21
  %28 = mul i64 %17, %21, !dbg !548
  br label %L97.i, !dbg !547

L97.i:                                            ; preds = %L97.i, %L97.i.lr.ph
  %iv = phi i64 [ %iv.next, %L97.i ], [ 0, %L97.i.lr.ph ]
  %value_phi12.i49 = phi double [ %19, %L97.i.lr.ph ], [ %35, %L97.i ]
  %29 = mul i64 %28, %iv, !dbg !548
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !548
  %30 = add i64 %24, %29, !dbg !548
  %31 = call i64 @llvm.smax.i64(i64 %30, i64 %unbox3.i) #13, !dbg !548
  %.op = add i64 %31, -1, !dbg !553
  %32 = select i1 %.not38, i64 0, i64 %.op, !dbg !553
  %33 = getelementptr inbounds double, double addrspace(1)* %26, i64 %32, !dbg !568
  %34 = load double, double addrspace(1)* %33, align 8, !dbg !568, !tbaa !248
  %35 = fadd double %value_phi12.i49, %34, !dbg !569
  %36 = add i64 %30, %27, !dbg !571
  %.not36 = icmp sgt i64 %36, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not36, label %L203.i.loopexit, label %L97.i, !dbg !547

L203.i.loopexit:                                  ; preds = %L97.i
  br label %L203.i, !dbg !573

L203.i:                                           ; preds = %L203.i.loopexit, %L49.i
  %value_phi12.i.lcssa = phi double [ %19, %L49.i ], [ %35, %L203.i.loopexit ]
  %37 = call i32 @llvm.nvvm.read.ptx.sreg.warpsize() #13, !dbg !573
  %38 = icmp eq i32 %37, 32, !dbg !576
  call void @llvm.assume(i1 noundef %38) #13, !dbg !578
  %39 = call fastcc [2 x i32] @julia_fldmod1_48268(i32 signext %10) #14, !dbg !579
  %.fca.0.extract14 = extractvalue [2 x i32] %39, 0, !dbg !579
  %bitcast_coercion.i = bitcast double %value_phi12.i.lcssa to i64, !dbg !580
  %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !586
  %41 = trunc i64 %40 to i32, !dbg !589
  %42 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %41, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %43 = zext i32 %42 to i64, !dbg !592
  %44 = shl nuw i64 %43, 32, !dbg !594
  %45 = trunc i64 %bitcast_coercion.i to i32, !dbg !589
  %46 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %45, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %47 = zext i32 %46 to i64, !dbg !596
  %48 = or i64 %44, %47, !dbg !601
  %bitcast_coercion35.i = bitcast i64 %48 to double, !dbg !580
  %49 = fadd double %value_phi12.i.lcssa, %bitcast_coercion35.i, !dbg !603
  %bitcast_coercion.i.1 = bitcast double %49 to i64, !dbg !580
  %50 = lshr i64 %bitcast_coercion.i.1, 32, !dbg !586
  %51 = trunc i64 %50 to i32, !dbg !589
  %52 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %51, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %53 = zext i32 %52 to i64, !dbg !592
  %54 = shl nuw i64 %53, 32, !dbg !594
  %55 = trunc i64 %bitcast_coercion.i.1 to i32, !dbg !589
  %56 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %55, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %57 = zext i32 %56 to i64, !dbg !596
  %58 = or i64 %54, %57, !dbg !601
  %bitcast_coercion35.i.1 = bitcast i64 %58 to double, !dbg !580
  %59 = fadd double %49, %bitcast_coercion35.i.1, !dbg !603
  %bitcast_coercion.i.2 = bitcast double %59 to i64, !dbg !580
  %60 = lshr i64 %bitcast_coercion.i.2, 32, !dbg !586
  %61 = trunc i64 %60 to i32, !dbg !589
  %62 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %61, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %63 = zext i32 %62 to i64, !dbg !592
  %64 = shl nuw i64 %63, 32, !dbg !594
  %65 = trunc i64 %bitcast_coercion.i.2 to i32, !dbg !589
  %66 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %65, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %67 = zext i32 %66 to i64, !dbg !596
  %68 = or i64 %64, %67, !dbg !601
  %bitcast_coercion35.i.2 = bitcast i64 %68 to double, !dbg !580
  %69 = fadd double %59, %bitcast_coercion35.i.2, !dbg !603
  %bitcast_coercion.i.3 = bitcast double %69 to i64, !dbg !580
  %70 = lshr i64 %bitcast_coercion.i.3, 32, !dbg !586
  %71 = trunc i64 %70 to i32, !dbg !589
  %72 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %71, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %73 = zext i32 %72 to i64, !dbg !592
  %74 = shl nuw i64 %73, 32, !dbg !594
  %75 = trunc i64 %bitcast_coercion.i.3 to i32, !dbg !589
  %76 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %75, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %77 = zext i32 %76 to i64, !dbg !596
  %78 = or i64 %74, %77, !dbg !601
  %bitcast_coercion35.i.3 = bitcast i64 %78 to double, !dbg !580
  %79 = fadd double %69, %bitcast_coercion35.i.3, !dbg !603
  %bitcast_coercion.i.4 = bitcast double %79 to i64, !dbg !580
  %80 = lshr i64 %bitcast_coercion.i.4, 32, !dbg !586
  %81 = trunc i64 %80 to i32, !dbg !589
  %82 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %81, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %83 = trunc i64 %bitcast_coercion.i.4 to i32, !dbg !589
  %84 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %83, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %.fca.1.extract15 = extractvalue [2 x i32] %39, 1, !dbg !579
  %.not39 = icmp eq i32 %.fca.1.extract15, 1, !dbg !605
  br i1 %.not39, label %L254.i, label %L273.i, !dbg !608

L254.i:                                           ; preds = %L203.i
  %85 = zext i32 %82 to i64, !dbg !592
  %86 = shl nuw i64 %85, 32, !dbg !594
  %87 = zext i32 %84 to i64, !dbg !596
  %88 = or i64 %86, %87, !dbg !601
  %bitcast_coercion35.i.4 = bitcast i64 %88 to double, !dbg !580
  %89 = fadd double %79, %bitcast_coercion35.i.4, !dbg !603
  %90 = add i32 %.fca.0.extract14, -1, !dbg !609
  %91 = sext i32 %90 to i64, !dbg !618
  %92 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %91, !dbg !618
  store double %89, double addrspace(3)* %92, align 8, !dbg !618, !tbaa !320, !noalias !619
  br label %L273.i, !dbg !622

L273.i:                                           ; preds = %L254.i, %L203.i
  call void @llvm.nvvm.barrier0() #13, !dbg !623
  %93 = lshr i32 %11, 5, !dbg !625
  %94 = and i32 %11, 992, !dbg !628
  %95 = icmp ne i32 %94, %11, !dbg !630
  %96 = zext i1 %95 to i32, !dbg !632
  %97 = add nuw nsw i32 %93, %96, !dbg !637
  %.not40.not = icmp ult i32 %9, %97, !dbg !639
  br i1 %.not40.not, label %L302.i, label %L322.i, !dbg !640

L302.i:                                           ; preds = %L273.i
  %98 = add i32 %.fca.1.extract15, -1, !dbg !641
  %99 = sext i32 %98 to i64, !dbg !650
  %100 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %99, !dbg !650
  %101 = load double, double addrspace(3)* %100, align 8, !dbg !650, !tbaa !320
  br label %L322.i, !dbg !650

L322.i:                                           ; preds = %L302.i, %L273.i
  %value_phi46.i = phi double [ %101, %L302.i ], [ %0, %L273.i ]
  %.not41 = icmp eq i32 %.fca.0.extract14, 1, !dbg !651
  br i1 %.not41, label %L331.i.preheader, label %L357.i, !dbg !654

L331.i.preheader:                                 ; preds = %L322.i
  %bitcast_coercion50.i = bitcast double %value_phi46.i to i64, !dbg !655
  %102 = lshr i64 %bitcast_coercion50.i, 32, !dbg !661
  %103 = trunc i64 %102 to i32, !dbg !664
  %104 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %103, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %105 = zext i32 %104 to i64, !dbg !667
  %106 = shl nuw i64 %105, 32, !dbg !669
  %107 = trunc i64 %bitcast_coercion50.i to i32, !dbg !664
  %108 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %107, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %109 = zext i32 %108 to i64, !dbg !671
  %110 = or i64 %106, %109, !dbg !676
  %bitcast_coercion51.i = bitcast i64 %110 to double, !dbg !655
  %111 = fadd double %value_phi46.i, %bitcast_coercion51.i, !dbg !678
  %bitcast_coercion50.i.1 = bitcast double %111 to i64, !dbg !655
  %112 = lshr i64 %bitcast_coercion50.i.1, 32, !dbg !661
  %113 = trunc i64 %112 to i32, !dbg !664
  %114 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %113, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %115 = zext i32 %114 to i64, !dbg !667
  %116 = shl nuw i64 %115, 32, !dbg !669
  %117 = trunc i64 %bitcast_coercion50.i.1 to i32, !dbg !664
  %118 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %117, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %119 = zext i32 %118 to i64, !dbg !671
  %120 = or i64 %116, %119, !dbg !676
  %bitcast_coercion51.i.1 = bitcast i64 %120 to double, !dbg !655
  %121 = fadd double %111, %bitcast_coercion51.i.1, !dbg !678
  %bitcast_coercion50.i.2 = bitcast double %121 to i64, !dbg !655
  %122 = lshr i64 %bitcast_coercion50.i.2, 32, !dbg !661
  %123 = trunc i64 %122 to i32, !dbg !664
  %124 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %123, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %125 = zext i32 %124 to i64, !dbg !667
  %126 = shl nuw i64 %125, 32, !dbg !669
  %127 = trunc i64 %bitcast_coercion50.i.2 to i32, !dbg !664
  %128 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %127, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %129 = zext i32 %128 to i64, !dbg !671
  %130 = or i64 %126, %129, !dbg !676
  %bitcast_coercion51.i.2 = bitcast i64 %130 to double, !dbg !655
  %131 = fadd double %121, %bitcast_coercion51.i.2, !dbg !678
  %bitcast_coercion50.i.3 = bitcast double %131 to i64, !dbg !655
  %132 = lshr i64 %bitcast_coercion50.i.3, 32, !dbg !661
  %133 = trunc i64 %132 to i32, !dbg !664
  %134 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %133, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %135 = zext i32 %134 to i64, !dbg !667
  %136 = shl nuw i64 %135, 32, !dbg !669
  %137 = trunc i64 %bitcast_coercion50.i.3 to i32, !dbg !664
  %138 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %137, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %139 = zext i32 %138 to i64, !dbg !671
  %140 = or i64 %136, %139, !dbg !676
  %bitcast_coercion51.i.3 = bitcast i64 %140 to double, !dbg !655
  %141 = fadd double %131, %bitcast_coercion51.i.3, !dbg !678
  %bitcast_coercion50.i.4 = bitcast double %141 to i64, !dbg !655
  %142 = lshr i64 %bitcast_coercion50.i.4, 32, !dbg !661
  %143 = trunc i64 %142 to i32, !dbg !664
  %144 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %143, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %145 = zext i32 %144 to i64, !dbg !667
  %146 = shl nuw i64 %145, 32, !dbg !669
  %147 = trunc i64 %bitcast_coercion50.i.4 to i32, !dbg !664
  %148 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %147, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %149 = zext i32 %148 to i64, !dbg !671
  %150 = or i64 %146, %149, !dbg !676
  %bitcast_coercion51.i.4 = bitcast i64 %150 to double, !dbg !655
  %151 = fadd double %141, %bitcast_coercion51.i.4, !dbg !678
  br label %L357.i, !dbg !680

L357.i:                                           ; preds = %L331.i.preheader, %L322.i
  %value_phi52.i = phi double [ %value_phi46.i, %L322.i ], [ %151, %L331.i.preheader ]
  %.not42 = icmp eq i32 %9, 0, !dbg !680
  br i1 %.not42, label %L362.i, label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !683

L362.i:                                           ; preds = %L357.i
  %152 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #13, !dbg !684
  %153 = mul i64 %20, %152, !dbg !695
  %154 = add i64 %unbox3.i, -1, !dbg !699
  %155 = add i64 %154, %153, !dbg !700
  %156 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*, !dbg !708
  %157 = getelementptr inbounds double, double addrspace(1)* %156, i64 %155, !dbg !708
  store double %value_phi52.i, double addrspace(1)* %157, align 8, !dbg !708, !tbaa !248, !noalias !619
  br label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !709

julia_partial_mapreduce_grid_48233_inner.exit:    ; preds = %L362.i, %L357.i, %entry
  call void @llvm.lifetime.end.p0i8(i64 noundef 16, i8* noundef nonnull %6) #13, !dbg !710
  ret void, !dbg !485
}

; Function Attrs: mustprogress willreturn
define void @preprocess_julia_partial_mapreduce_grid_48233_inner7(double "enzyme_type"="{[-1]:Float@double}" "enzymejl_parmtype"="140238631732192" "enzymejl_parmtype_ref"="0" %0, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %1, [1 x [1 x [1 x i64]]] "enzyme_type"="{[-1]:Integer}" "enzymejl_parmtype"="140238577761536" "enzymejl_parmtype_ref"="0" %2, { i8 addrspace(1)*, i64, [2 x i64], i64 } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140233465998160" "enzymejl_parmtype_ref"="0" %3, { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } "enzyme_type"="{[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer}" "enzymejl_parmtype"="140238766152592" "enzymejl_parmtype_ref"="0" %4) local_unnamed_addr #12 !dbg !484 {
entry:
  %5 = alloca [2 x i64], align 8
  %.fca.0.0.0.extract13 = extractvalue [1 x [1 x [1 x i64]]] %1, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract9 = extractvalue [1 x [1 x [1 x i64]]] %2, 0, 0, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0, !dbg !485
  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 2, 0, !dbg !485, !enzyme_inactive !12
  %.fca.0.0.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 0, !dbg !485
  %.fca.0.0.2.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 2, 0, !dbg !485, !enzyme_inactive !12
  %6 = bitcast [2 x i64]* %5 to i8*
  call void @llvm.lifetime.start.p0i8(i64 noundef 16, i8* noundef nonnull align 8 dereferenceable(16) %6) #13
  %7 = call {}*** @julia.get_pgcstack() #13
  %8 = icmp sgt i64 %.fca.0.0.0.extract9, 0, !dbg !486
  call void @llvm.assume(i1 noundef %8) #13, !dbg !490
  %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #13, !dbg !491, !range !117
  %10 = add nuw nsw i32 %9, 1, !dbg !497
  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #13, !dbg !498, !range !128
  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #13, !dbg !504, !range !137
  %13 = add nuw nsw i32 %12, 1, !dbg !510
  call fastcc void @julia_fldmod1_48281([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %5, i32 signext %13, i64 signext %.fca.0.0.0.extract9) #13, !dbg !511
  %14 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #13, !dbg !512, !range !148
  %15 = zext i32 %14 to i64, !dbg !518
  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 1, !dbg !525
  %17 = udiv i64 %15, %.fca.0.0.0.extract9, !dbg !526
  %unbox3.i = load i64, i64* %16, align 8, !dbg !527, !tbaa !170, !alias.scope !174, !noalias !177
  %.not = icmp sgt i64 %unbox3.i, %.fca.0.0.0.extract9, !dbg !527
  br i1 %.not, label %julia_partial_mapreduce_grid_48233_inner.exit, label %L49.i, !dbg !529

L49.i:                                            ; preds = %entry
  %18 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 0, !dbg !530
  %19 = fadd double %0, %0, !dbg !532
  %unbox11.i = load i64, i64* %18, align 8, !dbg !535, !tbaa !170, !alias.scope !174, !noalias !177
  %20 = add i64 %unbox11.i, -1, !dbg !535
  %21 = zext i32 %11 to i64, !dbg !537
  %22 = mul i64 %20, %21, !dbg !539
  %23 = zext i32 %10 to i64, !dbg !541
  %24 = add i64 %22, %23, !dbg !543
  %.not3647 = icmp sgt i64 %24, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not3647, label %L203.i, label %L97.i.lr.ph, !dbg !547

L97.i.lr.ph:                                      ; preds = %L49.i
  %25 = call i64 @llvm.smax.i64(i64 %.fca.0.0.2.0.extract, i64 noundef 0) #13, !dbg !485
  %.not38 = icmp eq i64 %25, 1
  %26 = bitcast i8 addrspace(1)* %.fca.0.0.0.extract to double addrspace(1)*
  %27 = mul nuw nsw i64 %17, %21
  %28 = mul i64 %17, %21, !dbg !548
  br label %L97.i, !dbg !547

L97.i:                                            ; preds = %L97.i, %L97.i.lr.ph
  %iv = phi i64 [ %iv.next, %L97.i ], [ 0, %L97.i.lr.ph ]
  %value_phi12.i49 = phi double [ %19, %L97.i.lr.ph ], [ %35, %L97.i ]
  %29 = mul i64 %28, %iv, !dbg !548
  %iv.next = add nuw nsw i64 %iv, 1, !dbg !548
  %30 = add i64 %24, %29, !dbg !548
  %31 = call i64 @llvm.smax.i64(i64 %30, i64 %unbox3.i) #13, !dbg !548
  %.op = add i64 %31, -1, !dbg !553
  %32 = select i1 %.not38, i64 0, i64 %.op, !dbg !553
  %33 = getelementptr inbounds double, double addrspace(1)* %26, i64 %32, !dbg !568
  %34 = load double, double addrspace(1)* %33, align 8, !dbg !568, !tbaa !248
  %35 = fadd double %value_phi12.i49, %34, !dbg !569
  %36 = add i64 %30, %27, !dbg !571
  %.not36 = icmp sgt i64 %36, %.fca.0.0.0.extract13, !dbg !545
  br i1 %.not36, label %L203.i.loopexit, label %L97.i, !dbg !547

L203.i.loopexit:                                  ; preds = %L97.i
  br label %L203.i, !dbg !573

L203.i:                                           ; preds = %L203.i.loopexit, %L49.i
  %value_phi12.i.lcssa = phi double [ %19, %L49.i ], [ %35, %L203.i.loopexit ]
  %37 = call i32 @llvm.nvvm.read.ptx.sreg.warpsize() #13, !dbg !573
  %38 = icmp eq i32 %37, 32, !dbg !576
  call void @llvm.assume(i1 noundef %38) #13, !dbg !578
  %39 = call fastcc [2 x i32] @julia_fldmod1_48268(i32 signext %10) #14, !dbg !579
  %.fca.0.extract14 = extractvalue [2 x i32] %39, 0, !dbg !579
  %bitcast_coercion.i = bitcast double %value_phi12.i.lcssa to i64, !dbg !580
  %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !586
  %41 = trunc i64 %40 to i32, !dbg !589
  %42 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %41, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %43 = zext i32 %42 to i64, !dbg !592
  %44 = shl nuw i64 %43, 32, !dbg !594
  %45 = trunc i64 %bitcast_coercion.i to i32, !dbg !589
  %46 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %45, i32 noundef 1, i32 noundef 31) #13, !dbg !590
  %47 = zext i32 %46 to i64, !dbg !596
  %48 = or i64 %44, %47, !dbg !601
  %bitcast_coercion35.i = bitcast i64 %48 to double, !dbg !580
  %49 = fadd double %value_phi12.i.lcssa, %bitcast_coercion35.i, !dbg !603
  %bitcast_coercion.i.1 = bitcast double %49 to i64, !dbg !580
  %50 = lshr i64 %bitcast_coercion.i.1, 32, !dbg !586
  %51 = trunc i64 %50 to i32, !dbg !589
  %52 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %51, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %53 = zext i32 %52 to i64, !dbg !592
  %54 = shl nuw i64 %53, 32, !dbg !594
  %55 = trunc i64 %bitcast_coercion.i.1 to i32, !dbg !589
  %56 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %55, i32 noundef 2, i32 noundef 31) #13, !dbg !590
  %57 = zext i32 %56 to i64, !dbg !596
  %58 = or i64 %54, %57, !dbg !601
  %bitcast_coercion35.i.1 = bitcast i64 %58 to double, !dbg !580
  %59 = fadd double %49, %bitcast_coercion35.i.1, !dbg !603
  %bitcast_coercion.i.2 = bitcast double %59 to i64, !dbg !580
  %60 = lshr i64 %bitcast_coercion.i.2, 32, !dbg !586
  %61 = trunc i64 %60 to i32, !dbg !589
  %62 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %61, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %63 = zext i32 %62 to i64, !dbg !592
  %64 = shl nuw i64 %63, 32, !dbg !594
  %65 = trunc i64 %bitcast_coercion.i.2 to i32, !dbg !589
  %66 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %65, i32 noundef 4, i32 noundef 31) #13, !dbg !590
  %67 = zext i32 %66 to i64, !dbg !596
  %68 = or i64 %64, %67, !dbg !601
  %bitcast_coercion35.i.2 = bitcast i64 %68 to double, !dbg !580
  %69 = fadd double %59, %bitcast_coercion35.i.2, !dbg !603
  %bitcast_coercion.i.3 = bitcast double %69 to i64, !dbg !580
  %70 = lshr i64 %bitcast_coercion.i.3, 32, !dbg !586
  %71 = trunc i64 %70 to i32, !dbg !589
  %72 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %71, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %73 = zext i32 %72 to i64, !dbg !592
  %74 = shl nuw i64 %73, 32, !dbg !594
  %75 = trunc i64 %bitcast_coercion.i.3 to i32, !dbg !589
  %76 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %75, i32 noundef 8, i32 noundef 31) #13, !dbg !590
  %77 = zext i32 %76 to i64, !dbg !596
  %78 = or i64 %74, %77, !dbg !601
  %bitcast_coercion35.i.3 = bitcast i64 %78 to double, !dbg !580
  %79 = fadd double %69, %bitcast_coercion35.i.3, !dbg !603
  %bitcast_coercion.i.4 = bitcast double %79 to i64, !dbg !580
  %80 = lshr i64 %bitcast_coercion.i.4, 32, !dbg !586
  %81 = trunc i64 %80 to i32, !dbg !589
  %82 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %81, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %83 = trunc i64 %bitcast_coercion.i.4 to i32, !dbg !589
  %84 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %83, i32 noundef 16, i32 noundef 31) #13, !dbg !590
  %.fca.1.extract15 = extractvalue [2 x i32] %39, 1, !dbg !579
  %.not39 = icmp eq i32 %.fca.1.extract15, 1, !dbg !605
  br i1 %.not39, label %L254.i, label %L273.i, !dbg !608

L254.i:                                           ; preds = %L203.i
  %85 = zext i32 %82 to i64, !dbg !592
  %86 = shl nuw i64 %85, 32, !dbg !594
  %87 = zext i32 %84 to i64, !dbg !596
  %88 = or i64 %86, %87, !dbg !601
  %bitcast_coercion35.i.4 = bitcast i64 %88 to double, !dbg !580
  %89 = fadd double %79, %bitcast_coercion35.i.4, !dbg !603
  %90 = add i32 %.fca.0.extract14, -1, !dbg !609
  %91 = sext i32 %90 to i64, !dbg !618
  %92 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %91, !dbg !618
  store double %89, double addrspace(3)* %92, align 8, !dbg !618, !tbaa !320, !noalias !619
  br label %L273.i, !dbg !622

L273.i:                                           ; preds = %L254.i, %L203.i
  call void @llvm.nvvm.barrier0() #13, !dbg !623
  %93 = lshr i32 %11, 5, !dbg !625
  %94 = and i32 %11, 992, !dbg !628
  %95 = icmp ne i32 %94, %11, !dbg !630
  %96 = zext i1 %95 to i32, !dbg !632
  %97 = add nuw nsw i32 %93, %96, !dbg !637
  %.not40.not = icmp ult i32 %9, %97, !dbg !639
  br i1 %.not40.not, label %L302.i, label %L322.i, !dbg !640

L302.i:                                           ; preds = %L273.i
  %98 = add i32 %.fca.1.extract15, -1, !dbg !641
  %99 = sext i32 %98 to i64, !dbg !650
  %100 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %99, !dbg !650
  %101 = load double, double addrspace(3)* %100, align 8, !dbg !650, !tbaa !320
  br label %L322.i, !dbg !650

L322.i:                                           ; preds = %L302.i, %L273.i
  %value_phi46.i = phi double [ %101, %L302.i ], [ %0, %L273.i ]
  %.not41 = icmp eq i32 %.fca.0.extract14, 1, !dbg !651
  br i1 %.not41, label %L331.i.preheader, label %L357.i, !dbg !654

L331.i.preheader:                                 ; preds = %L322.i
  %bitcast_coercion50.i = bitcast double %value_phi46.i to i64, !dbg !655
  %102 = lshr i64 %bitcast_coercion50.i, 32, !dbg !661
  %103 = trunc i64 %102 to i32, !dbg !664
  %104 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %103, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %105 = zext i32 %104 to i64, !dbg !667
  %106 = shl nuw i64 %105, 32, !dbg !669
  %107 = trunc i64 %bitcast_coercion50.i to i32, !dbg !664
  %108 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %107, i32 noundef 1, i32 noundef 31) #13, !dbg !665
  %109 = zext i32 %108 to i64, !dbg !671
  %110 = or i64 %106, %109, !dbg !676
  %bitcast_coercion51.i = bitcast i64 %110 to double, !dbg !655
  %111 = fadd double %value_phi46.i, %bitcast_coercion51.i, !dbg !678
  %bitcast_coercion50.i.1 = bitcast double %111 to i64, !dbg !655
  %112 = lshr i64 %bitcast_coercion50.i.1, 32, !dbg !661
  %113 = trunc i64 %112 to i32, !dbg !664
  %114 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %113, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %115 = zext i32 %114 to i64, !dbg !667
  %116 = shl nuw i64 %115, 32, !dbg !669
  %117 = trunc i64 %bitcast_coercion50.i.1 to i32, !dbg !664
  %118 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %117, i32 noundef 2, i32 noundef 31) #13, !dbg !665
  %119 = zext i32 %118 to i64, !dbg !671
  %120 = or i64 %116, %119, !dbg !676
  %bitcast_coercion51.i.1 = bitcast i64 %120 to double, !dbg !655
  %121 = fadd double %111, %bitcast_coercion51.i.1, !dbg !678
  %bitcast_coercion50.i.2 = bitcast double %121 to i64, !dbg !655
  %122 = lshr i64 %bitcast_coercion50.i.2, 32, !dbg !661
  %123 = trunc i64 %122 to i32, !dbg !664
  %124 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %123, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %125 = zext i32 %124 to i64, !dbg !667
  %126 = shl nuw i64 %125, 32, !dbg !669
  %127 = trunc i64 %bitcast_coercion50.i.2 to i32, !dbg !664
  %128 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %127, i32 noundef 4, i32 noundef 31) #13, !dbg !665
  %129 = zext i32 %128 to i64, !dbg !671
  %130 = or i64 %126, %129, !dbg !676
  %bitcast_coercion51.i.2 = bitcast i64 %130 to double, !dbg !655
  %131 = fadd double %121, %bitcast_coercion51.i.2, !dbg !678
  %bitcast_coercion50.i.3 = bitcast double %131 to i64, !dbg !655
  %132 = lshr i64 %bitcast_coercion50.i.3, 32, !dbg !661
  %133 = trunc i64 %132 to i32, !dbg !664
  %134 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %133, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %135 = zext i32 %134 to i64, !dbg !667
  %136 = shl nuw i64 %135, 32, !dbg !669
  %137 = trunc i64 %bitcast_coercion50.i.3 to i32, !dbg !664
  %138 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %137, i32 noundef 8, i32 noundef 31) #13, !dbg !665
  %139 = zext i32 %138 to i64, !dbg !671
  %140 = or i64 %136, %139, !dbg !676
  %bitcast_coercion51.i.3 = bitcast i64 %140 to double, !dbg !655
  %141 = fadd double %131, %bitcast_coercion51.i.3, !dbg !678
  %bitcast_coercion50.i.4 = bitcast double %141 to i64, !dbg !655
  %142 = lshr i64 %bitcast_coercion50.i.4, 32, !dbg !661
  %143 = trunc i64 %142 to i32, !dbg !664
  %144 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %143, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %145 = zext i32 %144 to i64, !dbg !667
  %146 = shl nuw i64 %145, 32, !dbg !669
  %147 = trunc i64 %bitcast_coercion50.i.4 to i32, !dbg !664
  %148 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %147, i32 noundef 16, i32 noundef 31) #13, !dbg !665
  %149 = zext i32 %148 to i64, !dbg !671
  %150 = or i64 %146, %149, !dbg !676
  %bitcast_coercion51.i.4 = bitcast i64 %150 to double, !dbg !655
  %151 = fadd double %141, %bitcast_coercion51.i.4, !dbg !678
  br label %L357.i, !dbg !680

L357.i:                                           ; preds = %L331.i.preheader, %L322.i
  %value_phi52.i = phi double [ %value_phi46.i, %L322.i ], [ %151, %L331.i.preheader ]
  %.not42 = icmp eq i32 %9, 0, !dbg !680
  br i1 %.not42, label %L362.i, label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !683

L362.i:                                           ; preds = %L357.i
  %152 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #13, !dbg !684
  %153 = mul i64 %20, %152, !dbg !695
  %154 = add i64 %unbox3.i, -1, !dbg !699
  %155 = add i64 %154, %153, !dbg !700
  %156 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*, !dbg !708
  %157 = getelementptr inbounds double, double addrspace(1)* %156, i64 %155, !dbg !708
  store double %value_phi52.i, double addrspace(1)* %157, align 8, !dbg !708, !tbaa !248, !noalias !619
  br label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !709

julia_partial_mapreduce_grid_48233_inner.exit:    ; preds = %L362.i, %L357.i, %entry
  call void @llvm.lifetime.end.p0i8(i64 noundef 16, i8* noundef nonnull %6) #13, !dbg !710
  ret void, !dbg !485
}

 constantarg[double %0] = 0 type: {[-1]:Float@double} - vals: {}
 constantarg[[1 x [1 x [1 x i64]]] %1] = 1 type: {[-1]:Integer} - vals: {}
 constantarg[[1 x [1 x [1 x i64]]] %2] = 1 type: {[-1]:Integer} - vals: {}
 constantarg[{ i8 addrspace(1)*, i64, [2 x i64], i64 } %3] = 0 type: {[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer} - vals: {}
 constantarg[{ [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4] = 0 type: {[0]:Pointer, [0,-1]:Float@double, [8]:Integer, [9]:Integer, [10]:Integer, [11]:Integer, [12]:Integer, [13]:Integer, [14]:Integer, [15]:Integer, [16]:Integer, [17]:Integer, [18]:Integer, [19]:Integer, [20]:Integer, [21]:Integer, [22]:Integer, [23]:Integer, [24]:Integer, [25]:Integer, [26]:Integer, [27]:Integer, [28]:Integer, [29]:Integer, [30]:Integer, [31]:Integer, [32]:Integer, [33]:Integer, [34]:Integer, [35]:Integer, [36]:Integer, [37]:Integer, [38]:Integer, [39]:Integer} - vals: {}
 constantinst[  %5 = alloca [2 x i64], align 8] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  %.fca.0.0.0.extract13 = extractvalue [1 x [1 x [1 x i64]]] %1, 0, 0, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.0.0.extract9 = extractvalue [1 x [1 x [1 x i64]]] %2, 0, 0, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 0, !dbg !13] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %.fca.2.0.extract = extractvalue { i8 addrspace(1)*, i64, [2 x i64], i64 } %3, 2, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.0.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 0, !dbg !13] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %.fca.0.0.2.0.extract = extractvalue { [1 x { i8 addrspace(1)*, i64, [1 x i64], i64 }], [1 x [1 x i64]] } %4, 0, 0, 2, 0, !dbg !13, !enzyme_inactive !12] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %6 = bitcast [2 x i64]* %5 to i8*] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  call void @llvm.lifetime.start.p0i8(i64 noundef 16, i8* noundef nonnull align 8 dereferenceable(16) %6) #13] = 1 val:1 type: {}
 constantinst[  %7 = call {}*** @julia.get_pgcstack() #13] = 1 val:1 type: {}
 constantinst[  %8 = icmp sgt i64 %.fca.0.0.0.extract9, 0, !dbg !14] = 1 val:1 type: {[-1]:Integer}
 constantinst[  call void @llvm.assume(i1 noundef %8) #13, !dbg !22] = 1 val:1 type: {}
 constantinst[  %9 = call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #13, !dbg !25, !range !38] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %10 = add nuw nsw i32 %9, 1, !dbg !39] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %11 = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x() #13, !dbg !41, !range !49] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %12 = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() #13, !dbg !50, !range !58] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %13 = add nuw nsw i32 %12, 1, !dbg !59] = 1 val:1 type: {[-1]:Integer}
 constantinst[  call fastcc void @julia_fldmod1_48281([2 x i64]* noalias nocapture nofree noundef nonnull writeonly sret([2 x i64]) align 8 dereferenceable(16) %5, i32 signext %13, i64 signext %.fca.0.0.0.extract9) #13, !dbg !60] = 1 val:1 type: {}
 constantinst[  %14 = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() #13, !dbg !61, !range !69] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %15 = zext i32 %14 to i64, !dbg !70] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 1, !dbg !87] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  %17 = udiv i64 %15, %.fca.0.0.0.extract9, !dbg !90] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %unbox3.i = load i64, i64* %16, align 8, !dbg !92, !tbaa !95, !alias.scope !99, !noalias !102] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not = icmp sgt i64 %unbox3.i, %.fca.0.0.0.extract9, !dbg !92] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not, label %julia_partial_mapreduce_grid_48233_inner.exit, label %L49.i, !dbg !107] = 1 val:1 type: {}
 constantinst[  %18 = getelementptr inbounds [2 x i64], [2 x i64]* %5, i64 0, i64 0, !dbg !108] = 1 val:1 type: {[-1]:Pointer, [-1,-1]:Integer}
 constantinst[  %19 = fadd double %0, %0, !dbg !110] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %unbox11.i = load i64, i64* %18, align 8, !dbg !117, !tbaa !95, !alias.scope !99, !noalias !102] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %20 = add i64 %unbox11.i, -1, !dbg !117] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %21 = zext i32 %11 to i64, !dbg !120] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %22 = mul i64 %20, %21, !dbg !124] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %23 = zext i32 %10 to i64, !dbg !126] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %24 = add i64 %22, %23, !dbg !128] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not3647 = icmp sgt i64 %24, %.fca.0.0.0.extract13, !dbg !130] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not3647, label %L203.i, label %L97.i.lr.ph, !dbg !132] = 1 val:1 type: {}
 constantinst[  %25 = call i64 @llvm.smax.i64(i64 %.fca.0.0.2.0.extract, i64 noundef 0) #13, !dbg !13] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not38 = icmp eq i64 %25, 1] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %26 = bitcast i8 addrspace(1)* %.fca.0.0.0.extract to double addrspace(1)*] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %27 = mul nuw nsw i64 %17, %21] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %28 = mul i64 %17, %21, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br label %L97.i, !dbg !132] = 1 val:1 type: {}
 constantinst[  %iv = phi i64 [ %iv.next, %L97.i ], [ 0, %L97.i.lr.ph ]] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %value_phi12.i49 = phi double [ %19, %L97.i.lr.ph ], [ %35, %L97.i ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %29 = mul i64 %28, %iv, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %iv.next = add nuw nsw i64 %iv, 1, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %30 = add i64 %24, %29, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %31 = call i64 @llvm.smax.i64(i64 %30, i64 %unbox3.i) #13, !dbg !133] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.op = add i64 %31, -1, !dbg !144] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %32 = select i1 %.not38, i64 0, i64 %.op, !dbg !144] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %33 = getelementptr inbounds double, double addrspace(1)* %26, i64 %32, !dbg !173] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %34 = load double, double addrspace(1)* %33, align 8, !dbg !173, !tbaa !174] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %35 = fadd double %value_phi12.i49, %34, !dbg !177] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %36 = add i64 %30, %27, !dbg !179] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not36 = icmp sgt i64 %36, %.fca.0.0.0.extract13, !dbg !130] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not36, label %L203.i.loopexit, label %L97.i, !dbg !132] = 1 val:1 type: {}
 constantinst[  br label %L203.i, !dbg !181] = 1 val:1 type: {}
 constantinst[  %value_phi12.i.lcssa = phi double [ %19, %L49.i ], [ %35, %L203.i.loopexit ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %37 = call i32 @llvm.nvvm.read.ptx.sreg.warpsize() #13, !dbg !181] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %38 = icmp eq i32 %37, 32, !dbg !186] = 1 val:1 type: {[-1]:Integer}
 constantinst[  call void @llvm.assume(i1 noundef %38) #13, !dbg !189] = 1 val:1 type: {}
 constantinst[  %39 = call fastcc [2 x i32] @julia_fldmod1_48268(i32 signext %10) #14, !dbg !190] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.fca.0.extract14 = extractvalue [2 x i32] %39, 0, !dbg !190] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %bitcast_coercion.i = bitcast double %value_phi12.i.lcssa to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %41 = trunc i64 %40 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %42 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %41, i32 noundef 1, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %43 = zext i32 %42 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %44 = shl nuw i64 %43, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %45 = trunc i64 %bitcast_coercion.i to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %46 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %45, i32 noundef 1, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %47 = zext i32 %46 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %48 = or i64 %44, %47, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i = bitcast i64 %48 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %49 = fadd double %value_phi12.i.lcssa, %bitcast_coercion35.i, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.1 = bitcast double %49 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %50 = lshr i64 %bitcast_coercion.i.1, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %51 = trunc i64 %50 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %52 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %51, i32 noundef 2, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %53 = zext i32 %52 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %54 = shl nuw i64 %53, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %55 = trunc i64 %bitcast_coercion.i.1 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %56 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %55, i32 noundef 2, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %57 = zext i32 %56 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %58 = or i64 %54, %57, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.1 = bitcast i64 %58 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %59 = fadd double %49, %bitcast_coercion35.i.1, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.2 = bitcast double %59 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %60 = lshr i64 %bitcast_coercion.i.2, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %61 = trunc i64 %60 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %62 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %61, i32 noundef 4, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %63 = zext i32 %62 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %64 = shl nuw i64 %63, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %65 = trunc i64 %bitcast_coercion.i.2 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %66 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %65, i32 noundef 4, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %67 = zext i32 %66 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %68 = or i64 %64, %67, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.2 = bitcast i64 %68 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %69 = fadd double %59, %bitcast_coercion35.i.2, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.3 = bitcast double %69 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %70 = lshr i64 %bitcast_coercion.i.3, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %71 = trunc i64 %70 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %72 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %71, i32 noundef 8, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %73 = zext i32 %72 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %74 = shl nuw i64 %73, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %75 = trunc i64 %bitcast_coercion.i.3 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %76 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %75, i32 noundef 8, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %77 = zext i32 %76 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %78 = or i64 %74, %77, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.3 = bitcast i64 %78 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %79 = fadd double %69, %bitcast_coercion35.i.3, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion.i.4 = bitcast double %79 to i64, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %80 = lshr i64 %bitcast_coercion.i.4, 32, !dbg !202] = 0 val:0 type: {}
 constantinst[  %81 = trunc i64 %80 to i32, !dbg !206] = 0 val:0 type: {}
 constantinst[  %82 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %81, i32 noundef 16, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %83 = trunc i64 %bitcast_coercion.i.4 to i32, !dbg !206] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %84 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %83, i32 noundef 16, i32 noundef 31) #13, !dbg !207] = 0 val:0 type: {}
 constantinst[  %.fca.1.extract15 = extractvalue [2 x i32] %39, 1, !dbg !190] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not39 = icmp eq i32 %.fca.1.extract15, 1, !dbg !227] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not39, label %L254.i, label %L273.i, !dbg !230] = 1 val:1 type: {}
 constantinst[  %85 = zext i32 %82 to i64, !dbg !210] = 0 val:0 type: {}
 constantinst[  %86 = shl nuw i64 %85, 32, !dbg !214] = 0 val:0 type: {}
 constantinst[  %87 = zext i32 %84 to i64, !dbg !217] = 0 val:0 type: {}
 constantinst[  %88 = or i64 %86, %87, !dbg !223] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion35.i.4 = bitcast i64 %88 to double, !dbg !191] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %89 = fadd double %79, %bitcast_coercion35.i.4, !dbg !225] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %90 = add i32 %.fca.0.extract14, -1, !dbg !231] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %91 = sext i32 %90 to i64, !dbg !245] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %92 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %91, !dbg !245] = 1 val:0 type: {[-1]:Pointer, [-1,0]:Float@double}
 constantinst[  store double %89, double addrspace(3)* %92, align 8, !dbg !245, !tbaa !246, !noalias !248] = 0 val:1 type: {}
 constantinst[  br label %L273.i, !dbg !251] = 1 val:1 type: {}
 constantinst[  call void @llvm.nvvm.barrier0() #13, !dbg !252] = 1 val:1 type: {}
 constantinst[  %93 = lshr i32 %11, 5, !dbg !256] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %94 = and i32 %11, 992, !dbg !260] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %95 = icmp ne i32 %94, %11, !dbg !262] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %96 = zext i1 %95 to i32, !dbg !265] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %97 = add nuw nsw i32 %93, %96, !dbg !272] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %.not40.not = icmp ult i32 %9, %97, !dbg !274] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not40.not, label %L302.i, label %L322.i, !dbg !275] = 1 val:1 type: {}
 constantinst[  %98 = add i32 %.fca.1.extract15, -1, !dbg !276] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %99 = sext i32 %98 to i64, !dbg !285] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %100 = getelementptr inbounds double, double addrspace(3)* bitcast ([256 x i8] addrspace(3)* @shmem to double addrspace(3)*), i64 %99, !dbg !285] = 1 val:0 type: {[-1]:Pointer, [-1,0]:Float@double}
 constantinst[  %101 = load double, double addrspace(3)* %100, align 8, !dbg !285, !tbaa !246] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  br label %L322.i, !dbg !285] = 1 val:1 type: {}
 constantinst[  %value_phi46.i = phi double [ %101, %L302.i ], [ %0, %L273.i ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %.not41 = icmp eq i32 %.fca.0.extract14, 1, !dbg !286] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not41, label %L331.i.preheader, label %L357.i, !dbg !289] = 1 val:1 type: {}
 constantinst[  %bitcast_coercion50.i = bitcast double %value_phi46.i to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %102 = lshr i64 %bitcast_coercion50.i, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %103 = trunc i64 %102 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %104 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %103, i32 noundef 1, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %105 = zext i32 %104 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %106 = shl nuw i64 %105, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %107 = trunc i64 %bitcast_coercion50.i to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %108 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %107, i32 noundef 1, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %109 = zext i32 %108 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %110 = or i64 %106, %109, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i = bitcast i64 %110 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %111 = fadd double %value_phi46.i, %bitcast_coercion51.i, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.1 = bitcast double %111 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %112 = lshr i64 %bitcast_coercion50.i.1, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %113 = trunc i64 %112 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %114 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %113, i32 noundef 2, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %115 = zext i32 %114 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %116 = shl nuw i64 %115, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %117 = trunc i64 %bitcast_coercion50.i.1 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %118 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %117, i32 noundef 2, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %119 = zext i32 %118 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %120 = or i64 %116, %119, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.1 = bitcast i64 %120 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %121 = fadd double %111, %bitcast_coercion51.i.1, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.2 = bitcast double %121 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %122 = lshr i64 %bitcast_coercion50.i.2, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %123 = trunc i64 %122 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %124 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %123, i32 noundef 4, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %125 = zext i32 %124 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %126 = shl nuw i64 %125, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %127 = trunc i64 %bitcast_coercion50.i.2 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %128 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %127, i32 noundef 4, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %129 = zext i32 %128 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %130 = or i64 %126, %129, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.2 = bitcast i64 %130 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %131 = fadd double %121, %bitcast_coercion51.i.2, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.3 = bitcast double %131 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %132 = lshr i64 %bitcast_coercion50.i.3, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %133 = trunc i64 %132 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %134 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %133, i32 noundef 8, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %135 = zext i32 %134 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %136 = shl nuw i64 %135, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %137 = trunc i64 %bitcast_coercion50.i.3 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %138 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %137, i32 noundef 8, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %139 = zext i32 %138 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %140 = or i64 %136, %139, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.3 = bitcast i64 %140 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %141 = fadd double %131, %bitcast_coercion51.i.3, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion50.i.4 = bitcast double %141 to i64, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %142 = lshr i64 %bitcast_coercion50.i.4, 32, !dbg !296] = 0 val:0 type: {}
 constantinst[  %143 = trunc i64 %142 to i32, !dbg !299] = 0 val:0 type: {}
 constantinst[  %144 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %143, i32 noundef 16, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %145 = zext i32 %144 to i64, !dbg !302] = 0 val:0 type: {}
 constantinst[  %146 = shl nuw i64 %145, 32, !dbg !304] = 0 val:0 type: {}
 constantinst[  %147 = trunc i64 %bitcast_coercion50.i.4 to i32, !dbg !299] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %148 = call i32 @llvm.nvvm.shfl.sync.down.i32(i32 noundef -1, i32 %147, i32 noundef 16, i32 noundef 31) #13, !dbg !300] = 0 val:0 type: {}
 constantinst[  %149 = zext i32 %148 to i64, !dbg !306] = 0 val:0 type: {}
 constantinst[  %150 = or i64 %146, %149, !dbg !311] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %bitcast_coercion51.i.4 = bitcast i64 %150 to double, !dbg !290] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %151 = fadd double %141, %bitcast_coercion51.i.4, !dbg !313] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  br label %L357.i, !dbg !315] = 1 val:1 type: {}
 constantinst[  %value_phi52.i = phi double [ %value_phi46.i, %L322.i ], [ %151, %L331.i.preheader ]] = 0 val:0 type: {[-1]:Float@double}
 constantinst[  %.not42 = icmp eq i32 %9, 0, !dbg !315] = 1 val:1 type: {[-1]:Integer}
 constantinst[  br i1 %.not42, label %L362.i, label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !318] = 1 val:1 type: {}
 constantinst[  %152 = call i64 @llvm.smax.i64(i64 %.fca.2.0.extract, i64 noundef 0) #13, !dbg !319] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %153 = mul i64 %20, %152, !dbg !337] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %154 = add i64 %unbox3.i, -1, !dbg !342] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %155 = add i64 %154, %153, !dbg !343] = 1 val:1 type: {[-1]:Integer}
 constantinst[  %156 = bitcast i8 addrspace(1)* %.fca.0.extract to double addrspace(1)*, !dbg !351] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  %157 = getelementptr inbounds double, double addrspace(1)* %156, i64 %155, !dbg !351] = 1 val:0 type: {[-1]:Pointer, [-1,-1]:Float@double}
 constantinst[  store double %value_phi52.i, double addrspace(1)* %157, align 8, !dbg !351, !tbaa !174, !noalias !248] = 0 val:1 type: {}
 constantinst[  br label %julia_partial_mapreduce_grid_48233_inner.exit, !dbg !352] = 1 val:1 type: {}
 constantinst[  call void @llvm.lifetime.end.p0i8(i64 noundef 16, i8* noundef nonnull %6) #13, !dbg !353] = 1 val:1 type: {}
 constantinst[  ret void, !dbg !13] = 1 val:1 type: {}
cannot handle unknown binary operator:   %40 = lshr i64 %bitcast_coercion.i, 32, !dbg !202

Stacktrace:
 [1] >>>
   @ ./int.jl:530
 [2] >>>
   @ ./int.jl:538
 [3] shfl_recurse
   @ ~/.julia/dev/CUDA/src/device/intrinsics/warp.jl:86
 [4] shfl_recurse
   @ ~/.julia/dev/CUDA/src/device/intrinsics/warp.jl:97
 [5] shfl_down_sync (repeats 2 times)
   @ ~/.julia/dev/CUDA/src/device/intrinsics/warp.jl:78
 [6] reduce_warp
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:12
 [7] reduce_block
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:28
 [8] partial_mapreduce_grid
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:126
 [9] partial_mapreduce_grid
   @ ~/.julia/dev/CUDA/src/mapreduce.jl:0
 was thrown during kernel execution on thread (1, 1, 1) in block (1, 1, 1).
Stacktrace not available, run Julia on debug level 2 for more details (by passing -g2 to the executable).

ERROR: Enzyme execution failed.
Enzyme: Not yet implemented forward for jl_eqtable_get
Stacktrace:
 [1] get
   @ ./iddict.jl:102
 [2] macro expansion
   @ ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:208
 [3] #_mapreduce#42
   @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:71

Stacktrace:
  [1] throwerr(cstr::Cstring)
    @ Enzyme.Compiler ~/.julia/dev/Enzyme/src/compiler.jl:1696
  [2] get
    @ ./iddict.jl:102 [inlined]
  [3] macro expansion
    @ ~/.julia/packages/GPUArraysCore/GMsgk/src/GPUArraysCore.jl:208 [inlined]
  [4] #_mapreduce#42
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:71
  [5] _mapreduce
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:33 [inlined]
  [6] mapreduce
    @ ~/.julia/packages/GPUArrays/bbZD0/src/host/mapreduce.jl:28 [inlined]
  [7] _sum
    @ ./reducedim.jl:1015 [inlined]
  [8] _sum
    @ ./reducedim.jl:1014 [inlined]
  [9] sum
    @ ./reducedim.jl:1010 [inlined]
 [10] f
    @ ./REPL[2]:1 [inlined]
 [11] fwddiffejulia_f_2022wrap
    @ ./REPL[2]:0
 [12] macro expansion
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6673 [inlined]
 [13] enzyme_call
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6273 [inlined]
 [14] ForwardModeThunk
    @ ~/.julia/dev/Enzyme/src/compiler.jl:6153 [inlined]
 [15] autodiff
    @ ~/.julia/dev/Enzyme/src/Enzyme.jl:427 [inlined]
 [16] autodiff(::ForwardMode{FFIABI}, ::typeof(f), ::Type{Duplicated}, ::Duplicated{CuArray{…}}, ::Const{CuArray{…}})
    @ Enzyme ~/.julia/dev/Enzyme/src/Enzyme.jl:326
 [17] top-level scope
    @ REPL[7]:1
Some type information was truncated. Use `show(err)` to see complete types.
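
For context on the `cannot handle unknown binary operator` message above: the stack trace points at `>>>` called from `shfl_recurse` in CUDA.jl's `warp.jl`, and the IR shows each `double` being bitcast to `i64`, split with `lshr`/`trunc` into two `i32` halves, shuffled, and joined again with `zext`/`shl`/`or`. Below is a minimal CPU-only sketch of that bit pattern, not the CUDA.jl implementation. The names `fake_shfl` and `split_shuffle_join` are hypothetical, and the shuffle is replaced by an identity function so the snippet runs without a GPU.

```julia
# CPU-only sketch of the split/shuffle/join pattern seen in the IR above.
# `fake_shfl` is a hypothetical stand-in for the 32-bit warp shuffle intrinsic;
# here it is the identity, i.e. it pretends the other lane holds the same bits.
fake_shfl(v::UInt32) = v

function split_shuffle_join(x::Float64)
    bits = reinterpret(UInt64, x)              # bitcast double -> i64
    hi = (bits >>> 32) % UInt32                # lshr by 32, then trunc to i32
    lo = bits % UInt32                         # trunc to i32 (low half)
    hi = fake_shfl(hi)                         # shuffle the high half
    lo = fake_shfl(lo)                         # shuffle the low half
    joined = (UInt64(hi) << 32) | UInt64(lo)   # zext, shl by 32, or
    return reinterpret(Float64, joined)        # bitcast i64 -> double
end

# With the identity shuffle the value round-trips bit-for-bit.
@assert split_shuffle_join(1.5) === 1.5
```

The floating-point value travels through integer shifts on its way to the shuffle. In the dump above, `%bitcast_coercion.i` is typed `Float@double` and `%40 = lshr ...` is marked non-constant, which would explain why Enzyme tries to differentiate the `lshr` and finds no rule for it.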

As I understand it, reverse mode will be added in https://github.com/JuliaGPU/CUDA.jl/pull/2422.

EnzymeAD / Enzyme.jl

`CuArray` broadcasting #1454