Open haakon-e opened 1 month ago
cc: @dennisYatunin
@haakon-e, could you please update the reproducer? I think that the existing one has a bug. In particular:
level_field = Fields.level(Fields.Field(Float64, center_space), 1)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)
level_field + ᶠscalar_field
is attempting to add fields on two different spaces, which is (purposefully) not allowed. I assume that this is meant to be (and we still need import ClimaCore: Utilities
above):
level_field = Fields.level(Fields.Field(Float64, face_space), Utilities.half)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)
I think the correct reproducer (as I expect) is:
import ClimaCore: Domains, Meshes, Geometry, Grids, Spaces, Topologies, Hypsography, Fields, Operators, Utilities
import ClimaComms
ENV["CLIMACOMMS_DEVICE"] = "CUDA"
using CUDA
# ENV["CLIMACOMMS_DEVICE"] = "CPU"
comms_ctx = ClimaComms.SingletonCommsContext()
h_domain = Domains.RectangleDomain(
Domains.IntervalDomain(Geometry.XPoint(0.0), Geometry.XPoint(1.0); periodic = true),
Domains.IntervalDomain(Geometry.YPoint(0.0), Geometry.YPoint(1.0); periodic = true)
)
h_mesh = Meshes.RectilinearMesh(h_domain, 10, 10)
h_grid = Spaces.grid(Spaces.SpectralElementSpace2D(
Topologies.DistributedTopology2D(comms_ctx, h_mesh, Topologies.spacefillingcurve(h_mesh)),
Spaces.Quadratures.GLL{4}(),
))
z_domain = Domains.IntervalDomain(Geometry.ZPoint(0.0), Geometry.ZPoint(1.0); boundary_names = (:bottom, :top))
z_grid = Grids.FiniteDifferenceGrid(Topologies.IntervalTopology(
comms_ctx, Meshes.IntervalMesh(z_domain, Meshes.Uniform(); nelems = 10)
))
grid = Grids.ExtrudedFiniteDifferenceGrid(h_grid, z_grid, Hypsography.Flat(); deep = false)
center_space = Spaces.CenterExtrudedFiniteDifferenceSpace(grid)
face_space = Spaces.FaceExtrudedFiniteDifferenceSpace(grid)
# Create fields and show that it fails
ᶜgradᵥ = Operators.GradientF2C()
level_field = Fields.level(Fields.Field(Float64, face_space), Utilities.half)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)
import LazyBroadcast as LB
bc = LB.@lazy @. ᶜgradᵥ(level_field + ᶠscalar_field);
Alarmingly, neither of these break on the CPU!
Some progress from earlier today:
julia> @inline Operators.reconstruct_placeholder_space(::Operators.LevelPlaceholderSpace, parent_space) =
Spaces.SpectralElementSpace2D(Spaces.level(Spaces.grid(parent_space), Operators.PlusHalf(0)))
julia> @. ᶜgradᵥ(level_field + ᶠscalar_field) ClimaCore.Geometry.Covariant3Vector{Float64}-valued Field: components: data: 1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
- The following method overwrite does not fix the issue:
```julia
julia> @inline Operators.reconstruct_placeholder_space(::Operators.LevelPlaceholderSpace, parent_space) =
Spaces.level(parent_space, Operators.PlusHalf(0))
julia> @. ᶜgradᵥ(level_field + ᶠscalar_field)
ERROR: InvalidIRError: compiling MethodInstance for ClimaCoreCUDAExt.copyto_stencil_kernel!(::ClimaCore.Fields.Field{…}, ::ClimaCore.Operators.StencilBroadcasted{…}, ::ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, ::NTuple{…}, ::ClimaCore.DataLayouts.UniversalSize{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
Reason: unsupported call to an unknown function (call to julia.push_gc_frame)
Reason: unsupported call to an unknown function (call to julia.get_gc_frame_slot)
Reason: unsupported dynamic function invocation (call to level)
Stacktrace:
[1] copyto_stencil_kernel!
@ /central/home/dyatunin/ClimaCore.jl/ext/cuda/operators_finite_difference.jl:0
Reason: unsupported call to an unknown function (call to julia.pop_gc_frame)
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl
We can conclude that the inference problem on GPUs is occurring during dispatch of
level(space::FaceExtrudedFiniteDifferenceSpace3D, v::PlusHalf) =
SpectralElementSpace2D(level(grid(space), v))
This might be happening because the FaceExtrudedFiniteDifferenceSpace3D
type is too complex for the GPU compiler. It should be easy to refactor this so that it no longer dispatches over such complicated types; e.g., we could dispatch over Spaces.staggering(space)
instead of the space
itself.
Note for future self (and @dennisYatunin). One thing I noticed is that we don't always call Base.Broadcast.instantiate
for our datalayout kernels (we sometimes assume that it's been called by the field layer). Doing
import LazyBroadcast as LB
bc = LB.@lazy @. ᶜgradᵥ(level_field + ᶠscalar_field);
materialize(bc)
reproduces the error above (w.r.t. inference failure in level
), and
import LazyBroadcast as LB
bc = LB.@lazy @. ᶜgradᵥ(level_field + ᶠscalar_field);
bc = Base.Broadcast.instantiate(bc);
materialize(bc)
yields a different error (unmatched spaces). It may be that this doesn't fix the issue, but I think that we may want to fix this first.
Describe the bug
The following code produces an error:
produces the following error
```julia ERROR: InvalidIRError: compiling MethodInstance for ClimaCoreCUDAExt.copyto_stencil_kernel!(::ClimaCore.Fields.Field{…}, ::ClimaCore.Operators.StencilBroadcasted{…}, ::ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, ::NTuple{…}, ::ClimaCore.DataLayouts.UniversalSize{…}) resulted in invalid LLVM IR Reason: unsupported dynamic function invocation (call to level) Stacktrace: [1] copyto_stencil_kernel! @ ~/.julia/packages/ClimaCore/wHC4M/ext/cuda/operators_finite_difference.jl:0 Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl Stacktrace: [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/validation.jl:147 [2] macro expansion @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:458 [inlined] [3] macro expansion @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:253 [inlined] [4] macro expansion @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:457 [inlined] [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/utils.jl:103 [6] emit_llvm @ ~/.julia/packages/GPUCompiler/Y4hSX/src/utils.jl:97 [inlined] [7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:136 [8] codegen @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:115 [inlined] [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:111 [10] compile @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:103 [inlined] [11] #1145 @ ~/.julia/packages/CUDA/Tl08O/src/compiler/compilation.jl:254 [inlined] [12] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{}) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:52 [13] JuliaContext(f::Function) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:42 [14] compile(job::GPUCompiler.CompilerJob) @ CUDA ~/.julia/packages/CUDA/Tl08O/src/compiler/compilation.jl:253 [15] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link)) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/execution.jl:237 [16] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/execution.jl:151 [17] macro expansion @ ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:369 [inlined] [18] macro expansion @ ./lock.jl:267 [inlined] [19] cufunction(f::typeof(ClimaCoreCUDAExt.copyto_stencil_kernel!), tt::Type{Tuple{ClimaCore.Fields.Field{…}, ClimaCore.Operators.StencilBroadcasted{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, NTuple{…}, ClimaCore.DataLayouts.UniversalSize{…}}}; kwargs::@Kwargs{always_inline::Bool}) @ CUDA ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:364 [20] cufunction @ ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:361 [inlined] [21] macro expansion @ ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:112 [inlined] [22] threads_via_occupancy(f!::typeof(ClimaCoreCUDAExt.copyto_stencil_kernel!), args::Tuple{ClimaCore.Fields.Field{…}, ClimaCore.Operators.StencilBroadcasted{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, NTuple{…}, ClimaCore.DataLayouts.UniversalSize{…}}) @ ClimaCoreCUDAExt ~/.julia/packages/ClimaCore/wHC4M/ext/cuda/cuda_utils.jl:94 [23] copyto!(out::ClimaCore.Fields.Field{ClimaCore.DataLayouts.VIJFH{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}}, bc::ClimaCore.Operators.StencilBroadcasted{ClimaCoreCUDAExt.CUDAColumnStencilStyle, ClimaCore.Operators.GradientF2C{…}, Tuple{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}}) @ ClimaCoreCUDAExt ~/.julia/packages/ClimaCore/wHC4M/ext/cuda/operators_finite_difference.jl:27 [24] copy @ ~/.julia/packages/ClimaCore/wHC4M/src/Operators/common.jl:52 [inlined] [25] materialize(opbc::ClimaCore.Operators.StencilBroadcasted{ClimaCore.Operators.StencilStyle, ClimaCore.Operators.GradientF2C{@NamedTuple{}}, Tuple{Base.Broadcast.Broadcasted{ClimaCore.Fields.FieldStyle{…}, Nothing, typeof(ClimaCore.RecursiveApply.radd), Tuple{…}}}, Nothing}) @ ClimaCore.Operators ~/.julia/packages/ClimaCore/wHC4M/src/Operators/common.jl:57 [26] top-level scope @ REPL[6]:1 Some type information was truncated. Use `show(err)` to see complete types. ```To Reproduce
See reproducer:
```julia import Pkg Pkg.activate("benchmarks/3d") # run from `ClimaCore.jl` root import ClimaCore: Domains, Meshes, Geometry, Grids, Spaces, Topologies, Hypsography, Fields, Operators import ClimaComms # Request a GPU device on hpc: # srun -G1 -t 01:00:00 --partition gpu --pty bash -l ENV["CLIMACOMMS_DEVICE"] = "CUDA" using CUDA comms_ctx = ClimaComms.SingletonCommsContext() h_domain = Domains.RectangleDomain( Domains.IntervalDomain(Geometry.XPoint(0.0), Geometry.XPoint(1.0); periodic = true), Domains.IntervalDomain(Geometry.YPoint(0.0), Geometry.YPoint(1.0); periodic = true) ) h_mesh = Meshes.RectilinearMesh(h_domain, 10, 10) h_grid = Spaces.grid(Spaces.SpectralElementSpace2D( Topologies.DistributedTopology2D(comms_ctx, h_mesh, Topologies.spacefillingcurve(h_mesh)), Spaces.Quadratures.GLL{4}(), )) z_domain = Domains.IntervalDomain(Geometry.ZPoint(0.0), Geometry.ZPoint(1.0); boundary_names = (:bottom, :top)) z_grid = Grids.FiniteDifferenceGrid(Topologies.IntervalTopology( comms_ctx, Meshes.IntervalMesh(z_domain, Meshes.Uniform(); nelems = 10) )) grid = Grids.ExtrudedFiniteDifferenceGrid(h_grid, z_grid, Hypsography.Flat(); deep = false) center_space = Spaces.CenterExtrudedFiniteDifferenceSpace(grid) face_space = Spaces.FaceExtrudedFiniteDifferenceSpace(grid) # Create fields and show that it fails ᶜgradᵥ = Operators.GradientF2C() level_field = Fields.level(Fields.Field(Float64, center_space), 1) ᶠscalar_field = Fields.Field(Float64, face_space) # Does not work: @. ᶜgradᵥ(level_field + ᶠscalar_field) ```PS: I'm using the
benchmarks/3d
environment for convenience since it has bothClimaCore
andCUDA
available.System details
Any relevant system information:
versioninfo()
```julia # using climacommon/2024_05_27 julia> versioninfo() Julia Version 1.10.3 Commit 0b4590a5507 (2024-04-30 10:59 UTC) Build Info: Official https://julialang.org/ release Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 28 × Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-15.0.7 (ORCJIT, broadwell) Threads: 1 default, 0 interactive, 1 GC (on 28 virtual cores) Environment: JULIA_MPI_HAS_CUDA = true JULIA_CPU_TARGET = generic;broadwell;skylake-avx512;icelake-server;cascadelake;znver3 JULIA_CUDA_MEMORY_POOL = none JULIA_LOAD_PATH = :/groups/esm/modules/julia-preferences/_2024_02_20 JULIA_CUDA_USE_BINARYBUILDER = false JULIA_NUM_THREADS = auto ```