CliMA / ClimaCore.jl

CliMA model dycore
https://clima.github.io/ClimaCore.jl/dev
Apache License 2.0
87 stars 8 forks source link

gradient on combination of horizontal and 3d fields does not work #1989

Open haakon-e opened 1 month ago

haakon-e commented 1 month ago

Describe the bug

The following code produces an error:

ᶜgradᵥ = Operators.GradientF2C()

level_field = Fields.level(Fields.Field(Float64, center_space), 1)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)
produces the following error ```julia ERROR: InvalidIRError: compiling MethodInstance for ClimaCoreCUDAExt.copyto_stencil_kernel!(::ClimaCore.Fields.Field{…}, ::ClimaCore.Operators.StencilBroadcasted{…}, ::ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, ::NTuple{…}, ::ClimaCore.DataLayouts.UniversalSize{…}) resulted in invalid LLVM IR Reason: unsupported dynamic function invocation (call to level) Stacktrace: [1] copyto_stencil_kernel! @ ~/.julia/packages/ClimaCore/wHC4M/ext/cuda/operators_finite_difference.jl:0 Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl Stacktrace: [1] check_ir(job::GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, args::LLVM.Module) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/validation.jl:147 [2] macro expansion @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:458 [inlined] [3] macro expansion @ ~/.julia/packages/TimerOutputs/Lw5SP/src/TimerOutput.jl:253 [inlined] [4] macro expansion @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:457 [inlined] [5] emit_llvm(job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, only_entry::Bool, validate::Bool) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/utils.jl:103 [6] emit_llvm @ ~/.julia/packages/GPUCompiler/Y4hSX/src/utils.jl:97 [inlined] [7] codegen(output::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool, parent_job::Nothing) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:136 [8] codegen @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:115 [inlined] [9] compile(target::Symbol, job::GPUCompiler.CompilerJob; libraries::Bool, toplevel::Bool, optimize::Bool, cleanup::Bool, strip::Bool, validate::Bool, only_entry::Bool) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:111 [10] compile @ ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:103 [inlined] [11] #1145 @ ~/.julia/packages/CUDA/Tl08O/src/compiler/compilation.jl:254 [inlined] [12] JuliaContext(f::CUDA.var"#1145#1148"{GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}; kwargs::@Kwargs{}) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:52 [13] JuliaContext(f::Function) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/driver.jl:42 [14] compile(job::GPUCompiler.CompilerJob) @ CUDA ~/.julia/packages/CUDA/Tl08O/src/compiler/compilation.jl:253 [15] actual_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, world::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link)) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/execution.jl:237 [16] cached_compilation(cache::Dict{Any, CuFunction}, src::Core.MethodInstance, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, compiler::Function, linker::Function) @ GPUCompiler ~/.julia/packages/GPUCompiler/Y4hSX/src/execution.jl:151 [17] macro expansion @ ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:369 [inlined] [18] macro expansion @ ./lock.jl:267 [inlined] [19] cufunction(f::typeof(ClimaCoreCUDAExt.copyto_stencil_kernel!), tt::Type{Tuple{ClimaCore.Fields.Field{…}, ClimaCore.Operators.StencilBroadcasted{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, NTuple{…}, ClimaCore.DataLayouts.UniversalSize{…}}}; kwargs::@Kwargs{always_inline::Bool}) @ CUDA ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:364 [20] cufunction @ ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:361 [inlined] [21] macro expansion @ ~/.julia/packages/CUDA/Tl08O/src/compiler/execution.jl:112 [inlined] [22] threads_via_occupancy(f!::typeof(ClimaCoreCUDAExt.copyto_stencil_kernel!), args::Tuple{ClimaCore.Fields.Field{…}, ClimaCore.Operators.StencilBroadcasted{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, NTuple{…}, ClimaCore.DataLayouts.UniversalSize{…}}) @ ClimaCoreCUDAExt ~/.julia/packages/ClimaCore/wHC4M/ext/cuda/cuda_utils.jl:94 [23] copyto!(out::ClimaCore.Fields.Field{ClimaCore.DataLayouts.VIJFH{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}}, bc::ClimaCore.Operators.StencilBroadcasted{ClimaCoreCUDAExt.CUDAColumnStencilStyle, ClimaCore.Operators.GradientF2C{…}, Tuple{…}, ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}}) @ ClimaCoreCUDAExt ~/.julia/packages/ClimaCore/wHC4M/ext/cuda/operators_finite_difference.jl:27 [24] copy @ ~/.julia/packages/ClimaCore/wHC4M/src/Operators/common.jl:52 [inlined] [25] materialize(opbc::ClimaCore.Operators.StencilBroadcasted{ClimaCore.Operators.StencilStyle, ClimaCore.Operators.GradientF2C{@NamedTuple{}}, Tuple{Base.Broadcast.Broadcasted{ClimaCore.Fields.FieldStyle{…}, Nothing, typeof(ClimaCore.RecursiveApply.radd), Tuple{…}}}, Nothing}) @ ClimaCore.Operators ~/.julia/packages/ClimaCore/wHC4M/src/Operators/common.jl:57 [26] top-level scope @ REPL[6]:1 Some type information was truncated. Use `show(err)` to see complete types. ```

To Reproduce

See reproducer: ```julia import Pkg Pkg.activate("benchmarks/3d") # run from `ClimaCore.jl` root import ClimaCore: Domains, Meshes, Geometry, Grids, Spaces, Topologies, Hypsography, Fields, Operators import ClimaComms # Request a GPU device on hpc: # srun -G1 -t 01:00:00 --partition gpu --pty bash -l ENV["CLIMACOMMS_DEVICE"] = "CUDA" using CUDA comms_ctx = ClimaComms.SingletonCommsContext() h_domain = Domains.RectangleDomain( Domains.IntervalDomain(Geometry.XPoint(0.0), Geometry.XPoint(1.0); periodic = true), Domains.IntervalDomain(Geometry.YPoint(0.0), Geometry.YPoint(1.0); periodic = true) ) h_mesh = Meshes.RectilinearMesh(h_domain, 10, 10) h_grid = Spaces.grid(Spaces.SpectralElementSpace2D( Topologies.DistributedTopology2D(comms_ctx, h_mesh, Topologies.spacefillingcurve(h_mesh)), Spaces.Quadratures.GLL{4}(), )) z_domain = Domains.IntervalDomain(Geometry.ZPoint(0.0), Geometry.ZPoint(1.0); boundary_names = (:bottom, :top)) z_grid = Grids.FiniteDifferenceGrid(Topologies.IntervalTopology( comms_ctx, Meshes.IntervalMesh(z_domain, Meshes.Uniform(); nelems = 10) )) grid = Grids.ExtrudedFiniteDifferenceGrid(h_grid, z_grid, Hypsography.Flat(); deep = false) center_space = Spaces.CenterExtrudedFiniteDifferenceSpace(grid) face_space = Spaces.FaceExtrudedFiniteDifferenceSpace(grid) # Create fields and show that it fails ᶜgradᵥ = Operators.GradientF2C() level_field = Fields.level(Fields.Field(Float64, center_space), 1) ᶠscalar_field = Fields.Field(Float64, face_space) # Does not work: @. ᶜgradᵥ(level_field + ᶠscalar_field) ```

PS: I'm using the benchmarks/3d environment for convenience since it has both ClimaCore and CUDA available.

System details

Any relevant system information:

versioninfo() ```julia # using climacommon/2024_05_27 julia> versioninfo() Julia Version 1.10.3 Commit 0b4590a5507 (2024-04-30 10:59 UTC) Build Info: Official https://julialang.org/ release Platform Info: OS: Linux (x86_64-linux-gnu) CPU: 28 × Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz WORD_SIZE: 64 LIBM: libopenlibm LLVM: libLLVM-15.0.7 (ORCJIT, broadwell) Threads: 1 default, 0 interactive, 1 GC (on 28 virtual cores) Environment: JULIA_MPI_HAS_CUDA = true JULIA_CPU_TARGET = generic;broadwell;skylake-avx512;icelake-server;cascadelake;znver3 JULIA_CUDA_MEMORY_POOL = none JULIA_LOAD_PATH = :/groups/esm/modules/julia-preferences/_2024_02_20 JULIA_CUDA_USE_BINARYBUILDER = false JULIA_NUM_THREADS = auto ```
haakon-e commented 1 month ago

cc: @dennisYatunin

charleskawczynski commented 1 month ago

@haakon-e, could you please update the reproducer? I think that the existing one has a bug. In particular:

level_field = Fields.level(Fields.Field(Float64, center_space), 1)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)

level_field + ᶠscalar_field is attempting to add fields on two different spaces, which is (purposefully) not allowed. I assume that this is meant to be (and we still need import ClimaCore: Utilities above):

level_field = Fields.level(Fields.Field(Float64, face_space), Utilities.half)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)
charleskawczynski commented 1 month ago

I think the correct reproducer (as I expect) is:

import ClimaCore: Domains, Meshes, Geometry, Grids, Spaces, Topologies, Hypsography, Fields, Operators, Utilities
import ClimaComms

ENV["CLIMACOMMS_DEVICE"] = "CUDA"
using CUDA
# ENV["CLIMACOMMS_DEVICE"] = "CPU"
comms_ctx = ClimaComms.SingletonCommsContext()

h_domain = Domains.RectangleDomain(
    Domains.IntervalDomain(Geometry.XPoint(0.0), Geometry.XPoint(1.0); periodic = true), 
    Domains.IntervalDomain(Geometry.YPoint(0.0), Geometry.YPoint(1.0); periodic = true)
)
h_mesh = Meshes.RectilinearMesh(h_domain, 10, 10)
h_grid = Spaces.grid(Spaces.SpectralElementSpace2D(
    Topologies.DistributedTopology2D(comms_ctx, h_mesh, Topologies.spacefillingcurve(h_mesh)),
    Spaces.Quadratures.GLL{4}(),
))
z_domain = Domains.IntervalDomain(Geometry.ZPoint(0.0), Geometry.ZPoint(1.0); boundary_names = (:bottom, :top))
z_grid = Grids.FiniteDifferenceGrid(Topologies.IntervalTopology(
    comms_ctx, Meshes.IntervalMesh(z_domain, Meshes.Uniform(); nelems = 10)
))
grid = Grids.ExtrudedFiniteDifferenceGrid(h_grid, z_grid, Hypsography.Flat(); deep = false)
center_space = Spaces.CenterExtrudedFiniteDifferenceSpace(grid)
face_space = Spaces.FaceExtrudedFiniteDifferenceSpace(grid)

# Create fields and show that it fails
ᶜgradᵥ = Operators.GradientF2C()

level_field = Fields.level(Fields.Field(Float64, face_space), Utilities.half)
ᶠscalar_field = Fields.Field(Float64, face_space)
# Does not work:
@. ᶜgradᵥ(level_field + ᶠscalar_field)

import LazyBroadcast as LB
bc = LB.@lazy @. ᶜgradᵥ(level_field + ᶠscalar_field);

Alarmingly, neither of these break on the CPU!

dennisYatunin commented 1 month ago

Some progress from earlier today:

julia> @. ᶜgradᵥ(level_field + ᶠscalar_field) ClimaCore.Geometry.Covariant3Vector{Float64}-valued Field: components: data: 1: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

- The following method overwrite does not fix the issue:
```julia
julia> @inline Operators.reconstruct_placeholder_space(::Operators.LevelPlaceholderSpace, parent_space) =
    Spaces.level(parent_space, Operators.PlusHalf(0))

julia> @. ᶜgradᵥ(level_field + ᶠscalar_field)
ERROR: InvalidIRError: compiling MethodInstance for ClimaCoreCUDAExt.copyto_stencil_kernel!(::ClimaCore.Fields.Field{…}, ::ClimaCore.Operators.StencilBroadcasted{…}, ::ClimaCore.Spaces.ExtrudedFiniteDifferenceSpace{…}, ::NTuple{…}, ::ClimaCore.DataLayouts.UniversalSize{…}) resulted in invalid LLVM IR
Reason: unsupported call to an unknown function (call to julia.new_gc_frame)
Reason: unsupported call to an unknown function (call to julia.push_gc_frame)
Reason: unsupported call to an unknown function (call to julia.get_gc_frame_slot)
Reason: unsupported dynamic function invocation (call to level)
Stacktrace:
 [1] copyto_stencil_kernel!
   @ /central/home/dyatunin/ClimaCore.jl/ext/cuda/operators_finite_difference.jl:0
Reason: unsupported call to an unknown function (call to julia.pop_gc_frame)
Hint: catch this exception as `err` and call `code_typed(err; interactive = true)` to introspect the erronous code with Cthulhu.jl

We can conclude that the inference problem on GPUs is occurring during dispatch of

level(space::FaceExtrudedFiniteDifferenceSpace3D, v::PlusHalf) =
    SpectralElementSpace2D(level(grid(space), v))

This might be happening because the FaceExtrudedFiniteDifferenceSpace3D type is too complex for the GPU compiler. It should be easy to refactor this so that it no longer dispatches over such complicated types; e.g., we could dispatch over Spaces.staggering(space) instead of the space itself.

charleskawczynski commented 3 weeks ago

Note for future self (and @dennisYatunin). One thing I noticed is that we don't always call Base.Broadcast.instantiate for our datalayout kernels (we sometimes assume that it's been called by the field layer). Doing

import LazyBroadcast as LB
bc = LB.@lazy @. ᶜgradᵥ(level_field + ᶠscalar_field);
materialize(bc)

reproduces the error above (w.r.t. inference failure in level), and

import LazyBroadcast as LB
bc = LB.@lazy @. ᶜgradᵥ(level_field + ᶠscalar_field);
bc = Base.Broadcast.instantiate(bc);
materialize(bc)

yields a different error (unmatched spaces). It may be that this doesn't fix the issue, but I think that we may want to fix this first.