CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License
1k stars 196 forks source link

CUDA error with immersed boundary when running on GPU #2479

Closed raphaelouillon closed 2 years ago

raphaelouillon commented 2 years ago

While running tests on the new boundary condition implementation on an immersed boundary I encountered a CUDA error that appears to only pop up when using an immersed boundary. I repeated the error on two machines (Satori and my personal machine). I am running on Julia 1.7.2 with Nvidia Driver Version: 510.60.02 and CUDA Version: 11.6. I attach the start of the error message. @glwagner, do you know of any particularity of the immersed boundary that might cause this?

[ Info: Initializing simulation...
[ Info: [0.00%], iteration: 0, time: 0.000
[ Info:     ... simulation initialization complete (1.209 seconds)
[ Info: Executing initial time step...
ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/error.jl:105
 [2] query
   @ ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/stream.jl:102 [inlined]
 [3] synchronize(stream::CUDA.CuStream; blocking::Bool)
   @ CUDA ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/stream.jl:117
 [4] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/stream.jl:117 [inlined]
 [5] top-level scope
   @ ~/.julia/packages/CUDA/DL5Zo/src/initialization.jl:54

caused by: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/error.jl:105
  [2] macro expansion
    @ ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/error.jl:115 [inlined]
  [3] cuCtxSynchronize()
    @ CUDA ~/.julia/packages/CUDA/DL5Zo/lib/utils/call.jl:26
  [4] device_synchronize
    @ ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/context.jl:319 [inlined]
  [5] CUDA.CuModule(data::Vector{UInt8}, options::Dict{CUDA.CUjit_option_enum, Any})
    @ CUDA ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/module.jl:41
  [6] CuModule
    @ ~/.julia/packages/CUDA/DL5Zo/lib/cudadrv/module.jl:23 [inlined]
  [7] cufunction_link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA ~/.julia/packages/CUDA/DL5Zo/src/compiler/execution.jl:442
  [8] cached_compilation(cache::Dict{UInt64, Any}, job::GPUCompiler.CompilerJob, compiler::typeof(CUDA.cufunction_compile), linker::typeof(CUDA.cufunction_link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/fG3xK/src/cache.jl:94
...
glwagner commented 2 years ago

Can you come up with a minimal example? I'll see if I can reproduce it.

glwagner commented 2 years ago

@simone-silvestri do you have anything to offer?

I think issues like this do make it a bit more pressing to update to latest CUDA + KernelAbstractions, which we've been slacking on for quite a while now.

simone-silvestri commented 2 years ago

If you post a MWE, I can try to look at the issue. We should indeed update...

raphaelouillon commented 2 years ago

@glwagner this should reproduce the error. The message is thrown when calling set(model,...)


using Oceananigans.ImmersedBoundaries: ImmersedBoundaryGrid, GridFittedBottom, mask_immersed_field!

using Oceananigans.Architectures: device
using Oceananigans.Grids: xnode, znode
using KernelAbstractions: MultiEvent
using JLD2
using Printf
#using GLMakie
using SpecialFunctions

arch = GPU()
Nx = 256
Nz = 64 # Resolution
#Ny = 64
κ = 1e-6 # Diffusivity and viscosity (Prandtl = 1)

underlying_grid = RectilinearGrid(arch,
                                  size = (Nx, Nz),
                                  x = (0, 5),
                                  z = (-0.05, 1.0),
                                  halo = (3, 3),
                                  topology = (Bounded, Flat, Bounded))

#const gamx = 2.0
#const gamy = 20.0

@inline bottom_topography(x,y) =  0.0;#*exp.(-gamy*y.^2)
grid = ImmersedBoundaryGrid(underlying_grid, GridFittedBottom(bottom_topography))

no_slip = ValueBoundaryCondition(0)
u_bcs = FieldBoundaryConditions(bottom=no_slip, immersed=no_slip)
w_bcs = FieldBoundaryConditions(immersed=no_slip)
boundary_conditions = (; u = u_bcs, w = w_bcs)

model = NonhydrostaticModel(grid = grid,
                            advection = WENO5(),
                            boundary_conditions = boundary_conditions,
                            closure =  ScalarDiffusivity(ν=κ,κ=κ),
                            coriolis = nothing,
                            tracers = :b,
                            buoyancy = BuoyancyTracer())

b₀(x,y, z) =0.5*(erf.((x.-1.0)*10).-1.0)
set!(model, u = 0.0, b = b₀)```
raphaelouillon commented 2 years ago

@glwagner this is also in immersed_flat_lock_release.jl in here.

glwagner commented 2 years ago

I'll take a look! One quick thing is not to use broadcasting (dots) in your initial condition function. This function is evaluated at x, y, z points -- or in other words, set! does the broadcasting for you. I doubt that's the issue though.

raphaelouillon commented 2 years ago

Hi @glwagner , I saw this thread from a few weeks ago that seemed relevant, but I don't see am obvious way of applying the fix to my script.

glwagner commented 2 years ago

Did you try avoiding broadcasting ? That's the first thing to do. Then we can take a closer look at the initial condition function and see if there's more to change.

raphaelouillon commented 2 years ago

@glwagner I did, it did not resolve the issue however.

glwagner commented 2 years ago

Actually I think the fact that the error comes from set!(model, ...) might imply deeper issues. Can you set fields individually? You might try figuring out exactly what kernel is causing the problem...

raphaelouillon commented 2 years ago

@glwagner the same error is returned whether I set the velocity or buoyancy tracer fields individually.

glwagner commented 2 years ago

How about set!(u, something) (I'm not sure I was clear about what I asked, sorry about that)

simone-silvestri commented 2 years ago

Running on julia debug level 2: the error seems to be an out-of-bounds array access coming from _mask_immersed_field! triggered at mask_immersed_velocities! (which is called in set!(model; kwargs...).

Indeed, if you set!(b, b₀) the error disappears, but it is still there whenever you will call mask_immersed_field! (i.e., when updating the model). The output is a little difficult to interpret but the error seems to be located here

peripheral_node(::Center, ::Face, ::Center, ::Int64, ::Int64, ::Int64, ::ImmersedBoundaryGrid{Float64, Bounded, Flat, 
Bounded, RectilinearGrid{Float64, Bounded, Flat, Bounded, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, 
StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, StepRangeLen{Float64, 
Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}, OffsetArrays.OffsetVector{Float64, 
StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, Nothing}, 
GridFittedBottom{OffsetArrays.OffsetMatrix{Float64, CUDA.CuDeviceMatrix{Float64, 1}}}, Nothing}) at 
/home/ssilvest/Oceananigans.jl/src/Grids/inactive_node.jl:104

So it seems like it is trying to do a peripheral_node of the v velocity, which in this case is flattened since the direction is Flat. In this case it will try to do

@inline peripheral_node(LX, ::Face, LZ, i, j, k, grid) = inactive_cell(i, j, k, grid) | inactive_cell(i, j-1, k, grid)

hitting inactive_cell(i, 0, k, grid). That is not a problem for a normal grid bit for an immersed grid it is because it evaluates immersed_cell at a zero location.

glwagner commented 2 years ago

Nice sleuthing!

simone-silvestri commented 2 years ago

Yeah just writing down the whole rationale to remember it and fix it later :)

raphaelouillon commented 2 years ago

Thanks @simone-silvestri @glwagner! I encountered a different error on CPU recently that might be related. I ran a 3D (Bounded,Bounded,Bounded) simulation with immersed boundary on CPU. When using FieldTimeSeries to load the JLD2 output, I get a BoundsError message:

Error showing value of type FieldTimeSeries{Center, Center, Center, InMemory, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 4, Array{Float64, 4}}, ImmersedBoundaryGrid{Float64, Bounded, Bounded, Bounded, RectilinearGrid{Float64, Bounded, Bounded, Bounded, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, CPU}, GridFittedBottom{OffsetArrays.OffsetMatrix{Float64, Matrix{Float64}}}, CPU}, Float64, FieldBoundaryConditions{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Vector{Float64}}:
ERROR: BoundsError: attempt to access 512×512×128×10 FieldTimeSeries{Center, Center, Center, InMemory, Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}}, OffsetArrays.OffsetArray{Float64, 4, Array{Float64, 4}}, ImmersedBoundaryGrid{Float64, Bounded, Bounded, Bounded, RectilinearGrid{Float64, Bounded, Bounded, Bounded, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, CPU}, GridFittedBottom{OffsetArrays.OffsetMatrix{Float64, Matrix{Float64}}}, CPU}, Float64, FieldBoundaryConditions{Nothing, Nothing, Nothing, Nothing, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Vector{Float64}} at index [1, 1, 1]
Stacktrace:
simone-silvestri commented 2 years ago

It seems to be only a show problem, so you should be able to use FieldTimeSeries anyways. It has to do with showing a 4 dimensional array (field.data) with methods implemented for 3-dimensional arrays. thanks for reporting it! I ll fix it

raphaelouillon commented 2 years ago

Ah yes indeed, sorry I should have specified that the data sets were still usable, I was just confused by the error.