CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS) when using Lagrangian particles under large CFL number

Yixiao-Zhang commented 1 year ago

In the example below, the model crashes reporting a GPU illegal memory access error. The CFL number is intentionally set to a large value, under which the model will encounter numerical instability. I expect this model should abort itself when NANs appear instead of crashing due to a memory illegal access error. Besides, this only happens when I use Lagrangian particles. If not, the model will terminate by itself as I expect. I have also verified that the model does not crash when the CFL number is small.

using Oceananigans

const Lx = 1.0
const Nx = 50
const Δx = Lx / Nx
const max_velocity = 1.0
const cfl = 10.0
const Δt = cfl * Δx / max_velocity

function initial_u(x::R, y::R, z::R) where {R<:Real}
    return (max_velocity / Lx) * y
end

grid = RectilinearGrid(
    GPU(),
    size = (Nx, Nx, Nx),
    x = (0.0, Lx),
    y = (0.0, Lx),
    z = (0.0, Lx),
    topology = (Periodic, Bounded, Bounded)
)

arch_array = Oceananigans.Architectures.array_type(GPU()){Float64}
n_particles = 1000

xs = convert(arch_array, zeros((n_particles, )))
ys = convert(arch_array, LinRange(0.0, Lx, n_particles))
zs = convert(arch_array, zeros((n_particles, )))

particles = LagrangianParticles(x = xs, y = ys, z = zs)

model = NonhydrostaticModel(;
    grid,
    particles = particles,
)

set!(model, u = initial_u)

simulation = Simulation(model; Δt = Δt, stop_iteration = 200)

run!(simulation)

The output.log is uploaded as a file.

Test environment:

Julia version: v1.9.3
Oceananigans: v0.89.0
Tested on Ubuntu 20.04.6 LTS with CUDA 12.0 and MIT Satori with CUDA 11.4

This example tries to reproduce some of my simulations for convection. In these simulation, I used strong heating, and therefore I expect some of them to crash. However, I did not expect that they would trigger GPU illegal memory access errors.

This issue is probably related to #3267.

glwagner commented 1 year ago

huh, do you think there is a bug in interpolation that causes an out of bounds memory access?

jagoosw commented 1 year ago

I have seen particles cause out of bounds memory access errors on GPU before. I think it happens because the interpolation tries to index into the array at an index that does exist (because it's @inbounds 'ed) causing the error.

jagoosw commented 1 year ago

I don't think this is the same as my error linked from before.

glwagner commented 1 year ago

I have seen particles cause out of bounds memory access errors on GPU before. I think it happens because the interpolation tries to index into the array at an index that does exist (because it's @inbounds 'ed) causing the error.

That's indicative of a bug yes? This could lead to wrong behavior if the accessed memory does exist but is irrelevant (although I guess it depends whether the value is actually used or not)

glwagner commented 1 year ago

Possibly inspecting how boundary conditions are implemented for particles will yield a clue.

jagoosw commented 1 year ago

I have seen particles cause out of bounds memory access errors on GPU before. I think it happens because the interpolation tries to index into the array at an index that does exist (because it's @inbounds 'ed) causing the error.

That's indicative of a bug yes? This could lead to wrong behavior if the accessed memory does exist but is irrelevant (although I guess it depends whether the value is actually used or not)

I guess this would be a bug in that scenario

jagoosw commented 1 year ago

Reading through the error message again I'm not sure its what I thought.

Yixiao-Zhang commented 1 year ago

I find that the code for boundary conditions for particles cannot deal with unusually large velocity (when $u\Delta t$ has the order of the domain size). Fixing the related code (Yixiao-Zhang/Oceananigans.jl@95f68a1) enables running the script that I posted previously in this page.

Do we need an additional test for such cases?

One remaining question is why the log shows that the error occurs from the pressure solver. Can we do anything to improve the accuracy of error messages?

simone-silvestri commented 1 year ago

GPU error messages are a little iffy because the CPU and GPU are not synchronized. The error appears in the pressure solver because it is the first location in the code where the GPU is synchronized (in this case through a memory copy).

There are a couple of ways to catch errors/debug, in general, what I recommend is to

use --check-bounds=yes this will allow you to pinpoint better the issue
run a breaking code on the CPU, everything is easier there! if this is not possible you can use the -g 2 flag to allow GPU debugging but it is still a little more difficult than just migrating the code on CPU and making sure it works.

glwagner commented 1 year ago

@Yixiao-Zhang do you want to open a PR?

Sensible errors on GPU are a persistent problem... this seems like one of the trickiest though...

I think the best easy thing we can do in this case is to add an article in our Wiki about how to debug CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS), making @simone-silvestri 's points.

For documentation the error being referred to is

ERROR: LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
  [2] check
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:34 [inlined]
  [3] cuMemcpyHtoDAsync_v2
    @ ~/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26 [inlined]
  [4] #unsafe_copyto!#9
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:397 [inlined]
  [5] (::CUDA.var"#1012#1013"{ComplexF64, CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer}, Int64, Vector{ComplexF64}, Int64, Int64})()
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:464
  [6] #context!#887
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170 [inlined]
  [7] context!
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
  [8] unsafe_copyto!(dest::CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer}, doffs::Int64, src::Vector{ComplexF64}, soffs::Int64, n::Int64)
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:457
  [9] copyto!
    @ ~/.julia/packages/CUDA/35NC6/src/array.jl:415 [inlined]
 [10] setindex!(::CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer}, ::ComplexF64, ::Int64, ::Int64, ::Int64)
    @ GPUArrays ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:20
 [11] setindex!
    @ ~/.julia/packages/GPUArrays/5XhED/src/host/indexing.jl:24 [inlined]
 [12] macro expansion
    @ ~/.julia/packages/GPUArraysCore/uOYfN/src/GPUArraysCore.jl:136 [inlined]
 [13] solve!(ϕ::Field{Center, Center, Center, Nothing, RectilinearGrid{Float64, Periodic, Bounded, Bounded, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, GPU}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.Periodic, Nothing}, ...

PS I changed the issue title so it will be more easily findable via search.

Yixiao-Zhang commented 1 year ago

It will be good if we have a standard procedure to debug certain GPU errors.

However, after trying Simone's suggestions, I found that using --check-bounds=yes or running the code on a CPU cannot reproduce the error in the original script posted by me. The script just finishes without reporting any error. I understand that these suggested methods to debug CUDA errors are helpful generally, but they do not work in this case.

glwagner commented 1 year ago

Might be good to open a conversation about this on the #gpu channel on the Julia slack (this is not an Oceananigans-specific issue) to see if anyone in the Julia community has advice. If we have a solid procedure to recommend we can add it to our github wiki.

glwagner commented 1 year ago

I found that using --check-bounds=yes or running the code on a CPU cannot reproduce the error in the original script posted by me.

Interesting --- does that mean that this is not an out-of-bounds issue?

I expect this model should abort itself when NANs appear instead of crashing due to a memory illegal access error.

Are the NaNs appearing in the particle coordinates, or in the model velocity fields? The default NaNChecker only checks the first entry in the model's prognostic field:

https://github.com/CliMA/Oceananigans.jl/blob/00006c17c2ede7f819e15ae14aabb14ab62a0136/src/Simulations/simulation.jl#L71

https://github.com/CliMA/Oceananigans.jl/blob/00006c17c2ede7f819e15ae14aabb14ab62a0136/src/Models/Models.jl#L163-L169

Note also that by default, the NaNChecker is only actuated every 100 iterations. You are free to use a different NaNChecker, however. The balance between the computational cost of checking NaNs / frequency of NaN checking and the cost of a time-step is use-case-specific. Can you clarify why you expect the model to abort itself when NaNs appear? Are you suggesting that we can improve the default NaN checker?

Yixiao-Zhang commented 1 year ago

My statement is somewhat misleading. I should say that I expect the model to abort itself AFTER NaNs appear. It does not have to abort immediately at the first occurrence of NaN, but it needs to abort before an out-of-bounds error. Ideally, we do not want a model to crash due an out-of-bounds error. Instead, it should abort with a error message, and a detailed solution can be found in the document.

Does this answer your question?

simone-silvestri commented 1 year ago

I went back to check your script and you should not expect NaNs in this simulation. The model is initiated only with a u-velocity which is constant in x, there are no x-gradients, the grid is periodic in x and there are no boundary conditions nor viscosity that could affect the velocity field, thus the flow field is steady, and would not change also with much higher CFL numbers. There are no tracers, so no possible NaNs there as well.

Particles are lagrangian, they cannot NaN because they do not undergo numerical instability, the only thing that can happen is that they are shot in the x direction way beyond the periodic x-domain which will understandably produce out-of-bounds results when trying to interpolate the location in the grid.

I believe that If you find NaNs there is a bug because that simulation should not produce any.

You can set debugging level 2 in the Julia invocation (julia -g2 --check-bounds=yes) to have access to a more detailed stack trace on the GPU. https://cuda.juliagpu.org/stable/development/debugging/ Note that you need CUDA version 11.5 or higher to have access to a detailed stack trace

glwagner commented 1 year ago

Would a NaN in the velocity field cause an out-of-bounds error? Probably ensuring that does not happen would be relatively easy to fix.

will understandably produce out-of-bounds results when trying to interpolate the location in the grid.

This is incorrect behavior though, right @simone-silvestri ? I thought the problem was with Bounded, not Periodic. The issue is when the velocity field + time-step is so high that multiple reflections across the entirety of a bounded direction occur in a single time-step. We don't model that situation. We can only handle a single reflection.

Yixiao-Zhang commented 1 year ago

I just checked the time evolution of the velocity field in this case. It seems to me that shear instability occurs when running the script on a GPU. However, the flow is steady when running the script on a CPU. Does CUDA introduce floating-point error that has x-dependence? Perhaps from the pressure solver?

glwagner commented 1 year ago

It shouldn't, unless you are looking at a regime that isn't covered in our regression tests / tests that closely inspect the difference between CPU and GPU results.

simone-silvestri commented 1 year ago

Would a NaN in the velocity field cause an out-of-bounds error? Probably ensuring that does not happen would be relatively easy to fix.

will understandably produce out-of-bounds results when trying to interpolate the location in the grid.

This is incorrect behavior though, right @simone-silvestri ? I thought the problem was with Bounded, not Periodic. The issue is when the velocity field + time-step is so high that multiple reflections across the entirety of a bounded direction occur in a single time-step. We don't model that situation. We can only handle a single reflection.

The particles should not move in Bounded directions as there is no v-velocity (and no way to produce one because of no Coriolis, no viscosity, and no friction at the boundaries). Particles should move only along the x-direction where they move out-of-bounds because CFL=10 and the particle shoots out of the domain. Shear instability should not happen because there is no viscosity/coriolis/buoyancy. In my opinion the unexpected behavior (to be corrected) is to find a NaN or a reflection. @Yixiao-Zhang can you try running on CPU with --check-bounds=yes or on GPU with -g2 --check-bounds=yes you should get more information about the issue.

Yixiao-Zhang commented 1 year ago

When running the script on a CPU with --check-bounds=yes, the script can be finished without errors:

[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (112.172 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (3.376 seconds).
[ Info: Simulation is stopping after running for 7.945 seconds.
[ Info: Model iteration 200 equals or exceeds stop iteration 200.

I should make it clear that CFL=10 is not large enough to make particles move out of the domain. CFL should be larger than Nx, because the distance at which a particle moves in one time step needs to larger than the domain size (not the grid size). That means $u \Delta t > L_x$, which is equivalent to CFL > Nx.

Using a CFL of 51 (Nx = 50 in this case) reproduces the error on a CPU:

[ Info: Initializing simulation...
[ Info:     ... simulation initialization complete (108.842 ms)
[ Info: Executing initial time step...
[ Info:     ... initial time step complete (3.303 seconds).
ERROR: LoadError: BoundsError: attempt to access 56×56×56 OffsetArray(::Array{Float64, 3}, -2:53, -2:53, -2:53) with eltype Float64 with indices -2:53×-2:53×-2:53 at index [54, 49, 1]
Stacktrace:
  [1] throw_boundserror(A::OffsetArrays.OffsetArray{Float64, 3, Array{Float64, 3}}, I::Tuple{Int64, Int64, Int64})
    @ Base ./abstractarray.jl:744
  [2] checkbounds
    @ ./abstractarray.jl:709 [inlined]
  [3] getindex
    @ ~/.julia/packages/OffsetArrays/0MOrf/src/OffsetArrays.jl:420 [inlined]
  [4] getindex
    @ ~/Documents/IdealizedOceanWorlds.jl/OceananigansMemeoryIssue.jl/dev/Oceananigans/src/Fields/field.jl:399 [inlined]
  [5] getindex
    @ ~/Documents/IdealizedOceanWorlds.jl/OceananigansMemeoryIssue.jl/dev/Oceananigans/src/Utils/sum_of_arrays.jl:23 [inlined]
  [6] _interpolate
    @ ~/Documents/IdealizedOceanWorlds.jl/OceananigansMemeoryIssue.jl/dev/Oceananigans/src/Fields/interpolate.jl:148 [inlined]
  [7] interpolate
    @ ~/Documents/IdealizedOceanWorlds.jl/OceananigansMemeoryIssue.jl/dev/Oceananigans/src/Fields/interpolate.jl:197 [inlined]
  [8] advect_particle
    @ ~/Documents/IdealizedOceanWorlds.jl/OceananigansMemeoryIssue.jl/dev/Oceananigans/src/Models/LagrangianParticleTracking/lagrangian_particle_advection.jl:83 [inlined]
  [9] macro expansion
    @ ~/Documents/IdealizedOceanWorlds.jl/OceananigansMemeoryIssue.jl/dev/Oceananigans/src/Models/LagrangianParticleTracking/lagrangian_particle_advection.jl:145 [inlined]
 [10] cpu__advect_particles!
    @ ~/.julia/packages/KernelAbstractions/cWlFz/src/macros.jl:276 [inlined]
...

I do not see why viscosity/Coriolis/buoyancy is required for shear instability. Anyway, I found another way to reproduce the error on CPU, while keeping CFL = 10, is to introduce a small perturbation to the initial velocity field:

function initial_u(x::R, y::R, z::R) where {R<:Real}
    ϵ = 1e-7
    return (max_velocity / Lx) * y + ϵ * max_velocity * sin(6π * x / Lx)
end

The output error message is the same as the previous one, in which CFL = 51.

Yixiao-Zhang commented 1 year ago

I think I understand how particles cause an out-of-bounds error when the velocity is large. However, the difference between CPU and GPU simulations is another issue. Let me see whether I can reproduce this issue with a smaller CFL.

simone-silvestri commented 1 year ago

It's a problem of particles being advected in the x-direction. You can see that the out-of-bounds error is in the x-direction ERROR: LoadError: BoundsError: attempt to access 56×56×56 OffsetArray(::Array{Float64, 3}, -2:53, -2:53, -2:53) with eltype Float64 with indices -2:53×-2:53×-2:53 at index [54, 49, 1] You are moving out of the periodic grid because the particle is being advanced beyond the periodic domain. There is no bouncing happening and it is not a problem of Bounded directions because the particle is not moving in the y- or in the z-direction.

You are prescribing a steady state flow which is characterized by a u-velocity only. There is no tendency term that can develop a shear instability since a y-gradient in u is stable if there are no additional frictional forces, i.e.: $$\frac{\partial u}{\partial t} = - \frac{\partial uu}{\partial x} - \frac{\partial uv}{\partial y} - \frac{\partial uw}{\partial z} - \frac{\partial p}{\partial x}$$ All the terms on the RHS of this equation are zero because

$v = 0$
$w = 0$
$\frac{\partial uu}{\partial x} = 0$
$\frac{\partial p}{\partial x} = 0$ since $\delta_x u^ + \delta_y v^ + \delta_z w^* = 0$ and you have periodic boundary conditions in the x-direction

from how you initialized it, the flow cannot change, irrespective of your CFL (if you remove your particles you'll see that the code will run indefinitely without changing, even with CFL = 100).

In your second case, when you change the velocity to

function initial_u(x::R, y::R, z::R) where {R<:Real}
    ϵ = 1e-7
    return (max_velocity / Lx) * y + ϵ * max_velocity * sin(6π * x / Lx)
end

you are initializing your solution with a divergent flow $\partial_x u + \partial_y v + \partial_z w \ne 0$ which is not "admissible" in an incompressible model (such as Oceananigans' non-hydrostatic-model). The initialization then triggers a pressure correction which will act to suppress the divergence in your initial conditions, by either including a y-gradient in v, a z-gradient in w to balance the x-gradient of u or by removing the x-gradient in u (not sure what pressure decides to do here, but I suspect it will add lateral flow components). In principle, this is a different dynamical case because if you introduce lateral velocity then tendencies will not be null anymore and the flow is not steady anymore. In this case, the flow can experience numerical instabilities when CFL > 1

The out-of-bounds issue is not the problem of the CFL being larger than the grid size, it's a problem of the CFL being larger than the size of the halo. You see this issue when the CFL is larger than the grid size because you initialize your particles at x = 0. If you initialize your particles at x = Lx you will have out-of-bounds problems with a much lower CFL (still larger than 1)

This said we have a couple of ways to tackle this problem

we issue a warning when CFL > 1 (in theory we should not run a simulation with CFL > 1 but that might be desired when we prescribe the velocity)
we wrap around the particle location if it falls outside of the domain in the periodic direction

The issue of CPU vs GPU sounds quite concerning, I ll try to investigate what is going on in your script

glwagner commented 1 year ago

(in theory we should not run a simulation with CFL > 1 but that might be desired when we prescribe the velocity)

Aren't there time stepping schemes that can handle CFL > 1? This depends on the time-stepping scheme. Le and Moin 1991 claim that the theoretical limit for RK3 is 1.6. But there are other RK schemes with more stages that can handle even higher CFLs. I'm not sure there is a "theory" that places an absolute limit on the CFL for all possible schemes.

glwagner commented 1 year ago

You are moving out of the periodic grid because the particle is being advanced beyond the periodic domain.

How is it possible to leave a periodic domain? The domain is infinite if it is periodic.

simone-silvestri commented 1 year ago

yeah there might be some bug here https://github.com/CliMA/Oceananigans.jl/blob/d0b7ec8f98c860ce49927e0a7214961d2f47fb75/src/Models/LagrangianParticleTracking/lagrangian_particle_advection.jl#L28

glwagner commented 1 year ago

CliMA / Oceananigans.jl

CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS) when using Lagrangian particles under large CFL number #3320