CliMA / Oceananigans.jl

🌊 Julia software for fast, friendly, flexible, ocean-flavored fluid dynamics on CPUs and GPUs
https://clima.github.io/OceananigansDocumentation/stable
MIT License

GPU illegal memory access #3267

Closed jagoosw closed 1 year ago

jagoosw commented 1 year ago

Hi all,

I'm stuck trying to debug an error I keep getting when running a non-hydrostatic model on GPU.

It runs for a bit and then throws this error:

... (loads of similar CUDA stuff that goes on for a very very long time)
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170 [inlined]
 [16] context!
    @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
 [17] unsafe_free!(xs::CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer}, stream::CUDA.CuStream) 
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:129
 [18] unsafe_finalize!(xs::CUDA.CuArray{ComplexF64, 3, CUDA.Mem.DeviceBuffer})
    @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:150
 [19] top-level scope
    @ ~/.julia/packages/InteractiveErrors/JOo2y/src/InteractiveErrors.jl:329
 [20] eval
    @ ./boot.jl:370 [inlined]
 [21] eval_user_input(ast::Any, backend::REPL.REPLBackend, mod::Module)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:153
 [22] repl_backend_loop(backend::REPL.REPLBackend, get_module::Function)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:249
 [23] start_repl_backend(backend::REPL.REPLBackend, consumer::Any; get_module::Function)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:234
 [24] run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on_current_task::Bool, backend::Any)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:379
 [25] run_repl(repl::REPL.AbstractREPL, consumer::Any)
    @ REPL /rds/user/js2430/hpc-work/julia-1.9.2/share/julia/stdlib/v1.9/REPL/src/REPL.jl:365
 [26] (::Base.var"#1017#1019"{Bool, Bool, Bool})(REPL::Module)
    @ Base ./client.jl:421
 [27] #invokelatest#2
    @ ./essentials.jl:816 [inlined]
 [28] invokelatest
    @ ./essentials.jl:813 [inlined]
 [29] run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, history_file::Bool, color_set::Bool)
    @ Base ./client.jl:405
 [30] exec_options(opts::Base.JLOptions)
    @ Base ./client.jl:322
 [31] _start()
    @ Base ./client.jl:522
LoadError: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
in expression starting at /rds/user/js2430/hpc-work/Eady/eady.jl:133
     (stacktrace)
       (user)
       CUDA
         throw_api_error ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
         cuOccupancyMaxPotentialBlockSize ~/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26
         #launch_configuration#875 ~/.julia/packages/CUDA/35NC6/lib/cudadrv/occupancy.jl:63
         #mapreducedim!#1119 ~/.julia/packages/CUDA/35NC6/src/mapreduce.jl:236
       GPUArrays
         #_mapreduce#31 ~/.julia/packages/GPUArrays/5XhED/src/host/mapreduce.jl:69
       Oceananigans.Solvers
         solve! ~/.julia/packages/Oceananigans/mwXt0/src/Solvers/fourier_tridiagonal_poisson_solver.jl:134
       Oceananigans.Models.NonhydrostaticModels
         calculate_pressure_correction! ~/.julia/packages/Oceananigans/mwXt0/src/Models/NonhydrostaticModels/pressure_correction.jl:15
       Oceananigans.TimeSteppers
         #time_step!#8 ~/.julia/packages/Oceananigans/mwXt0/src/TimeSteppers/runge_kutta_3.jl:138
       Oceananigans.Simulations
         time_step! ~/.julia/packages/Oceananigans/mwXt0/src/Simulations/run.jl:134
         #run!#7 ~/.julia/packages/Oceananigans/mwXt0/src/Simulations/run.jl:97
         run! ~/.julia/packages/Oceananigans/mwXt0/src/Simulations/run.jl:85
       [top-level]
       (system)

I can't capture the whole error message because it's longer than the screen, but this seems to be the relevant bit when viewed with InteractiveErrors.

If I make the grid smaller it gets more iterations done before it errors, but it is nowhere near using all of the GPU's memory (A100 with 80 GB, and the model is about 2 GB at 256x256x64).

This is with the latest version of Oceananigans (0.87.4). I'll try to make an MWE.
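For context: illegal-address errors (code 700) are reported asynchronously, so the stacktrace points at the next CUDA API call (here the occupancy query inside the pressure solver), which may be far from the kernel that actually faulted. Two standard ways to localise the fault (the command lines are illustrative; `eady.jl` is the script from the LoadError above):

```shell
# Force synchronous kernel launches so the error is raised at the
# offending launch site rather than at a later, unrelated API call:
CUDA_LAUNCH_BLOCKING=1 julia --project eady.jl

# Or run under NVIDIA's compute-sanitizer to get an exact report of
# out-of-bounds accesses, including the faulting kernel and thread:
compute-sanitizer julia --project eady.jl
```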

jagoosw commented 1 year ago

When I exit the REPL I get a very long error message ending:

```
WARNING: Error while freeing DeviceBuffer(568 bytes at 0x0000000320000400):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
     @ CUDA ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
  [2] check
     @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:34 [inlined]
  [3] cuMemFreeAsync
     @ ~/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26 [inlined]
  [4] #free#2
     @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:97 [inlined]
  [5] free
     @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/memory.jl:92 [inlined]
  [6] #actual_free#976
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:77 [inlined]
  [7] actual_free
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:74 [inlined]
  [8] #_free#998
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:492 [inlined]
  [9] _free
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:479 [inlined]
 [10] macro expansion
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:464 [inlined]
 [11] macro expansion
     @ ./timing.jl:393 [inlined]
 [12] #free#997
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:463 [inlined]
 [13] free
     @ ~/.julia/packages/CUDA/35NC6/src/pool.jl:452 [inlined]
 [14] (::CUDA.var"#1004#1005"{CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, CUDA.CuStream})()
     @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:130
 [15] #context!#887
     @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170 [inlined]
 [16] context!
     @ ~/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
 [17] unsafe_free!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}, stream::CUDA.CuStream)
     @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:129
 [18] unsafe_finalize!(xs::CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer})
     @ CUDA ~/.julia/packages/CUDA/35NC6/src/array.jl:150
WARNING: Error while freeing DeviceBuffer(560 bytes at 0x0000000320000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
Stacktrace: (identical frames [1]-[18] as above)
error in running finalizer: CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
throw_api_error at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:27
check at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/libcuda.jl:34 [inlined]
cuStreamDestroy_v2 at /home/js2430/.julia/packages/CUDA/35NC6/lib/utils/call.jl:26 [inlined]
#834 at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/stream.jl:86 [inlined]
#context!#887 at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:170
unknown function (ip: 0x7f08bc0a0880)
context! at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/state.jl:165 [inlined]
unsafe_destroy! at /home/js2430/.julia/packages/CUDA/35NC6/lib/cudadrv/stream.jl:85
unknown function (ip: 0x7f08bc0a0622)
_jl_invoke at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gf.c:2758 [inlined]
ijl_apply_generic at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gf.c:2940
run_finalizer at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gc.c:417
jl_gc_run_finalizers_in_list at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gc.c:507
run_finalizers at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/gc.c:553
ijl_atexit_hook at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/init.c:299
jl_repl_entrypoint at /cache/build/default-amdci5-2/julialang/julia-release-1-dot-9/src/jlapi.c:718
main at julia (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x401098)
```
jagoosw commented 1 year ago

Trying to make an MWE, I can't reproduce the error without running all of my code, so perhaps the bug isn't actually in the pressure solver even though that's where the error is raised.

jagoosw commented 1 year ago

So in this run I have a load of `update_tendencies!` calls, and adding `synchronize(device(architecture(model)))` at the end appears to have fixed it.

To summarise:
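A minimal sketch of the workaround, assuming the Oceananigans/KernelAbstractions API shown in the error above; the callback name and `update_tendencies!` are placeholders for the user-side kernels:

```julia
using Oceananigans
using Oceananigans.Architectures: architecture, device
using KernelAbstractions: synchronize

# Hypothetical user callback that launches custom kernels each time step.
function my_forcing_callback!(simulation)
    model = simulation.model
    update_tendencies!(model)  # placeholder: launches GPU kernels asynchronously

    # Block until every queued kernel has finished before Oceananigans moves
    # on to the pressure solve; without this, a fault in one of the kernels
    # above can surface later as an illegal memory access inside an
    # unrelated CUDA call.
    synchronize(device(architecture(model)))

    return nothing
end
```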

glwagner commented 1 year ago

Do you know why the manual synchronize is needed?

jagoosw commented 1 year ago

No, I'll try making an MWE.

glwagner commented 1 year ago

Are all GPU operations KernelAbstractions? Or do you have other stuff sprinkled in?

jagoosw commented 1 year ago

All KernelAbstractions.
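For context, KernelAbstractions launches return to the host immediately with the work queued on the backend, which is why a faulting kernel can be reported from a later, unrelated API call. A minimal sketch of the launch/synchronize pattern (runnable on the CPU backend; on an NVIDIA GPU the same pattern applies with `CUDABackend()` from CUDA.jl):

```julia
using KernelAbstractions

@kernel function add_one!(a)
    i = @index(Global)
    @inbounds a[i] += 1
end

backend = CPU()  # CUDABackend() on an NVIDIA GPU
a = KernelAbstractions.zeros(backend, Float64, 16)

add_one!(backend, 16)(a, ndrange = length(a))  # returns immediately; work is queued
KernelAbstractions.synchronize(backend)        # wait before the host reads `a`
```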

Yixiao-Zhang commented 12 months ago

I found a similar problem (see #3320), but I am not sure whether it is related.

I do not know whether `synchronize(device(architecture(model)))` will solve my problem.