CliMA / ClimaShallowWater.jl


CUDA error on single GPU with --float-type=Float32 --panel-size=128. Possible OOM? #16

Closed bloops closed 1 year ago

bloops commented 1 year ago

I'm trying to do a scaling analysis on a single GPU, but using --panel-size=128 leads to a CUDA error. I presume this might be an OOM, since the code works for panel sizes up to 96.

However, I think the spectral element code would only need arrays of size O(EN²) (or possibly O(EN⁴) intermediate arrays), where E = 6·(panel size)² is the number of elements and N = 4 is the number of nodes per element edge. That still amounts to less than 100 MiB, so is an OOM expected?
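
As a rough sanity check (back-of-the-envelope arithmetic only, not the actual ClimaCore allocation pattern), here is what that estimate works out to:

panel_size = 128
N = 4                        # nodes per element edge
E = 6 * panel_size^2         # number of elements on the cubed sphere
bytes = sizeof(Float32)

E * N^2 * bytes / 2^20       # one O(EN²) field ≈ 6 MiB
E * N^4 * bytes / 2^20       # one O(EN⁴) intermediate ≈ 96 MiB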

Here's the stack trace (full log attached: shallowwater_float32_panelsize128_stacktrace.txt).

anudhyan_google_com@julia-multigpu-5:~/ClimaShallowWater.jl$ ./shallowwater_step --output-nsteps=10000 --float-type=Float32 --panel-size=128
┌ Info: Setting up experiment
│   device = ClimaComms.CUDA()
│   context = ClimaComms.SingletonCommsContext{ClimaComms.CUDA}(ClimaComms.CUDA())
│   testcase = ClimaShallowWater.SteadyStateTest(ClimaShallowWater.SphericalParameters(6.37122e6, 7.292e-5, 9.80616, 0.0015, 0.0), 38.61068276698372, 2998.1154702758267)
│   float_type = Float32
│   panel_size = 128
│   poly_nodes = 4
│   time_step = 360.0
│   time_end = 172800.0
│   approx_resolution = 24011.578f0
└   D₄ = 3.5606317f12
┌ Info: Saving state
│   n = 0
│   output_file = "output/state_000000.hdf5"
└   t = 0.0f0
ERROR: LoadError: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:89
  [2] macro expansion
    @ ~/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:97 [inlined]
  [3] cuLaunchKernel(f::CUDA.CuFunction, gridDimX::UInt32, gridDimY::UInt32, gridDimZ::UInt32, blockDimX::UInt32, blockDimY::UInt32, blockDimZ::UInt32, sharedMemBytes::Int64, hStream::CUDA.CuStream, kernelParams::Vector{Ptr{Nothing}}, extra::Ptr{Nothing})
    @ CUDA ~/.julia/packages/CUDA/BbliS/lib/utils/call.jl:26
  [4] #39
    @ ~/.julia/packages/CUDA/BbliS/lib/cudadrv/execution.jl:69 [inlined]
  [5] macro expansion
    @ ~/.julia/packages/CUDA/BbliS/lib/cudadrv/execution.jl:33 [inlined]
  [6] macro expansion
    @ ./none:0 [inlined]

 <snip>....

  [30] run_using_step(ARGS::Vector{String})
    @ ClimaShallowWater ~/ClimaShallowWater.jl/src/driver.jl:180
  [31] run_using_step()
    @ ClimaShallowWater ~/ClimaShallowWater.jl/src/driver.jl:177
  [32] top-level scope
    @ ~/ClimaShallowWater.jl/shallowwater_step:7
in expression starting at /home/anudhyan_google_com/ClimaShallowWater.jl/shallowwater_step:7
simonbyrne commented 1 year ago

I can replicate it with

using ClimaShallowWater, ClimaComms
context = ClimaComms.SingletonCommsContext(ClimaComms.CUDA())
ClimaComms.init(context)
testcase = ClimaShallowWater.SteadyStateTest()
space = ClimaShallowWater.create_space(
        context,
        testcase;
        float_type=Float32,
        panel_size=128,
        poly_nodes=4,
    )

Y = ClimaShallowWater.initial_condition(space, testcase)
p = ClimaShallowWater.auxiliary_state(Y, testcase)
ClimaShallowWater.dss!(Y, p)
dY = ClimaShallowWater.similar(Y)

ClimaShallowWater.tendency!(dY, Y, p, zero(Float32))
simonbyrne commented 1 year ago

Ah, I think I figured it out. We launch the spectral element operators with the elements in the Y dimension of the block grid: https://github.com/CliMA/ClimaCore.jl/blob/e7c7e9c245ab55aaf5dd2d2d121eacc5c5de7607/src/Operators/spectralelement.jl#L276

However, gridDim.y has a limit of 65535 (https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications), which matches up with what @bloops saw:

julia> 128*128*6
98304

julia> 96*96*6
55296

We can fix this in ClimaCore by changing the order of the grid dimensions, so that the element count goes in the X dimension, which allows up to 2³¹ − 1 blocks.
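
For reference, the limit is easy to reproduce in isolation with CUDA.jl and a dummy kernel; this is just a sketch of the launch-dimension limits, not the ClimaCore launch code:

using CUDA

noop() = nothing                  # trivial kernel; only the launch configuration matters
k = @cuda launch=false noop()

# one block per element: 6 * 128^2 = 98304 blocks
try
    k(; threads=(4, 4), blocks=(1, 98304))    # gridDim.y > 65535 -> ERROR_INVALID_VALUE
catch err
    @error "launch with blocks=(1, 98304) failed" err
end

k(; threads=(4, 4), blocks=(98304, 1))        # gridDim.x allows up to 2^31 - 1 blocks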