JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Constructing shared memory on the CPU should fail #1047

Open robertstrauss opened 3 years ago

robertstrauss commented 3 years ago

Describe the bug

Accessing an index of an array in shared memory (allocated outside a kernel) throws an illegal memory access error. If a CuDeviceArray (an array in GPU shared memory) is passed to a simple kernel that sets (or gets) its values, this error is encountered. The error doesn't show up until after the kernel is compiled and run and some other CUDA operation is performed (another kernel launch, or just CUDA.synchronize()).

To reproduce


import CUDA

function kernel(arr)
    i = CUDA.threadIdx().x
    arr[i] = 1.0
    return
end

arr = rand(20)

arrshared = CUDA.@cuStaticSharedMem(eltype(arr), size(arr))

copyto!(arrshared, arr)

CUDA.@cuda threads=20 kernel(arrshared)

CUDA.synchronize() # needed to make error message show

throws the error:

ERROR: CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/.julia/packages/CUDA/02Kjq/lib/cudadrv/error.jl:105
 [2] query
   @ ~/.julia/packages/CUDA/02Kjq/lib/cudadrv/stream.jl:102 [inlined]
 [3] synchronize(stream::CUDA.CuStream; blocking::Bool)
   @ CUDA ~/.julia/packages/CUDA/02Kjq/lib/cudadrv/stream.jl:117
 [4] synchronize (repeats 2 times)
   @ ~/.julia/packages/CUDA/02Kjq/lib/cudadrv/stream.jl:117 [inlined]
 [5] top-level scope
   @ ~/.julia/packages/CUDA/02Kjq/src/initialization.jl:54
Manifest.toml

```toml
[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "TimerOutputs"]
git-tree-sha1 = "8ef71bf6d6602cf227196b43650924bf9ef7babc"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "3.3.3"
```

Device: NVIDIA GTX 1080
OS: Ubuntu 18.04 LTS

Expected behavior

The shared memory can be allocated outside the kernel and used inside it, keeping the data in shared rather than global memory. No error is thrown, and the kernel successfully sets the value of each index of the shared array.

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9700K CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)

Details on CUDA:

julia> CUDA.versioninfo()

CUDA toolkit 11.3.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.80.0

Libraries: 
- CUBLAS: 11.5.1
- CURAND: 10.2.4
- CUFFT: 10.4.2
- CUSOLVER: 11.1.2
- CUSPARSE: 11.6.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.80
- CUDNN: 8.20.0 (for CUDA 11.3.0)
- CUTENSOR: 1.3.0 (for CUDA 11.2.0)

Toolchain:
- Julia: 1.6.1
- LLVM: 11.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: GeForce GTX 1080 (sm_61, 7.442 GiB / 7.921 GiB available)
(@v1.6) pkg> status
        Status `~/.julia/environments/v1.6/Project.toml`
     [052768ef] CUDA v3.3.3

Am I using shared memory wrong? The same code works if arrshared is just a CuArray (in global memory rather than shared). Could this be an issue specific to my device? This isn't specific to setting a value: getindex gives the same error as setindex! here.

vchuravy commented 3 years ago

You need to call @cuStaticSharedMem inside the kernel, not outside it.
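For reference, a sketch of the intended pattern (untested here; shared memory is declared inside the kernel, and the size must be a compile-time constant rather than something derived from a runtime array):

```julia
import CUDA

function kernel(arr)
    # Shared memory is declared inside the kernel, with a constant size.
    shared = CUDA.@cuStaticSharedMem(Float64, 20)
    i = CUDA.threadIdx().x
    shared[i] = arr[i]    # each thread stages one element from global memory
    CUDA.sync_threads()   # wait until the whole block has populated it
    shared[i] += 1.0      # operate on the fast shared copy
    arr[i] = shared[i]    # write results back to global memory
    return
end

arr = CUDA.rand(20)
CUDA.@cuda threads=20 kernel(arr)
```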

robertstrauss commented 3 years ago

Thanks for the reply. Wouldn't allocating and moving data to shared memory inside the kernel nullify any performance gain of using shared memory? I tried to move the array to shared memory before calling the kernel to speed things up, so the kernel doesn't have to go all the way to global memory. Am I misunderstanding something about how shared memory is supposed to be used? Or do I just have to move it to shared with one kernel, rather than at the start of every single other kernel?

robertstrauss commented 3 years ago

What I'm trying to do is this: I have a simulation, where the state is some array in global memory. I want to perform many time steps on the GPU before copying the output back to the CPU. Since I'm calling the same time-step kernel multiple times, I thought it could be more efficient to move the data from global to shared memory beforehand, since it will be read and written to many times. Is this something shared memory is useful for?

maleadt commented 3 years ago

I thought it could be more efficient to move the data from global to shared memory beforehand, since it will be read and written to many times.

That's not how shared memory works. It's a buffer on the streaming multiprocessor a block is executing on, so it's only available when your kernel is actually executing, and not beforehand. And you need to populate it from every block.
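To make that concrete, here is a hypothetical sketch of the usual pattern for a time-step kernel like the one described: each block re-populates its shared-memory tile from global memory at the start of every launch, synchronizes, computes, and writes back (the stencil and sizes here are purely illustrative):

```julia
import CUDA

# Illustrative 3-point averaging stencil; each block stages its tile
# into shared memory at the start of every kernel launch.
function step_kernel(state, next)
    tile = CUDA.@cuStaticSharedMem(Float32, 256)   # one tile per block
    i = CUDA.threadIdx().x
    g = (CUDA.blockIdx().x - 1) * CUDA.blockDim().x + i
    tile[i] = state[g]          # populate shared memory from global memory
    CUDA.sync_threads()         # the whole block must finish loading first
    lo = i > 1 ? tile[i-1] : tile[i]
    hi = i < CUDA.blockDim().x ? tile[i+1] : tile[i]
    next[g] = (lo + tile[i] + hi) / 3f0
    return
end

state = CUDA.rand(Float32, 1024)
next = similar(state)
for t in 1:100   # shared memory is re-populated on every launch
    CUDA.@cuda threads=256 blocks=4 step_kernel(state, next)
    state, next = next, state
end
```

The repeated per-launch copy is cheap relative to the repeated reads it replaces: each element is read from global memory once per launch instead of several times per stencil evaluation.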

I'm surprised this doesn't fail. We should probably use device overrides to protect against calling this functionality from the CPU.
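A hedged sketch of what such a guard could look like, using Julia's experimental overlay method tables (the mechanism GPUCompiler.jl builds on; all names below are illustrative, not the actual CUDA.jl implementation):

```julia
# Illustrative only: an overlay method table lets the GPU compiler see a
# different method than the host does (Julia >= 1.7 experimental API).
Base.Experimental.@MethodTable(device_method_table)

# Host definition: fail loudly instead of handing back a bogus pointer.
alloc_shared(::Type{T}, len) where {T} =
    error("shared memory can only be allocated inside a GPU kernel")

# Device override: when compiling kernels against the overlay table, this
# method replaces the host one and would perform the real allocation.
Base.Experimental.@overlay device_method_table alloc_shared(::Type{T}, len) where {T} =
    nothing  # stand-in for the actual shared-memory intrinsic
```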

robertstrauss commented 3 years ago

Ah, I see now. Thanks for removing the bug label.

maleadt commented 3 years ago

It's still kind of a bug 🙂 But I like to keep the label for more pressing bugs, whereas this is more of a misuse of the API that should be disallowed.