robertstrauss opened this issue 3 years ago
You need to call `@cuStaticSharedMem` inside the kernel, not outside it.
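A minimal sketch of what that looks like (the kernel body and names here are illustrative, not from the original report): the shared-memory allocation lives inside the kernel, one buffer per block, and is staged from global memory by the kernel's own threads.

```julia
using CUDA

# Sketch of correct usage: the shared-memory allocation is *inside*
# the kernel body, so each block gets its own buffer.
function kernel(a)
    shmem = @cuStaticSharedMem(Float32, 256)  # per-block shared buffer
    i = threadIdx().x
    shmem[i] = a[i]         # stage from global into shared memory
    sync_threads()          # make the staged data visible to all threads
    a[i] = 2f0 * shmem[i]   # use it, then write results back to global
    return
end

a = CUDA.rand(Float32, 256)
@cuda threads=256 kernel(a)
```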
Thanks for the reply. Wouldn't allocating and moving data to shared memory inside the kernel nullify any performance gain of using shared memory? I tried to move the array to shared memory first before calling the kernel to speed things up so it doesn't have to go all the way to global memory. Am I misunderstanding something about how shared memory is supposed to be used? Or do I just have to move it to shared with one kernel function, but not at the start of every single other kernel function?
What I'm trying to do is this: I have a simulation, where the state is some array in global memory. I want to perform many time steps on the GPU before copying the output back to the CPU. Since I'm calling the same time-step kernel multiple times, I thought it could be more efficient to move the data from global to shared memory beforehand, since it will be read and written to many times. Is this something shared memory is useful for?
> I thought it could be more efficient to move the data from global to shared memory beforehand, since it will be read and written to many times.
That's not how shared memory works. It's a buffer on the streaming multiprocessor a block is executing on, so it's only available while your kernel is actually executing, not beforehand. And every block needs to populate its own copy.
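Since the shared buffer only exists for the lifetime of a block, the usual pattern for a repeated time step is to keep the state in global memory between launches and re-stage it into shared memory at the start of every launch. A hedged sketch (the stencil and sizes are made up for illustration):

```julia
using CUDA

# State persists in *global* memory between launches; each launch stages
# its block's data into shared memory, computes, and writes back.
function step!(state)
    tile = @cuStaticSharedMem(Float32, 256)
    i = threadIdx().x
    tile[i] = state[i]                    # load: global -> shared, every launch
    sync_threads()
    left  = tile[i == 1   ? 256 : i - 1]  # toy stencil with wraparound
    right = tile[i == 256 ? 1   : i + 1]
    state[i] = (left + tile[i] + right) / 3f0  # store: shared -> global
    return
end

state = CUDA.rand(Float32, 256)
for _ in 1:100                            # many time steps, one launch each
    @cuda threads=256 step!(state)
end
```

The repeated load from global memory is the price of admission; shared memory only pays off for the reuse *within* a single launch.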
I'm surprised this doesn't fail. We should probably use device overrides to protect against calling this functionality from the CPU.
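The idea, roughly, is to give the host-side method a fallback that fails loudly instead of producing an invalid device pointer. A hypothetical sketch of that behavior (CUDA.jl's actual mechanism, GPUCompiler method overrides, works differently; this only illustrates the intent):

```julia
# Hypothetical sketch: calling the shared-memory allocator on the host
# should raise immediately rather than silently return a bad pointer.
function static_shared_mem_host_fallback(::Type{T}, dims) where {T}
    error("shared-memory allocation is only valid inside a kernel")
end
```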
Ah, I see now. Thanks for removing the bug label.
It's still kind of a bug :slightly_smiling_face: But I like to keep the label for more pressing bugs, whereas this is more of a misuse of the API that should be disallowed.
Describe the bug
Accessing an index of an array in shared memory (allocated outside a kernel) throws an illegal memory access error. If a `CuDeviceArray` (an array in GPU shared memory) is passed to a simple kernel that sets (or gets) its values, this error is encountered. The error doesn't show up until after the kernel has been compiled and run and another CUDA operation is performed (another kernel launch, or just `CUDA.synchronize()`).
To reproduce
throws the error:
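The original snippet was not preserved; the following is a hedged reconstruction from the description above (names such as `arrshared` and `kernel!` are assumptions), showing a shared-memory array created outside the kernel and then indexed inside it:

```julia
using CUDA

# Reconstruction of the misuse described in the report: the shared-memory
# array is allocated on the host, outside any kernel.
function kernel!(arr)
    i = threadIdx().x
    arr[i] = 1f0       # indexing the device array inside the kernel
    return
end

arrshared = @cuStaticSharedMem(Float32, 16)  # allocated outside a kernel
@cuda threads=16 kernel!(arrshared)
CUDA.synchronize()     # the illegal memory access surfaces here
```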
Manifest.toml:

```
[[CUDA]]
deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "DataStructures", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "TimerOutputs"]
git-tree-sha1 = "8ef71bf6d6602cf227196b43650924bf9ef7babc"
uuid = "052768ef-5323-5732-b1bb-66c8b64840ba"
version = "3.3.3"
```

Device: NVIDIA GTX 1080
OS: Ubuntu 18.04 LTS
Expected behavior
The shared memory can be allocated outside the kernel and used inside it to keep the data in shared rather than global memory: no error is thrown, and the kernel successfully sets the value of each index of the shared array.
Version info
Details on Julia:
Details on CUDA:
Am I using shared memory wrong? The same code works if `arrshared` is just a `CuArray` (in global memory rather than shared). Could this be an issue specifically with my device? This isn't specific to setting a value; `getindex` gives the same error as `setindex!`.