Closed: telegraphic closed this issue 3 years ago
A bit more context: the input data is a spectrum with shape (T x F), where T is the number of timesteps (e.g. 16) and F is the number of channels (e.g. 2^20).
So memory access is strided, and with an offset: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#strided-accesses
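For intuition, here's a host-side NumPy model of that access pattern. This is a sketch only: `dedoppler_ref`, the drift loop, and the reduced F are my own illustrative assumptions, not the project's actual kernel.

```python
import numpy as np

T, F = 16, 1 << 16                     # real F is 2^20; smaller here for the sketch
rng = np.random.default_rng(0)
spectrum = rng.random((T, F), dtype=np.float32)

def dedoppler_ref(spec, drift):
    """Sum each channel along a straight drift track: at timestep t
    we read channel c + t*drift, so row t is read at offset t*drift,
    and consecutive timesteps are a stride of F elements apart."""
    T, F = spec.shape
    out = np.zeros(F, dtype=np.float32)
    for t in range(T):
        shift = t * drift              # per-row offset (drift >= 0 here)
        out[:F - shift] += spec[t, shift:]
    return out

# drift 0 reduces to a plain column sum over time
assert np.allclose(dedoppler_ref(spectrum, 0), spectrum.sum(axis=0))
```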
The dedoppler kernel is not the main bottleneck in the code, but it would be nice to understand how much faster it could go.
A lot has changed since the Maxwell architecture. I'm not sure storing data in texture memory would pay off in this use case. As I understand it, the main advantage of this alternative memory path is that it frees up bandwidth on the ordinary global-memory path. It might also introduce the overhead of copying data from CPU-mapped memory into texture memory. Shared memory appears to be a better alternative here.
AFAIK, the main benefit of mapping through a texture object is (was?) that you get a "free" conversion from 8- or 16-bit signed integers to 32-bit float. This reduces memory usage and is (presumably?) faster than a cast instruction, but it imposes some limits on input array size.
I found this paper explaining the method they use to implement de-dispersion on the GPU. They mention texture memory on "Fermi" and "Pre-Fermi" GPUs. This is of course outdated but might be useful:

> On pre-Fermi GPU hardware, the use of texture memory resulted in a speed-up of around 5× compared to using plain device memory, highlighting the importance of understanding the details of an algorithm's memory access patterns when using these architectures. With the advent of Fermi-class GPUs, however, the situation has improved significantly. These devices contain an L1 cache that provides many of the advantages of using texture memory without having to explicitly refer to a special memory area. Using texture memory on Fermi-class GPUs was slightly slower than using plain device memory (with L1 cache enabled), as suggested in the CUDA programming guide.
I think the only way to know for sure which memory topology is best is through benchmarking.
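For what it's worth, `cupyx.profiler.benchmark` handles the GPU synchronization and reports both CPU and GPU timings, so comparing variants is cheap to set up. A sketch (needs a CUDA-capable GPU; `dedoppler` is a hypothetical stand-in for whichever kernel variant is being timed):

```python
import cupy as cp
from cupyx.profiler import benchmark

# shapes from the thread: 16 timesteps x 2^20 channels
spectrum = cp.random.rand(16, 1 << 20, dtype=cp.float32)

def run_variant(spec):
    return dedoppler(spec)   # hypothetical kernel under test

# prints mean/std of CPU and GPU times over the repeats
print(benchmark(run_variant, (spectrum,), n_repeat=100))
```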
@david-macmahon This is a very interesting use case. I didn't know about it!
Ok, let's scrap texture memory. Thanks @luigifcruz for the research!
This one is for a GPU-minded individual.
The memory access pattern has multiple threads reading the same value. This sounds like a good place to use shared or texture memory:
https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/
http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html
Here's the kernel in question:
How can we speed this up when calling it from cupy? It looks like there is some support for raw kernels, but it seems a little complex...
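The `RawKernel` route is less complex than it looks. Here's a minimal sketch (needs a CUDA GPU): a toy kernel that just sums each channel over time, so the kernel body, names, and launch shape are illustrative assumptions; the real dedoppler source would be dropped into the same template.

```python
import cupy as cp

src = r'''
extern "C" __global__
void sum_time(const float* spec, float* out, int T, int F) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < F) {
        float acc = 0.0f;
        for (int t = 0; t < T; t++)
            acc += spec[(long long)t * F + c];  // per-thread stride of F
        out[c] = acc;
    }
}
'''
kernel = cp.RawKernel(src, 'sum_time')

T, F = 16, 1 << 20
spec = cp.random.rand(T, F, dtype=cp.float32)
out = cp.empty(F, dtype=cp.float32)
threads = 256
kernel(((F + threads - 1) // threads,), (threads,), (spec, out, T, F))
assert cp.allclose(out, spec.sum(axis=0))
```

Note that although each thread walks memory with a stride of F, adjacent threads in a warp read adjacent channels, so the per-timestep loads are still coalesced.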