CODE OPTIMIZATION: Figure out how to use shared / texture memory for dedoppler kernel

telegraphic commented 3 years ago

This one is for a GPU-minded individual.

The memory access pattern will see multiple threads trying to read the same value. This sounds like a good place to use shared or texture memory: https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/ http://cuda-programming.blogspot.com/2013/02/texture-memory-in-cuda-what-is-texture.html

Here's the kernel in question:

extern "C" __global__
    __global__ void dedopplerKernel
        (const float *data, float *dedopp, int *shift, int F, int T)
        /* Each thread computes a different dedoppler sum for a given channel

         F: N_frequency channels
         T: N_time steps

         *data: Data array, (T x F) shape
         *dedopp: Dedoppler summed data, (D x F) shape
         *shift: Array of doppler corrections of length D.
                 shift is total number of channels to shift at time T
        */
        {

        // Setup thread index
        const int tid = blockIdx.x * blockDim.x + threadIdx.x;
        const int d   = blockIdx.y;   // Dedoppler trial ID
        const int D   = gridDim.y;   // Number of dedoppler trials
        // Index for output array
        const int dd_idx = d * F + tid;
        float dd_val = 0;

        int idx = 0;
        for (int t = 0; t < T; t++) {
                            // timestep    // dedoppler trial offset
            idx  = tid + (F * t)      + (shift[d] * t / T);
            if (idx < F * T && idx > 0) {
                dd_val += data[idx];
              }
              dedopp[dd_idx] = dd_val;
            }
        }

How can we speed this up when calling it from cupy? It looks like there is some support for raw kernels, but a little complex...

telegraphic commented 3 years ago

A bit more context: the input data is a spectrum with shape (T x F), where T is number of timesteps (e.g. 16) and F is number of channels (e.g. 2^20).

So memory access is strided, and with an offset: https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#strided-accesses

The dedoppler kernel is not the main bottleneck in the code, but would be nice to understand how much faster it could go.

luigifcruz commented 3 years ago

A lot has changed since the Maxwell architecture. I'm not sure if storing data inside the texture memory would pay off in this use case. From my understanding, the main advantage of using this alternative memory is to free up bandwidth from the compute memory. This also might introduce an overhead of copying data from the CPU-mapped memory to texel memory. Shared memory appears to be a better alternative in this case.

david-macmahon commented 3 years ago

AFAIK, main benefit of mapping through a texture object is (was?) that one can take advantage of a "free" conversion from 8 or 16 bit signed integer to 32-bit float. This reduces memory usage and is (presumably?) faster than a cast instruction, but it imposes some limits on input array size.

luigifcruz commented 3 years ago

I found this paper explaining the method they use to implement de-dispersion on the GPU. They mention the texture memory on "Fermi" and "Pre-Fermi" GPUs. This is of course outdated but might be useful.

On pre-Fermi GPU hardware, the use of texture memory resulted in a speed-up of around 5× compared to using plain device memory, highlighting the importance of understanding the details of an algorithm’s memory access patterns when using these architectures. With the advent of Fermi-class GPUs, however, the situation has improved significantly. These devices contain an L1 cache that provides many of the advantages of using texture memory without having to explicitly refer to a special memory area. Using texture memory on Fermi-class GPUs was slightly slower than using plain device memory (with L1 cache enabled), as suggested in the CUDA programming guide.

I think the only way to know for sure which is the best memory topology is thru benchmarking.

luigifcruz commented 3 years ago

@david-macmahon This is a very interesting use case. I didn't know about it!

telegraphic commented 3 years ago

Ok let's scrap texture memory, thanks @luigifcruz for the research!

UCBerkeleySETI / hyperseti

CODE OPTIMIZATION: Figure out how to use shared / texture memory for dedoppler kernel #24