JuliaDSP / DSP.jl

Filter design, periodograms, window functions, and other digital signal processing functionality
https://docs.juliadsp.org/stable/contents/

GPU acceleration #301

Open HDictus opened 5 years ago

HDictus commented 5 years ago

It seems it should be possible to GPU-accelerate the convolution in DSP.jl: FFTs are defined for GPUArrays types, and elementwise multiplication certainly is. I don't know how we'd need to change the padding, if at all, or whether overlap-save is feasible (or desirable, for that matter) on GPU. In any case, we could add a method for GPUArrays that dispatches the appropriate fft to the respective GPUArray subtypes, right? The only catch is that I remember reading that planned FFTs aren't defined for GPUArrays (or at least not for CuArrays). We currently have slightly different procedures for fft-ing and ifft-ing real and complex arrays, so adding more for GPUArrays could be bundled with these in a function that handles dispatching arrays to the appropriate FFTs.
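For reference, the identity this relies on is that zero-padding both signals to the full output length and multiplying their spectra gives exact linear convolution. A minimal NumPy sketch (the GPU version would substitute the corresponding GPU FFT calls; `fft_linconv` is an illustrative name, not DSP.jl API):

```python
import numpy as np

def fft_linconv(x, h):
    # Linear convolution via zero-padded FFTs: pad both signals to
    # len(x) + len(h) - 1 samples, multiply spectra, invert.
    n = len(x) + len(h) - 1
    X = np.fft.rfft(x, n)
    H = np.fft.rfft(h, n)
    return np.fft.irfft(X * H, n)

x = np.random.randn(64)
h = np.random.randn(9)
assert np.allclose(fft_linconv(x, h), np.convolve(x, h))  # matches direct convolution
```

The elementwise multiply and the FFTs are exactly the operations that already have GPU-array methods, which is what makes a generic-array dispatch plausible.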

I don't plan to work on this any time soon, but I thought I'd write it down while it's on my mind. Don't mind in the least if someone swoops in first.

btmit commented 4 years ago

I've started thinking about this a bit, at least in the context of overlap-save convolution. The way I think about efficient linear convolution is to break it down into a number of circular convolutions. To be GPU-friendly, you probably want to execute everything as one large operation. If you're filtering a long 1D data input with a short 1D reference, the simple 1D problem would look like

#            time domain buffered array           frequency domain zeropad reference
#           scrapped output |  valid output       
#          |---------------------------------|    |---------------------------------|
#          | data[oind,M-1] |   data[:,M]    |    |           fd_pad_ref            |
# ifft(fft(| data[oind,2]   |                |) * |           fd_pad_ref            |)
#          | data[oind,1]   |   data[:,2]    |    |           fd_pad_ref            |
#          |   residual     |   data[:,1]    |    |           fd_pad_ref            |
#          |---------------------------------|    |---------------------------------|

So a 1D problem becomes simple linear algebra on a 2D array, provided the data samples are reshaped and copied to the correct locations in the buffered array. In this in-place formulation, the data input and output reside at the same location in the buffered array. Note that the FFTs are 1D in this example.
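The diagram above can be sketched in NumPy: each row of the buffered array holds the tail of the previous chunk followed by the current chunk, a single batched 1D FFT/multiply/IFFT runs over all rows at once, and the scrapped region is discarded. A sketch under my own naming (`L` for the chunk hop; it assumes the filter is no longer than a chunk):

```python
import numpy as np

def overlap_save_batched(x, h, L=32):
    # One batched overlap-save pass: all chunks are convolved at once
    # with 1D transforms along the rows of a 2D buffered array.
    M = len(h)
    N = L + M - 1                    # circular-convolution length per row
    nchunks = -(-len(x) // L)        # ceil division
    xpad = np.concatenate([x, np.zeros(nchunks * L - len(x))])
    data = xpad.reshape(nchunks, L)
    # each row: [ scrapped output (M-1) | valid output (L) ]
    buf = np.zeros((nchunks, N))
    buf[:, M - 1:] = data                    # valid-output region
    buf[1:, :M - 1] = data[:-1, -(M - 1):]   # tail of the previous chunk
    H = np.fft.rfft(h, N)                    # frequency-domain zero-padded reference
    y = np.fft.irfft(np.fft.rfft(buf, axis=1) * H, N, axis=1)
    return y[:, M - 1:].ravel()[:len(x)]     # keep only the valid samples

x = np.arange(100.0)
h = np.array([1.0, -2.0, 3.0])
assert np.allclose(overlap_save_batched(x, h), np.convolve(x, h)[:len(x)])
```

On a GPU the two copies into `buf` and the batched transforms are the only operations needed; nothing here is inherently serial.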

I'd be curious what @galenlynch thinks of this dimensionality augmentation approach and how it could scale to a more general and N-D implementation. This seems fundamentally different than the current approach in some ways (but I could be wrong). The indexing needed to handle edges, faces, corners, etc in N dimensions still blows my mind a bit. Eventually we'd probably need a custom kernel to handle those data copies in parallel, but the rest would be straightforward.

Some pseudocode: I'm assuming there's some filtering structure `f` that stores precomputed values, allocated memory, and intermediate values between calls. I've also left out the indexing details.

# break data into chunks
data = reshape(in, :, nchunks)
# copy to the valid-output part of bufArray
f.bufArray[vreg] = data  # ? can we avoid this copy? (TODO: investigate CatViews)
# copy to the scrapped-output part of bufArray
f.bufArray[sreg] = data[oind, 1:(end - 1)]
# save the last data chunk
f.residual = data[oind, end]

# linear convolution of the entire array
f.pf! * f.bufArray  # forward 1D FFT
f.bufArray .*= f.fd_filt  # 1D filtering applied across the 2D array
f.pb! * f.bufArray  # backward 1D FFT

# copy the residual back in
f.bufArray[rreg] = f.residual

# copy valid data out of bufArray
data[:] = f.bufArray[vreg]  # ? this copy would then be avoided too?

galenlynch commented 4 years ago

Thanks for the thoughtful post @btmit. I'm in lab today but I'll read it in detail soon.

galenlynch commented 4 years ago

Sorry that I took so long to get to this, @btmit. I think what you outlined is quite similar to what is currently being done, though it's done serially instead of all at once, one row of your matrix diagram above at a time.

Is the point you're trying to make that, by formulating it as a matrix, you wouldn't have to add the parallelism yourself, as long as parallelism is already built into the Fourier transforms and matrix multiplications? In the current implementation I tried to reuse buffers as much as possible, both for the time-domain data being transformed and for the Fourier-domain transformation result. If you added parallelism to that approach, I think you would only need to allocate one set of buffers per thread. One possible downside of making everything a big linear algebra problem is that it would dramatically increase the amount of buffer space required: if I understand it correctly, a strictly linear-algebra formulation would need buffers allocated in proportion to the entire input signal (for both the time-domain and Fourier-domain steps), regardless of how many threads were running.

btmit commented 4 years ago

@galenlynch Thank you for taking a look. Your observations are almost entirely correct.

The one aspect I'll offer a different view on is that this approach doesn't necessarily require the entire signal to be broken up and tackled at once. The data can still be operated on in discrete batch sizes of your choosing (or dictated by the degree of parallelism available). You then perform the bulk operation I laid out on these batches serially. So long as you maintain the residual between runs, the solution should be correct. In the limit of a single degree of parallelism, the operation reduces to one "row" and the implementation should be largely identical to yours.
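To make the serial-batch idea concrete, here's a NumPy sketch of a filter state that processes one batch of chunks per call and carries the last M-1 input samples between calls, so consecutive batches join seamlessly. The `OverlapSave` name and `L` chunk length are illustrative, not DSP.jl API; it assumes each batch is a whole number of chunks and the filter is no longer than a chunk:

```python
import numpy as np

class OverlapSave:
    """Streaming overlap-save: batched circular convolutions per call,
    with the residual (tail of the previous batch) stitched in."""
    def __init__(self, h, L=32):
        self.M = len(h)
        self.L = L
        self.N = L + self.M - 1
        self.H = np.fft.rfft(h, self.N)         # precomputed filter spectrum
        self.residual = np.zeros(self.M - 1)    # tail carried between calls

    def filt(self, x):
        data = x.reshape(-1, self.L)            # batch of whole chunks
        buf = np.zeros((data.shape[0], self.N))
        buf[:, self.M - 1:] = data                            # valid region
        buf[0, :self.M - 1] = self.residual                   # carried tail
        buf[1:, :self.M - 1] = data[:-1, -(self.M - 1):]      # in-batch tails
        self.residual = data[-1, -(self.M - 1):].copy()
        y = np.fft.irfft(np.fft.rfft(buf, axis=1) * self.H, self.N, axis=1)
        return y[:, self.M - 1:].ravel()        # discard scrapped samples

x = np.random.randn(128)
h = np.array([0.5, 1.0, -1.0, 0.25, 2.0])
f = OverlapSave(h, L=32)
y = np.concatenate([f.filt(x[:64]), f.filt(x[64:])])  # two serial batches
assert np.allclose(y, np.convolve(x, h)[:128])
```

With a batch size of one chunk this degenerates to the row-at-a-time scheme; with larger batches each call is a single bulk FFT/multiply/IFFT, which is the GPU-friendly regime.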

I still haven't completely wrapped my head around the extension to N-D convolution, but the GPU libraries should support it generally. I bet you can visualize how this would work far better than I can.