Nodd closed this issue 11 years ago.
Is it a bug, or am I using it wrong?
FFT, unlike other computations in reikna, has a huge template and a lot of helper logic to render it, so 20 ms per preparation (plus the first call to the CUDA compiler, unless you have run this file at least once before) seems plausible.
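For illustration, here is a minimal sketch that separates the one-time preparation cost from the per-call cost. It mirrors the prepare_for-style calls used in the script further down in this thread; the 256-point size, the 100 iterations and the thr.synchronize() call are my own assumptions, not something from the original script.

import time

import numpy as np
from reikna import cluda
from reikna.fft import FFT

api = cluda.ocl_api()
thr = api.Thread.create()

data = np.random.rand(256) + 1j * np.random.rand(256)
in_dev = thr.to_device(data)
out_dev = thr.empty_like(in_dev)

t0 = time.time()
# Template rendering + kernel compilation happen here, once.
fft = FFT(thr).prepare_for(out_dev, in_dev, -1, axes=(0,))
t1 = time.time()

for _ in range(100):
    fft(out_dev, in_dev, -1)  # reuses the already compiled kernel
thr.synchronize()  # assumed available: wait for queued kernels before stopping the clock
t2 = time.time()

print("prepare_for: %.1f ms" % ((t1 - t0) * 1000))
print("per call:    %.3f ms" % ((t2 - t1) * 1000 / 100))

The first number covers template rendering and compilation; the second reflects only the kernel launches.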
Do I have to create a new computation if the sizes of the arrays are the same, but the data is different?
No, if the arrays (their shape and dtype, not their data) stay the same, the computation is intended to be reused. It behaves pretty much the same way as the old PyFFT plan.
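A minimal sketch of that reuse pattern, assuming the same prepare_for-style API as the script below: the computation is prepared once, then called again for different buffers of the same shape and dtype.

import numpy as np
from reikna import cluda
from reikna.fft import FFT

api = cluda.ocl_api()
thr = api.Thread.create()

shape = (256,)
in_dev = thr.to_device(np.zeros(shape, np.complex128))
out_dev = thr.empty_like(in_dev)

# Prepared once for this shape and dtype; the values in the buffers are irrelevant.
fft = FFT(thr).prepare_for(out_dev, in_dev, -1, axes=(0,))

for _ in range(3):
    block = np.random.rand(*shape) + 1j * np.random.rand(*shape)
    new_in = thr.to_device(block)   # a different buffer, same shape and dtype
    fft(out_dev, new_in, -1)        # same prepared computation, no re-preparation
    print(out_dev.get()[:4])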
Is there a forum somewhere where I could ask questions about reikna? (Stack Overflow?)
Since reikna has exactly three users now, including me, you can just write me e-mails :) I'll think about organizing a mailing list or some sort of forum though, just to be optimistic.
Thank you, I understand better now. Just a clarification on your second answer: only the size and dtype count for the arrays, they don't need to be at the same address in memory? I find the fact that you pass arrays to prepare_for misleading, as it gives the impression that it prepares the computation for those particular arrays. Giving the size and dtype should be enough.
Here's a better version:
import numpy as np
from reikna import cluda
from reikna.fft import FFT

def main():
    api = cluda.ocl_api()
    thr = api.Thread.create()

    N = 256
    M = 10000
    data_in = np.random.rand(N) + 1j * np.random.rand(N)
    cl_data_in = thr.to_device(data_in)
    cl_data_out = thr.empty_like(cl_data_in)

    # Prepared once, outside the loop.
    fft = FFT(thr).prepare_for(cl_data_out, cl_data_in, -1, axes=(0,))

    for n in range(M):
        fft(cl_data_out, cl_data_in, -1)
        np.fft.fft(data_in, axis=0)

    print "Done."

if __name__ == "__main__":
    main()
The time for prepare_for can range from 20 ms to 2 s. I guess it's due to the compilation when I change the data size.
I also notice that the GPU FFT is faster than the numpy FFT only for array sizes greater than 4096 (on my computer). It feels slow too, is it normal?
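If re-preparation whenever the data size changes is what hurts, one possible workaround (my own sketch, not a reikna feature) is to cache prepared computations keyed by shape and dtype:

import numpy as np
from reikna import cluda
from reikna.fft import FFT

api = cluda.ocl_api()
thr = api.Thread.create()

_fft_cache = {}

def get_fft(out_dev, in_dev):
    """Return an FFT prepared for these buffers, compiling at most once per (shape, dtype)."""
    key = (tuple(in_dev.shape), np.dtype(in_dev.dtype).str)
    if key not in _fft_cache:
        _fft_cache[key] = FFT(thr).prepare_for(out_dev, in_dev, -1, axes=(0,))
    return _fft_cache[key]

# Usage: the first block of a given size pays the compilation cost, later ones do not.
# fft = get_fft(out_dev, in_dev)
# fft(out_dev, in_dev, -1)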
My mistake, I meant to have np.random.rand(N, N) instead of np.random.rand(N). My expectations were wrong, sorry.
Just a clarification on your second answer: only the size and dtype count for the arrays, they don't need to be at the same address in memory?
No, they don't need to be at the same address; only size and dtype matter.
I find the fact that you pass arrays to prepare_for misleading, as it gives the impression that it prepares the computation for those particular arrays. Giving the size and dtype should be enough.
You do not have to pass arrays per se; any object with shape and dtype attributes will do. The idea was that you would have arrays lying around anyway, so instead of extracting and passing shapes and dtypes (and, in the future, strides) for every one of them, you can just pass the array itself. Basically it is a slight extension of the numpy.empty_like() design.
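For example, a minimal sketch of such a stand-in object (the ShapeDtype class is made up for illustration, it is not part of reikna):

import numpy as np
from reikna import cluda
from reikna.fft import FFT

class ShapeDtype(object):
    """A hypothetical stand-in: carries only shape and dtype, no data buffer."""
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

api = cluda.ocl_api()
thr = api.Thread.create()

template = ShapeDtype((256,), np.complex128)
# Same prepare_for call as with real arrays, assuming the API from the script above:
fft = FFT(thr).prepare_for(template, template, -1, axes=(0,))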
I also notice that the GPU FFT is faster than the numpy FFT only for array sizes greater than 4096 (on my computer). It feels slow too, is it normal?
It is plausible, although I haven't worked with small arrays much (by the nature of my work I usually have 100-200 MB of data at once to FFT), so I don't have much experience. It shouldn't be worse than PyFFT though, because the kernel is pretty much the same, and the Python wrapper around __call__ is thinner.
That said, there may very well be some space for optimization, but I'm mostly focusing on stabilizing the core API now, and leaving computations for later.
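One practical way to reach the regime where the GPU wins is to batch many same-sized blocks into a single array and transform along one axis, instead of issuing thousands of tiny FFTs. A rough sketch, again assuming the prepare_for-style API used above (the sizes are illustrative):

import numpy as np
from reikna import cluda
from reikna.fft import FFT

api = cluda.ocl_api()
thr = api.Thread.create()

N = 256       # length of one block
M = 10000     # number of blocks

blocks = np.random.rand(M, N) + 1j * np.random.rand(M, N)
in_dev = thr.to_device(blocks)
out_dev = thr.empty_like(in_dev)

# One prepared computation, one batched transform along axis 1 only.
fft = FFT(thr).prepare_for(out_dev, in_dev, -1, axes=(1,))
fft(out_dev, in_dev, -1)

# numpy reference for checking the result:
reference = np.fft.fft(blocks, axis=1)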
I guess the problem is resolved. Closing.
In my algorithm I have to apply an FFT to many different blocks (same size but different data). Here is a basic script reproducing the setup:
When profiling with line_profiler, I see that the line including prepare_for takes more than 95% of the computing time:
You can see that it's way slower than the basic numpy fft.
This leads me to some questions: