fjarri / reikna

Pure Python GPGPU library
http://reikna.publicfields.net/
MIT License

prepare_for is slow #8

Closed: Nodd closed this issue 11 years ago

Nodd commented 11 years ago

In my algorithm I have to apply an FFT to many different blocks (same size but different data). Here is a basic script reproducing the setup:

import numpy as np
from reikna import cluda
from reikna.fft import FFT

@profile
def main():
    api = cluda.ocl_api()
    thr = api.Thread.create()

    for n in range(1000):
        data_in = np.random.rand(256) + 1j*np.random.rand(256)
        cl_data_in = thr.to_device(data_in)
        cl_data_out = thr.empty_like(cl_data_in)
        fft_thr = FFT(thr)

        fft = fft_thr.prepare_for(cl_data_out, cl_data_in, -1, axes=(0,))
        fft(cl_data_out, cl_data_in, -1)

        data_out = np.fft.fft(data_in, axis=0)

        if np.max(np.abs(data_out - cl_data_out.get())) > 1e-5:
            print "Wrong result"

    print "Done."

if __name__ == "__main__":
    main()

When profiling with line_profiler, I see that the line calling prepare_for takes more than 95% of the total time:

$ kernprof.py -lv ./test_reinka.py
Done.
Wrote profile results to test_reinka.py.lprof
Timer unit: 1e-06 s

File: ./test_reinka.py
Function: main at line 7
Total time: 22.3035 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     7                                           @profile
     8                                           def main():
     9         1        66307  66307.0      0.3      api = cluda.ocl_api()
    10         1        59075  59075.0      0.3      thr = api.Thread.create()
    11
    12      1001         2064      2.1      0.0      for n in range(1000):
    13      1000        30907     30.9      0.1          data_in = np.random.rand(256) + 1j*np.random.rand(256)
    14      1000       120048    120.0      0.5          cl_data_in = thr.to_device(data_in)
    15      1000        45705     45.7      0.2          cl_data_out = thr.empty_like(cl_data_in)
    16      1000        94386     94.4      0.4          fft_thr = FFT(thr)
    17
    18      1000     21541755  21541.8     96.6          fft = fft_thr.prepare_for(cl_data_out, cl_data_in, -1, axes=(0,))
    19      1000       124706    124.7      0.6          fft(cl_data_out, cl_data_in, -1)
    20
    21      1000        36685     36.7      0.2          data_out = np.fft.fft(data_in, axis=0)
    22
    23      1000       181866    181.9      0.8          if np.max(np.abs(data_out - cl_data_out.get())) > 1e-5:
    24                                                       print "Wrong result"
    25
    26         1           36     36.0      0.0      print "Done."

You can see that it's way slower than the basic numpy fft.

This leads me to some questions:

fjarri commented 11 years ago

Is it a bug, or am I using it wrong?

FFT, unlike other computations in reikna, has a huge template and a lot of helper logic to render it, so 20 ms per preparation (plus the first call to the CUDA compiler, unless you have run this file at least once before) seems plausible.
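
The split is easy to see if you time the two stages separately. Something along these lines (same calls as in your script; the exact numbers will of course vary, and the very first run also pays the compiler warm-up):

import time

import numpy as np
from reikna import cluda
from reikna.fft import FFT

api = cluda.ocl_api()
thr = api.Thread.create()

data_in = np.random.rand(256) + 1j*np.random.rand(256)
cl_data_in = thr.to_device(data_in)
cl_data_out = thr.empty_like(cl_data_in)

t0 = time.time()
fft = FFT(thr).prepare_for(cl_data_out, cl_data_in, -1, axes=(0,))
t1 = time.time()
fft(cl_data_out, cl_data_in, -1)
cl_data_out.get()  # pull the result back, which also waits for the kernel to finish
t2 = time.time()

print "prepare_for: %.3f s, one call + readback: %.3f s" % (t1 - t0, t2 - t1)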

Do I have to create a new computation if the sizes of the arrays are the same, but the data is different?

No, if the arrays (their shape and dtype, not their data) stay the same, the computation is intended to be reused. It behaves pretty much the same way as the old PyFFT plan.

Is there a forum somewhere where I could ask questions about reikna? (Stack Overflow?)

Since reikna has exactly three users now, including me, you can just write me e-mails :) I'll think about organizing a mailing list or some sort of forum though, just to be optimistic.

Nodd commented 11 years ago

Thank you, I understand better now. Just a clarification on your second answer: only the size and dtype count for the arrays, they don't need to be at the same address in memory? I find the fact that you pass arrays to prepare_for misleading, as it gives the impression that it prepares the computation for those particular arrays. Giving the size and dtype should be enough.

Nodd commented 11 years ago

Here's a better version:

import numpy as np
from reikna import cluda
from reikna.fft import FFT

def main():
    api = cluda.ocl_api()
    thr = api.Thread.create()

    N = 256
    M = 10000

    data_in = np.random.rand(N) + 1j*np.random.rand(N)
    cl_data_in = thr.to_device(data_in)
    cl_data_out = thr.empty_like(cl_data_in)
    fft = FFT(thr).prepare_for(cl_data_out, cl_data_in, -1, axes=(0,))

    for n in range(M):
        fft(cl_data_out, cl_data_in, -1)
        np.fft.fft(data_in, axis=0)  # CPU reference, for timing comparison only

    print "Done."

The time for prepare_for can range from 20 ms to 2 s. I guess it's due to recompilation when I change the data size.
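
One way around that, I guess, is to cache one prepared FFT per (shape, dtype), so each block size is only compiled once. A rough sketch (the cache dict and helper below are just for illustration, not something from reikna):

from reikna.fft import FFT

# Cache of prepared FFTs keyed by (shape, dtype): each distinct block size
# pays the preparation/compilation cost only once.
_fft_cache = {}

def get_prepared_fft(thr, out_arr, in_arr):
    key = (in_arr.shape, in_arr.dtype)
    if key not in _fft_cache:
        _fft_cache[key] = FFT(thr).prepare_for(out_arr, in_arr, -1, axes=(0,))
    return _fft_cache[key]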

I also notice that the GPU FFT is faster than the numpy FFT only for array sizes greater than 4096 (on my computer). It feels slow too, is that normal?

Nodd commented 11 years ago

My mistake, I meant to have np.random.rand(N, N) instead of np.random.rand(N). My expectations were wrong, sorry.

fjarri commented 11 years ago

Just a clarification on your second answer: only the size and dtype count for the arrays, they don't need to be at the same address in memory?

No, they don't need to be at the same address; only the size and dtype matter.

I find the fact that you pass arrays to prepare_for misleading, as it gives the impression that it prepares the computation for those particular arrays. Giving the size and dtype should be enough.

You do not have to pass arrays per se; any object with shape and dtype attributes will do. The idea was that you would have arrays lying around anyway, so instead of extracting and passing shapes and dtypes (and, in the future, strides) for every one of them, you can just pass the array itself. Basically it is a slight extension of the numpy.empty_like() design.
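
For example, a throwaway stand-in like this would do (just a sketch, reusing thr and FFT from the snippets above; the ArrayLike class is not part of reikna, it only carries the two attributes prepare_for reads):

import numpy as np

class ArrayLike(object):
    # Minimal template object: prepare_for only needs .shape and .dtype,
    # so no host or device allocation is required at preparation time.
    def __init__(self, shape, dtype):
        self.shape = shape
        self.dtype = np.dtype(dtype)

template = ArrayLike((256,), np.complex128)
fft = FFT(thr).prepare_for(template, template, -1, axes=(0,))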

I also notice that the GPU FFT is faster than the numpy FFT only for array sizes greater than 4096 (on my computer). It feels slow too, is that normal?

It is plausible, although I haven't worked with small arrays much (by the nature of my work I usually have 100-200 MB of data to FFT at once), so I don't have much experience there. It shouldn't be worse than PyFFT though, because the kernel is pretty much the same, and the Python wrapper around __call__ is thinner.

That said, there may very well be some room for optimization, but I'm mostly focusing on stabilizing the core API now and leaving the computations for later.

fjarri commented 11 years ago

I guess the problem is resolved. Closing.