fjarri / reikna

Pure Python GPGPU library
http://reikna.publicfields.net/
MIT License

2d FFT with fast_math: roundtrip fails on GT 750M #28

Closed maweigert closed 7 years ago

maweigert commented 7 years ago

Hi Bogdan,

I was porting some FFT-based code from pyfft to reikna and ran into inaccuracies in the FFT calculations with fast_math, depending on the hardware I am using.

I did the following simple roundtrip comparison

https://gist.github.com/maweigert/0bb5d16b3bb9a3d0659c7d48ee8fd32a

and got very different behaviour depending on the GPU:

Iris Pro:           0.0001746
GeForce GT 750M:        1.1197116

While pyfft on the same input (and fast_math = True) gave

GeForce GT 750M:        0.0000043

So it seems not to be GPU but reikna specific.

Did you ever see something similar, or can you reproduce this?

Cheers and thanks for the package!

M

fjarri commented 7 years ago

Thanks for the report. That's strange, I cannot reproduce it on my old laptop with a GeForce 9400M (I get an error of 9e-7 there). I will have access to my Mac with a GF750M next week when I'm at my office, so I can try it out too. Could you tell me which version/revision of reikna you are using?

Also, try the following reikna-only code:

from __future__ import print_function
import numpy as np
from reikna import cluda
from reikna.fft import FFT

dshape = (128,)*2
np.random.seed(0)
input = (np.random.uniform(-1,1,dshape)).astype(np.complex64)

thr = cluda.ocl_api().Thread.create(interactive=True)
buf_g = thr.to_device(input)
fft = FFT(buf_g).compile(thr, fast_math = True)

fft(buf_g, buf_g)
fft(buf_g, buf_g, inverse = True)

output = buf_g.get()

print("{}:\t\t{}".format(thr._device.name, np.amax(np.abs(input-output))))
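
For comparison, the same roundtrip metric can be computed on the CPU with NumPy alone, which gives a rough idea of the error a healthy single-precision roundtrip should produce. This is only a sketch: the function name `roundtrip_error` is made up for illustration, and NumPy may compute the transform internally in higher precision than a GPU kernel would, so the number is a baseline, not an exact target.

```python
import numpy as np


def roundtrip_error(shape=(128, 128), seed=0):
    """Max abs difference between an array and ifft2(fft2(array))."""
    rng = np.random.RandomState(seed)
    data = rng.uniform(-1, 1, shape).astype(np.complex64)

    # Cast back to complex64 after each step so that the intermediate
    # results are stored in single precision, as they are in the GPU buffer.
    # (Depending on the NumPy version, the transform itself may still be
    # carried out in double precision internally.)
    spectrum = np.fft.fft2(data).astype(np.complex64)
    restored = np.fft.ifft2(spectrum).astype(np.complex64)

    return float(np.amax(np.abs(data - restored)))


print("numpy baseline:\t\t{}".format(roundtrip_error()))
```

An error many orders of magnitude above this baseline on the GPU points at the transform itself rather than at ordinary single-precision rounding.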

Also, could you do some more tests?

maweigert commented 7 years ago

Thanks for looking into that!

fjarri commented 7 years ago

Yes, it is quite strange. Removing the native_cos()/native_sin() usage pretty much negates any performance benefit from fast_math=True, so I would rather not do that.

I suspect there may be some bug in Apple's OpenCL driver (I have found several over the years myself). It is usually some kind of strange interplay between the exact GPU operations invoked and the global/local size. My general approach in such cases is to isolate the offending kernel and start removing parts until I end up with something that reproduces the bug and is small enough to open an issue in Apple's tracker. It is quite a lengthy process, though, and I completely understand if you don't want to go through it.
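
As a starting point for that kind of isolation, a stripped-down kernel that only compares native_sin() against sin() could look roughly like the sketch below. It assumes reikna with the OpenCL backend is installed; the kernel name `native_vs_precise` and the helper functions are made up for illustration and are not part of reikna. Since native_sin() has implementation-defined precision, a small difference is expected; only a grossly wrong result would point at the driver.

```python
import numpy as np

# Minimal kernel: the same angles go through native_sin() and sin().
KERNEL_SRC = """
KERNEL void native_vs_precise(
    GLOBAL_MEM float *native_out,
    GLOBAL_MEM float *precise_out,
    GLOBAL_MEM const float *angles)
{
    const SIZE_T i = get_global_id(0);
    native_out[i] = native_sin(angles[i]);
    precise_out[i] = sin(angles[i]);
}
"""


def make_angles(n):
    """Evenly spaced single-precision angles in [-2*pi, 2*pi]."""
    return np.linspace(-2 * np.pi, 2 * np.pi, n).astype(np.float32)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference, computed in float64."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.amax(np.abs(a - b)))


def run_on_device(n=1024, local_size=64):
    """Run the kernel with the given sizes and report the discrepancy."""
    # Imported here so the helpers above stay usable without an OpenCL device.
    from reikna import cluda

    thr = cluda.ocl_api().Thread.create()
    angles_dev = thr.to_device(make_angles(n))
    native_dev = thr.empty_like(angles_dev)
    precise_dev = thr.empty_like(angles_dev)

    kernel = thr.compile(KERNEL_SRC).native_vs_precise
    kernel(native_dev, precise_dev, angles_dev,
           global_size=n, local_size=local_size)

    return max_abs_diff(native_dev.get(), precise_dev.get())
```

Calling `run_on_device()` with a few different `local_size` values (for example 32, 64, 128) would show whether the discrepancy depends on the work-group size.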

I have tested the code on OSX 10.11.3, and could not reproduce the bug, but it was a FirePro video card, so the local sizes used could be different. Could you do several more things:

  1. Comment out the #if block in cluda/kernel.mako starting from #if defined(cl_khr_fp64). This seems to be one of the differences from pyfft, which only enables it when the array has a double-precision datatype.

  2. Check and tell me which global/local sizes reikna and pyfft use (let's say for the smallest array when the bug is reproduced, that is the 1D one with 1024 elements). For the reikna code, add the following lines:

    for call in fft._kernel_calls:
        print(call._kernel.global_size, call._kernel.local_size)

    For pyfft code (add after the actual call, since the kernels are created on the first invocation):

    for k in plan._kernels:
        print(k._func_forward._global_size, k._func_forward._block_size)

maweigert commented 7 years ago

> I suspect there may be some bug in Apple's OpenCL driver (I have found several over the years myself)

Indeed, that was it!! After installing the Nvidia Web drivers (346.03.15f02) everything was fine again. So it seems the default drivers on El Capitan (310.42.25f01) have a bug in the native_sin/cos functions.

Thanks for your help!
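
To illustrate why a faulty native_sin/cos shows up so clearly in an FFT roundtrip, here is a small CPU-only sketch: a textbook recursive radix-2 FFT whose twiddle factors come from a pluggable sin/cos function. It is not reikna's or pyfft's algorithm, and the "sloppy" sin/cos below is an invented perturbation purely for demonstration; how exactly the driver's functions misbehave is not characterized in this thread.

```python
import numpy as np


def fft_radix2(x, sincos, inverse=False):
    """Recursive decimation-in-time FFT (length must be a power of two).

    `sincos(angles)` returns (sin, cos) and supplies the twiddle factors.
    The inverse transform is left unnormalized.
    """
    n = len(x)
    if n == 1:
        return x.copy()
    even = fft_radix2(x[0::2], sincos, inverse)
    odd = fft_radix2(x[1::2], sincos, inverse)
    sign = 1.0 if inverse else -1.0
    angles = sign * 2.0 * np.pi * np.arange(n // 2) / n
    s, c = sincos(angles)
    t = (c + 1j * s) * odd
    return np.concatenate([even + t, even - t])


def accurate_sincos(angles):
    return np.sin(angles), np.cos(angles)


def make_sloppy_sincos(rel_error):
    """Hypothetical inaccurate sin/cos: values skewed by a relative error."""
    def sloppy(angles):
        return np.sin(angles) * (1 + rel_error), np.cos(angles) * (1 - rel_error)
    return sloppy


def roundtrip_error(sincos, n=1024, seed=0):
    """Max abs difference after a forward and a normalized inverse FFT."""
    rng = np.random.RandomState(seed)
    x = rng.uniform(-1, 1, n).astype(np.complex128)
    y = fft_radix2(x, sincos)
    z = fft_radix2(y, sincos, inverse=True) / n
    return float(np.amax(np.abs(x - z)))


print("accurate sin/cos:", roundtrip_error(accurate_sincos))
print("sloppy sin/cos:  ", roundtrip_error(make_sloppy_sincos(1e-2)))
```

Because every butterfly stage multiplies by a twiddle factor, an error in sin/cos feeds into all stages of both the forward and the inverse transform, so the roundtrip error ends up far above normal rounding noise.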