DTolm / VkFFT

Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library
MIT License

Intel HD graphics issues (C2C launch failure and R2C calculation issue) #50

Open vincefn opened 2 years ago

vincefn commented 2 years ago

Hi @DTolm, as promised, here is a report on issues with Intel HD graphics. These may be caused by the Intel GPUs themselves (which also drive the display) rather than by VkFFT, but I guess it's good to document them as an issue.

The first test was a systematic accuracy test of 1D, single-precision, C2C radix-2&3 transforms up to N=2**18.

Checking the launch result code, I get VKFFT_ERROR_FAILED_TO_LAUNCH_KERNEL (4039), with or without a LUT.
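For reference, the set of sizes covered by such a radix-2&3 scan can be enumerated as below. This is only a minimal sketch: the helper name `radix_sizes` is hypothetical and is not part of pyvkfft or VkFFT, and the actual test suite may select its sizes differently.

```python
def radix_sizes(max_n, radices=(2, 3)):
    """Return the sorted sizes <= max_n whose only prime factors are in radices."""
    sizes = {1}
    frontier = [1]
    while frontier:
        n = frontier.pop()
        for r in radices:
            m = n * r
            if m <= max_n and m not in sizes:
                sizes.add(m)
                frontier.append(m)
    # Drop the trivial size 1
    return sorted(s for s in sizes if s > 1)


sizes = radix_sizes(2 ** 18)
print(len(sizes), "sizes, from", sizes[0], "to", sizes[-1])
```

Sizes such as 55296 (2**11 * 3**3) and 248832 (2**10 * 3**5) belong to this set.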

The second test used a 1D R2C transform on a 3D real array of size (32, 32, 32+2), using a LUT. On the Iris Graphics 6100 machine, the following data is obtained after an R2C+C2R round trip, looking either at the average or max difference along the z dimension, or at some specific layers: [image]
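The two extra values along the last axis are presumably the padding needed for an in-place R2C transform: n real values become n/2+1 complex values, which occupy n+2 floats. A NumPy-only sketch of that layout, assuming a C-contiguous float32 buffer (no GPU involved, and `real`, `cplx`, `ref` are illustrative names):

```python
import numpy as np

n = 32
# Real array with the last axis padded from n to n + 2 floats
real = np.zeros((n, n, n + 2), dtype=np.float32)

# The same buffer viewed as complex64 holds n // 2 + 1 values on the last axis
cplx = real.view(np.complex64)

# Half-Hermitian result shape of a 1D R2C transform on the unpadded data
ref = np.fft.rfft(real[..., :n], axis=-1)

print(real.shape, cplx.shape, ref.shape)
```

This is also why the comparisons in the script further down slice with `[..., :-2]` before comparing against the input.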

When the calculation is repeated, the errors change (the original array max is around 11, so the above differences are really very high).
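For comparison, a correct round trip should return the same small error on every repeat. A minimal CPU-only sketch of such a repeatability check, using NumPy's `rfft`/`irfft` as a stand-in for the GPU transform (the helpers `roundtrip_diff` and `numpy_roundtrip` are hypothetical and not part of pyvkfft):

```python
import numpy as np


def roundtrip_diff(d0, roundtrip, n_repeat=3):
    """Max abs difference to the input for each repeat of roundtrip(d0)."""
    return [float(np.abs(roundtrip(d0) - d0).max()) for _ in range(n_repeat)]


def numpy_roundtrip(a):
    # CPU stand-in for app.fft + app.ifft: 1D R2C + C2R along the last axis
    n = a.shape[-1]
    return np.fft.irfft(np.fft.rfft(a, axis=-1), n=n, axis=-1).astype(a.dtype)


rng = np.random.default_rng(0)
d0 = rng.uniform(0, 1, (32, 32, 32)).astype(np.float32)
errs = roundtrip_diff(d0, numpy_roundtrip)
print(errs)
```

Passing a function that wraps the GPU round trip instead of `numpy_roundtrip` would show whether the errors vary from run to run, as reported here.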

On other GPUs (CUDA) I have no accuracy issues, but with these Intel graphics, pyvkfft's R2C unit tests fail.

For the R2C case I do not get a launch error. The code used for the figure is:

import pyopencl as cl
import pyopencl.array as cla
import pyvkfft.opencl
from pyvkfft.opencl import VkFFTApp
import matplotlib.pyplot as plt
import numpy as np

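# display and HTML are exposed by IPython.display (importing them from IPython.core.display is deprecated)
from IPython.display import display, HTML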
display(HTML("<style>.container { width:100% !important; }</style>"))

# Create some context on the first available GPU
# Find the first OpenCL GPU available and use it, unless
ctx = None  # defined up front so the check below cannot raise NameError
for p in cl.get_platforms():
    for d in p.get_devices():
        if d.type & cl.device_type.GPU == 0:
            continue
        print("Selected device: ", d.name)
        ctx = cl.Context(devices=(d,))
        break
    if ctx is not None:
        break
cq = cl.CommandQueue(ctx)

dims = 3
ndim = 1
norm = 1
n = 32
sh = [n] * dims
for i in range(ndim, dims):
    sh[-i-1] = n
sh[-1] += 2
d0 = np.random.uniform(0, 1, sh)
# A pure random array may not be a very good test (too random),
# so add a Gaussian
xx = [np.fft.fftshift(np.fft.fftfreq(nx)) for nx in sh]
v = np.zeros_like(d0)
for x in np.meshgrid(*xx, indexing='ij'):
    v += x ** 2
d0 += 10 * np.exp(-v * 2)

d = cla.to_device(cq, d0.astype(np.float32))
app = VkFFTApp(d.shape, d.dtype, queue=cq, ndim=ndim, norm=norm, r2c=True, useLUT=True)

d = app.fft(d) * app.get_fft_scale()

d = app.ifft(d) * app.get_ifft_scale()

plt.figure(figsize=(13,3))
plt.subplot(141)
plt.imshow(abs(d.get() - d0)[...,:-2].mean(axis=0))
plt.title("3D R2C+C2R diff 2D mean")
plt.colorbar()
plt.subplot(142)
plt.imshow(abs(d.get() - d0)[...,:-2].max(axis=0))
plt.title("3D R2C+C2R diff 2D max")
plt.colorbar()
plt.subplot(143)
plt.imshow(abs(d.get() - d0)[0,:,:-2])
plt.title("3D R2C+C2R diff z=0")
plt.colorbar()
plt.subplot(144)
plt.imshow(abs(d.get() - d0)[-1,:,:-2])
plt.title("3D R2C+C2R diff z=-1")  # index -1 is the last layer, not z=1
plt.colorbar()

print((abs(d0[...,:-2])**2).sum(), (abs(d.get()[...,:-2])**2).sum())
plt.tight_layout()

DTolm commented 2 years ago

Hello,

The first issue should be resolved (at least for sizes N=55296 and 248832).

I could not confirm the second issue. Here are the results I get on a UHD610 when running your script; they seem to be within the acceptable range. [image: Figure_1]

vincefn commented 2 years ago

Hi @DTolm, this is just to confirm that the issue seems fixed on HD Graphics 5000 and UHD630 in my tests, but not on an Iris Graphics 6100. That GPU is much less powerful and not so interesting for computing, so it's not a huge issue. It could be an Iris bug as well.