CUDA_ERROR_ILLEGAL_ADDRESS with Genred

Hello there,

I am trying to implement the non-uniform Fourier transform on the GPU using PyKeOps.

Here is a minimal (not) working example:


import pykeops
import numpy as np
from pykeops.numpy import Genred

pykeops.clean_pykeops()          # just in case old build files are still present
np.random.seed(0)

N = 64
M = 100
shape = (N, N, N )

# Generate non uniform points in [-0.5, 0.5]
samples = np.random.rand(M, len(shape))
samples /= samples.max()
samples -= 0.5

samples *= 2 * np.pi  # Scaling to match exp(-i*(2*pi*k)*x) 

# Generate Uniform points (image voxel grid)
# final shape is (N*N*N, 3): one row of (i, j, k) indices per voxel
locs = np.ascontiguousarray(np.array(np.meshgrid(*[np.arange(s) for s in shape],indexing="ij")).reshape(len(shape), -1).T.astype(np.float32))

# Create Fake Image data
image = np.random.randn(*shape) + 1j* np.random.randn(*shape)
image = image.astype(np.complex64)

# Create Fake Coeffs data (in k-space)
kspace = np.random.randn(M) + 1j * np.random.randn(M)
kspace = kspace.astype(np.complex64)

# Initialize pykeops kernels

variables = ["x_j = Vj(1,{dim})","nu_i = Vi(0,{dim})",  "b_j = Vj(2,2)"]
aliases = [ s.format(dim=len(shape)) for s in variables]
forward_op = Genred(
    "ComplexMult(ComplexExp1j(- nu_i | x_j),  b_j)",
    aliases,
    reduction_op="Sum",
    axis=1,
)
# Calls Forward
coeff_cpu = forward_op(samples, locs, image.reshape(-1,1), backend="CPU").view(np.complex64)

coeff_gpu = forward_op(samples, locs, image.reshape(-1,1), backend="GPU").view(np.complex64)

assert np.allclose(coeff_gpu, coeff_cpu)
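
For reference, this is the sum that the formula string and aliases above appear to encode, written out as an equation. It is a reading of the script, not a statement from the KeOps documentation: the index i runs over the M sample points nu_i, j runs over the N^3 voxels x_j, and b_j is the complex image value stored as a (real, imaginary) pair.

```latex
c_i \;=\; \sum_{j=1}^{N^3} \exp\!\bigl(-\mathrm{i}\,\langle \nu_i, x_j \rangle\bigr)\, b_j ,
\qquad i = 1, \dots, M
```

With axis=1 the reduction runs over j, so the output has one complex value per sample, returned as M rows of two real numbers.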

Whenever I run this with the GPU backend, I get the following traceback:

python test_pykeops.py 
[KeOps] /home/pc266769/.cache/keops2.1.2/Linux_is245233_6.2.0-31-generic_p3.10.12 has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
[KeOps] Generating code for formula Sum_Reduction(ComplexMult(ComplexExp1j(-Var(0,3,0)|Var(1,3,1)),Var(2,2,1)),0) ... OK
[pyKeOps] Compiling pykeops cpp df83e61aa0 module ... OK
(100, 2)
[KeOps] Generating code for formula Sum_Reduction(ComplexMult(ComplexExp1j(-Var(0,3,0)|Var(1,3,1)),Var(2,2,1)),0) ... OK

[KeOps] error: cuCtxSynchronize() failed with error CUDA_ERROR_ILLEGAL_ADDRESS

Traceback (most recent call last):
  File "/home/pc266769/gits/nsp/scripts/test_pykeops.py", line 62, in <module>
    coeff_gpu = forward_op(samples, locs, image.reshape(-1,1), backend="GPU").view(np.complex64)
  File "/volatile/pierre-antoine/.pyenv/versions/fmri/lib/python3.10/site-packages/pykeops/numpy/generic/generic_red.py", line 347, in __call__
    out = self.myconv.genred_numpy(-1, ranges, nx, ny, nbatchdims, out, *args)
  File "/volatile/pierre-antoine/.pyenv/versions/fmri/lib/python3.10/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 230, in genred
    self.call_keops(nx, ny)
  File "/volatile/pierre-antoine/.pyenv/versions/fmri/lib/python3.10/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 42, in call_keops
    self.launch_keops(
RuntimeError: [KeOps] Cuda error.

[KeOps] error: cuMemFree(buffer) failed with error CUDA_ERROR_ILLEGAL_ADDRESS

My GPU has 16 GB of VRAM, and given the small sizes I am using, I would not expect to run out of memory.

Am I doing something wrong with Genred? Could the complex arrays be the culprit?

Thanks in advance!

Hello @paquiteau, sorry for the late answer. I tried your script, and two things need to change to make it work:

First, your first input, samples, was never converted to float32; it was still float64, so you need to convert it.

Second, you must pass only real tensors when using the Genred syntax, so you need to convert the image tensor to a real type. Here is the updated last part of your script:

samples = samples.astype(np.float32)
image = image.flatten().view("(2,)float32")
# Calls Forward
coeff_cpu = forward_op(samples, locs, image, backend="CPU").view(np.complex64)
coeff_gpu = forward_op(samples, locs, image, backend="GPU").view(np.complex64)
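
To make the two conversions above explicit, here is a small NumPy-only sketch. The helper names to_keops_real and from_keops_real are hypothetical (they are not part of PyKeOps); the assumption is only that every array passed to Genred should be C-contiguous float32, with complex values laid out as (real, imaginary) pairs to match an alias such as "b_j = Vj(2,2)".

```python
import numpy as np

def to_keops_real(a):
    """Return a C-contiguous float32 array.

    Complex input is flattened and viewed as (n, 2) rows of (real, imag).
    Hypothetical helper, not a PyKeOps function.
    """
    a = np.asarray(a)
    if np.iscomplexobj(a):
        flat = np.ascontiguousarray(a, dtype=np.complex64).reshape(-1)
        return flat.view(np.float32).reshape(-1, 2)
    return np.ascontiguousarray(a, dtype=np.float32)

def from_keops_real(r):
    """Inverse of the complex branch: (n, 2) float32 -> (n,) complex64."""
    r = np.ascontiguousarray(r, dtype=np.float32)
    return r.view(np.complex64).reshape(-1)

# np.random.rand returns float64, which is what the original script passed in.
samples = np.random.rand(4, 3)
samples32 = to_keops_real(samples)

z = np.array([1 + 2j, 3 - 4j], dtype=np.complex64)
pairs = to_keops_real(z)       # rows are (real, imag)
back = from_keops_real(pairs)  # round trip to complex64
```

The complex branch should produce the same layout as image.flatten().view("(2,)float32") in the snippet above; it is just spelled out step by step.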

Now, there is another issue with accuracy. When testing, you will see that it runs, but the np.allclose assertion will fail at the end. This is due to accuracy errors when using float32. If you compute the absolute and relative errors as follows:

print(np.linalg.norm(coeff_gpu-coeff_cpu))
print(np.linalg.norm(coeff_gpu-coeff_cpu)/np.linalg.norm(coeff_cpu))

it will give something around 5e-2 for the absolute error, and around 5e-6 for the relative error. The "normal" loss of accuracy should be around 10 times smaller (I checked this by randomly permuting the input data). This poor accuracy is due to the use of the use_fast_math CUDA option, as pointed out in this other issue. We will try to address this accuracy problem soon.
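
As a rough illustration of why the default np.allclose tolerances (rtol=1e-5, atol=1e-8) are tight for this kind of float32 sum, here is a NumPy-only sketch that evaluates the same direct sum in float64 and in float32 and compares them. It does not use KeOps and says nothing about use_fast_math; the sizes, the direct_sum helper and the chosen tolerance are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, dim = 20, 8, 3  # much smaller than the script above, to keep this cheap

# Non-uniform frequencies in [-pi, pi) and integer voxel coordinates.
nu = (rng.random((M, dim)) - 0.5) * 2 * np.pi
grid = np.meshgrid(*[np.arange(N)] * dim, indexing="ij")
x = np.stack(grid, axis=-1).reshape(-1, dim).astype(np.float64)
b = rng.standard_normal(N**dim) + 1j * rng.standard_normal(N**dim)

def direct_sum(nu, x, b, real_dtype, complex_dtype):
    """Evaluate c_i = sum_j exp(-1j * <nu_i, x_j>) * b_j at a given precision."""
    nu = nu.astype(real_dtype)
    x = x.astype(real_dtype)
    b = b.astype(complex_dtype)
    phase = -(nu @ x.T)  # shape (M, N**dim)
    kernel = np.exp(1j * phase).astype(complex_dtype)
    return kernel @ b

ref = direct_sum(nu, x, b, np.float64, np.complex128)  # double-precision reference
low = direct_sum(nu, x, b, np.float32, np.complex64)   # single precision

abs_err = np.linalg.norm(low - ref)
rel_err = abs_err / np.linalg.norm(ref)
print(abs_err, rel_err)

# A comparison scaled to the magnitude of the result, instead of the defaults.
close = np.allclose(low, ref, rtol=1e-3, atol=1e-4 * np.abs(ref).max())
```

When comparing the CPU and GPU outputs of the script, passing an explicit rtol and atol in the same way is one option until the accuracy issue is resolved; which tolerance is acceptable depends on the application.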

getkeops / keops

CUDA_ERROR_ILLEGAL_ADDRESS with Genred #328