pykeops 2.0 Pytorch backward pass error: "AttributeError: 'Var' object has no attribute 'assign'"

albertfgu commented 2 years ago

Here is a minimal example file of a simple Genred reduction. The forward pass works, but the backward pass raises an error in pykeops version 2.0. This previously worked fine in pykeops version 1.5.

import torch
import pykeops
from pykeops.torch import Genred

def cauchy(v, z, w):
    expr = 'ComplexDivide(v, z-w)'
    cauchy_mult = Genred(
        expr,
        [
            'v = Vj(2)',
            'z = Vi(2)',
            'w = Vj(2)',
        ],
        reduction_op='Sum',
        axis=1,
    )

    v = torch.view_as_real(v)
    z = torch.view_as_real(z)
    w = torch.view_as_real(w)

    r = cauchy_mult(v, z, w, backend='GPU')
    return torch.view_as_complex(r)

def convert_data(*tensors, device='cuda'):
    """ Prepare tensors for backwards pass """
    tensors = tuple(t.to(device) for t in tensors)
    for t in tensors:
        if t.is_leaf: t.requires_grad = True
        t.retain_grad()
    return tensors

def data(B, N, L):
    w = torch.randn(B, N, dtype=torch.cfloat)
    v = torch.randn(B, N, dtype=torch.cfloat)
    z = torch.randn(B, L, dtype=torch.cfloat)

    w, v, z = convert_data(w, v, z)
    return w, v, z

def test():
    B = 4
    N = 64
    L = 256
    w, v, z = data(B, N, L)

    y = cauchy(v, z, w)
    print("output", y.shape, y.dtype)

    grad = torch.randn_like(y)
    y.backward(grad, retain_graph=True) # Errors here!

if __name__ == '__main__':
    test()

joanglaunes commented 2 years ago

Hello @albertfgu, this error should now be fixed in the main branch. It was a bug in the code for the gradient of the complex multiplication class. If you can give it a try, please let us know whether it is fixed on your side. Also, do not hesitate to give us feedback on using operations on complex-valued data with KeOps.
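One way to check the fix is to compare the KeOps output and gradients against a dense PyTorch reference. The snippet below is only a sketch: `cauchy_reference` is a hypothetical helper name, the sizes are arbitrary, and it assumes that the reduction in the example computes y[b, l] = sum_n v[b, n] / (z[b, l] - w[b, n]). It runs on CPU and does not call pykeops.

```python
import torch


def cauchy_reference(v, z, w):
    """Dense reference: y[b, l] = sum_n v[b, n] / (z[b, l] - w[b, n])."""
    # Broadcast v and w to (B, 1, N) and z to (B, L, 1), then sum over n.
    return (v.unsqueeze(-2) / (z.unsqueeze(-1) - w.unsqueeze(-2))).sum(dim=-1)


torch.manual_seed(0)
B, N, L = 2, 3, 5  # small, arbitrary sizes for a quick check
w = torch.randn(B, N, dtype=torch.cfloat, requires_grad=True)
v = torch.randn(B, N, dtype=torch.cfloat, requires_grad=True)
z = torch.randn(B, L, dtype=torch.cfloat, requires_grad=True)

y_ref = cauchy_reference(v, z, w)
grad = torch.randn_like(y_ref)
y_ref.backward(grad)  # populates v.grad, z.grad and w.grad
```

Running the KeOps version on the same inputs with the same `grad` and comparing `y` and the `.grad` fields with `torch.allclose` should show whether the backward pass agrees.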

albertfgu commented 2 years ago

Thanks for updating this! Overall, KeOps has been handling complex numbers quite seamlessly. The documentation seems slightly out of date: I was looking for ComplexExp and didn't see it listed at https://www.kernel-operations.io/keops/api/math-operations.html, but it turns out it exists anyway!

albertfgu commented 2 years ago

I have another issue that I don't recall having the last time I checked pykeops version 2.0. In one of my environments, pykeops==1.5 works fine, but pykeops==2.0 gives this:

❯ python
Python 3.8.12 (default, Oct 12 2021, 13:49:34)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pykeops
>>> pykeops.clean_pykeops()
[KeOps] /dfs/scratch1/albertgu/.cache/keops2.0/build has been cleaned.
[KeOps] Compiling cuda jit compiler engine ... OK
[pyKeOps] Compiling nvrtc binder for python ... OK
>>> pykeops.test_numpy_bindings()
[KeOps] Generating code for formula Sum_Reduction((Var(0,3,0)-Var(1,3,1))|(Var(0,3,0)-Var(1,3,1)),1) ... OK

[KeOps] error: cuModuleLoadDataEx(&module, target, 0, NULL, NULL) failed with error CUDA_ERROR_INVALID_PTX

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/numpy/test_install.py", line 20, in test_numpy_bindings
    if np.allclose(my_conv(x, y).flatten(), expected_res):
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/numpy/generic/generic_red.py", line 303, in __call__
    self.myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/keopscore/utils/Cache.py", line 68, in __call__
    obj = self.cls(*args)
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 15, in __init__
    super().__init__(*args, fast_init=fast_init)
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps.py", line 31, in __init__
    self.init_phase2()
  File "/dfs/scratch1/albertgu/anaconda3/envs/hippo/lib/python3.8/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py", line 23, in init_phase2
    self.launch_keops = pykeops_nvrtc.KeOps_module_float(
RuntimeError: [KeOps] Cuda error.

Does pykeops 2.0 require a specific CUDA version? This environment is on CUDA 10.2; my other environment, with CUDA 11.1, works fine.

jeanfeydy commented 1 year ago

Hi @albertfgu, Apologies for the delayed answer - I had missed your reply and am reviewing all pending issues just now. In theory, you should have no problem with CUDA 10.2 - I used this version for quite a long time with KeOps v2.0.

I see several hypotheses for your issue, one of them being that your CUDA 10.2 folder does not contain the development headers. Notably, KeOps expects to find nvrtc.h and cuda.h in $CUDA_PATH/include. If this fails, we also try:

/opt/cuda/include/, /opt/cuda/targets/x86_64-linux/include/,
/usr/local/cuda/include/, /usr/local/cuda/targets/x86_64-linux/include/,
/usr/local/cuda-10.2/include/, /usr/local/cuda-10.2/targets/x86_64-linux/include/.

(The code for this is available here.) Does your CUDA 10.2 installation contain these files?
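To answer that question quickly, the search order described above can be reproduced with a few lines of Python. This is a rough sketch rather than the actual KeOps lookup code: `find_cuda_include_dir` and `default_candidates` are hypothetical names, and the directory list simply copies the paths quoted above.

```python
import os


def find_cuda_include_dir(candidates, headers=("nvrtc.h", "cuda.h")):
    """Return the first directory that contains every header, or None."""
    for directory in candidates:
        if directory and all(
            os.path.isfile(os.path.join(directory, h)) for h in headers
        ):
            return directory
    return None


def default_candidates():
    """Directories quoted above, with $CUDA_PATH/include first if it is set."""
    candidates = []
    cuda_path = os.environ.get("CUDA_PATH")
    if cuda_path:
        candidates.append(os.path.join(cuda_path, "include"))
    candidates += [
        "/opt/cuda/include/",
        "/opt/cuda/targets/x86_64-linux/include/",
        "/usr/local/cuda/include/",
        "/usr/local/cuda/targets/x86_64-linux/include/",
        "/usr/local/cuda-10.2/include/",
        "/usr/local/cuda-10.2/targets/x86_64-linux/include/",
    ]
    return candidates


print("CUDA headers found in:", find_cuda_include_dir(default_candidates()))
```

If this prints `None`, neither `$CUDA_PATH/include` nor any of the fallback folders contains both headers.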

Alternatively, there may be a mismatch between the different versions of CUDA that are present on your system: the PTX in CUDA_ERROR_INVALID_PTX refers to the intermediate representation that is used by the CUDA compiler. What may be happening here is that KeOps somehow used your CUDA v11.1 compiler to produce the PTX, and then used CUDA v10.2 to compile or load it, resulting in this error.
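A quick way to look for such a mismatch is to see which toolkit comes first on your PATH. The sketch below is only a hint, not a definitive diagnosis: it reads the release reported by `nvcc --version`, while KeOps v2.0 compiles through nvrtc, so the two may still differ. The helper names `parse_nvcc_release` and `nvcc_release` are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'X.Y' from the 'release X.Y' token in `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def nvcc_release():
    """Release of the first nvcc found on PATH, or None if there is none."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


print("nvcc on PATH reports CUDA release:", nvcc_release())
```

Comparing this value with `torch.version.cuda` and with the folder that `$CUDA_PATH` points to should reveal whether two toolkits are being mixed.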

Best regards, Jean

smiles724 commented 8 months ago

Just downgrade the version of pykeops from 2.0+ to 1.5. Things go well then. I suppose the latest version is not stable yet.

getkeops / keops

pykeops 2.0 Pytorch backward pass error: "AttributeError: 'Var' object has no attribute 'assign'" #238