getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License
1.04k stars 64 forks source link

Strange error that only happens with certain input dimensions #262

Closed Mithrillion closed 1 year ago

Mithrillion commented 2 years ago

I have encountered an error that happens only when the input dimension to a kNN formula has certain dimensions. Here is my test code:

import torch
import typing
import pykeops.torch as ktorch

def cross_knn_search(
    A: torch.Tensor, B: torch.Tensor, k: int
) -> typing.Tuple[torch.Tensor, torch.Tensor]:
    X_i = ktorch.LazyTensor(A[:, None, :])
    X_j = ktorch.LazyTensor(B[None, :, :])
    D_ij = ((X_i - X_j) ** 2).sum(-1)
    D, I = D_ij.Kmin_argKmin(K=k, dim=1)
    D = torch.sqrt(D)
    return D, I

input_a = torch.randn(204, 405 * 61)
input_b = torch.randn(205, 405 * 61)

D, I = cross_knn_search(input_b, input_a, 1)

This returns the error (full trace below) ValueError: [KeOps] Error : args must be c_variable or c_array instances (error at line 382 in file .../lib/python3.9/site-packages/keopscore/utils/code_gen_utils.py)

The same code runs for almost any other input dimensions, such as (204, 405 * 61+1), (204, 405 * 62) or (204, 405 * 60). The error persists if I change the formula to argKmin or Kmin only, and it happens with both numpy and torch bindings. Clearing KeOps cache does not seem to work either.

My PyKeOps version is 2.1 release, Python version is 3.9 and here is the nvcc message:

Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0

The system is running Ubuntu 22.04. The GPU used is an RTX3090.

Edit: error reproduceable on colab: https://colab.research.google.com/drive/1Zu93CDL6KTLOV_skg1uUJ1rcAdRoUJFI?usp=sharing

The full trace is:

.../error_test.py in <cell line: 22>()
     19 input_a = torch.randn(204, 405 * 61)
     20 input_b = torch.randn(205, 405 * 61)
---> 22 D, I = cross_knn_search(input_b, input_a, 1)

.../error_test.py in cross_knn_search(A, B, k)
     12 X_j = ktorch.LazyTensor(B[None, :, :])
     13 D_ij = ((X_i - X_j) ** 2).sum(-1)
---> 14 D, I = D_ij.Kmin_argKmin(K=k, dim=1)
     15 D = torch.sqrt(D)
     16 return D, I

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/common/lazy_tensor.py:2184, in GenericLazyTensor.Kmin_argKmin(self, K, axis, dim, **kwargs)
   2166 def Kmin_argKmin(self, K, axis=None, dim=None, **kwargs):
   2167     r"""
   2168     K-Min-argK-min reduction.
   2169 
   (...)
   2182 
   2183     """
-> 2184     return self.reduction("KMin_ArgKMin", opt_arg=K, axis=axis, dim=dim, **kwargs)

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/common/lazy_tensor.py:755, in GenericLazyTensor.reduction(self, reduction_op, other, opt_arg, axis, dim, call, is_complex, **kwargs)
    744     res.callfun = res.Genred(
    745         res.formula,
    746         [],
   (...)
    752         rec_multVar_highdim=res.rec_multVar_highdim,
    753     )
    754 if call and len(res.symbolic_variables) == 0 and res._dtype is not None:
--> 755     return res()
    756 else:
    757     return res

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/common/lazy_tensor.py:937, in GenericLazyTensor.__call__(self, *args, **kwargs)
    934     # we replace by other
    935     args = (self.other.variables[0],)
--> 937 return self.callfun(*args, *self.variables, **self.kwargs)

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/torch/generic/generic_red.py:624, in Genred.__call__(self, backend, device_id, ranges, out, *args)
    619     elif nred > 2048 and dtype in ("float16", "half"):
    620         raise ValueError(
    621             "size of input array is too large for Arg type reduction with float16 dtype.."
    622         )
--> 624 out = GenredAutograd.apply(
    625     self.formula,
    626     self.aliases,
    627     backend,
    628     dtype,
    629     device_id,
    630     ranges,
    631     self.optional_flags,
    632     self.rec_multVar_highdim,
    633     nx,
    634     ny,
    635     out,
    636     *args
    637 )
    639 return postprocess(out, "torch", self.reduction_op, nout, self.opt_arg, dtype)

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/torch/generic/generic_red.py:78, in GenredAutograd.forward(ctx, formula, aliases, backend, dtype, device_id_request, ranges, optional_flags, rec_multVar_highdim, nx, ny, out, *args)
     72             raise ValueError(
     73                 "[KeOps] Gpu device id of arrays is different from device id requested for computation."
     74             )
     76 from pykeops.common.keops_io import keops_binder
---> 78 myconv = keops_binder["nvrtc" if tagCPUGPU else "cpp"](
     79     tagCPUGPU,
     80     tag1D2D,
     81     tagHostDevice,
     82     use_ranges,
     83     device_id_request,
     84     formula,
     85     aliases,
     86     len(args),
     87     dtype,
     88     "torch",
     89     optional_flags,
     90 ).import_module()
     92 # Context variables: save everything to compute the gradient:
     93 ctx.formula = formula

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/utils/Cache.py:68, in Cache_partial.__call__(self, *args)
     66     self.library[str_id] = self.cls(params, fast_init=True)
     67 else:
---> 68     obj = self.cls(*args)
     69     self.library_params[str_id] = obj.params
     70     self.library[str_id] = obj

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps_nvrtc.py:15, in LoadKeOps_nvrtc_class.__init__(self, fast_init, *args)
     14 def __init__(self, *args, fast_init=False):
---> 15     super().__init__(*args, fast_init=fast_init)

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps.py:18, in LoadKeOps.__init__(self, fast_init, *args)
     16     self.params = args[0]
     17 else:
---> 18     self.init(*args)
     19 self.dimout = self.params.dim
     20 self.tagIJ = self.params.tagI

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/pykeops/common/keops_io/LoadKeOps.py:126, in LoadKeOps.init(self, tagCPUGPU, tag1D2D, tagHostDevice, use_ranges, device_id_request, formula, aliases, nargs, dtype, lang, optional_flags)
    104 if use_ranges:
    105     map_reduce_id += "_ranges"
    107 (
    108     self.params.tag,
    109     self.params.source_name,
    110     self.params.low_level_code_file,
    111     self.params.tagI,
    112     self.params.tagZero,
    113     self.params.use_half,
    114     self.params.cuda_block_size,
    115     self.params.use_chunk_mode,
    116     self.params.tag1D2D,
    117     self.params.dimred,
    118     self.params.dim,
    119     self.params.dimy,
    120     indsi,
    121     indsj,
    122     indsp,
    123     dimsx,
    124     dimsy,
    125     dimsp,
--> 126 ) = get_keops_dll(
    127     map_reduce_id,
    128     self.params.red_formula_string,
    129     self.params.enable_chunks,
    130     self.params.enable_final_chunks,
    131     self.params.mult_var_highdim,
    132     self.params.aliases,
    133     nargs,
    134     self.params.c_dtype,
    135     self.params.c_dtype_acc,
    136     self.params.sum_scheme,
    137     self.params.tagHostDevice,
    138     tagCPUGPU,
    139     tag1D2D,
    140     self.params.use_half,
    141     device_id_request,
    142 )
    144 # now we switch indsi, indsj and dimsx, dimsy in case tagI=1.
    145 # This is to be consistent with the convention used in the old
    146 # bindings (see functions GetIndsI, GetIndsJ, GetDimsX, GetDimsY
    147 # from file binder_interface.h. Clearly we could do better if we
    148 # carefully rewrite some parts of the code
    149 if self.params.tagI == 1:

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/utils/Cache.py:27, in Cache.__call__(self, *args)
     25 str_id = "".join(list(str(arg) for arg in args)) + str(env_param)
     26 if not str_id in self.library:
---> 27     self.library[str_id] = self.fun(*args)
     28 return self.library[str_id]

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/get_keops_dll.py:124, in get_keops_dll_impl(map_reduce_id, red_formula_string, enable_chunks, enable_finalchunks, mul_var_highdim, aliases, *args)
    121 else:
    122     tagZero = 0
--> 124 res = map_reduce_obj.get_dll_and_params()
    126 tag1D2D = 0 if tagZero == 1 else res["tag1D2D"]
    128 return (
    129     res["tag"],
    130     res["source_file"],
   (...)
    146     res["dimsp"],
    147 )

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/binders/LinkCompile.py:101, in LinkCompile.get_dll_and_params(self)
     95 if not os.path.exists(self.file_to_check):
     96     KeOps_Message(
     97         "Generating code for formula " + self.red_formula.__str__() + " ... ",
     98         flush=True,
     99         end="",
    100     )
--> 101     self.generate_code()
    102     self.save_info()
    103     KeOps_Message("OK", use_tag=False, flush=True)

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/binders/nvrtc/Gpu_link_compile.py:63, in Gpu_link_compile.generate_code(self)
     60 def generate_code(self):
     61     # method to generate the code and compile it
     62     # generate the code and save it in self.code, by calling get_code method from GpuReduc class :
---> 63     self.get_code()
     64     # write the code in the source file
     65     self.write_code()

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/mapreduce/gpu/GpuReduc1D_chunks.py:190, in GpuReduc1D_chunks.get_code(self)
    162 chunk_sub_routine = do_chunk_sub(
    163     dtype,
    164     red_formula,
   (...)
    186     param_loc,
    187 )
    189 last_chunk = c_variable("int", f"{chk.nchunks - 1}")
--> 190 chunk_sub_routine_last = do_chunk_sub(
    191     dtype,
    192     red_formula,
    193     chk.fun_lastchunked,
    194     chk.dimlastchunk,
    195     chk.dimsx_last,
    196     chk.dimsy_last,
    197     chk.indsi,
    198     chk.indsj,
    199     chk.indsi_lastchunked,
    200     chk.indsj_lastchunked,
    201     acc,
    202     tile,
    203     i,
    204     j,
    205     jstart,
    206     last_chunk,
    207     nx,
    208     ny,
    209     arg,
    210     fout_chunk,
    211     xi,
    212     yj,
    213     yjrel,
    214     param_loc,
    215 )
    217 foutj = c_array(dtype, chk.dimout_chunk, "foutj")
    218 chktable_out = table4(
    219     chk.nminargs + 1,
    220     chk.dimsx,
   (...)
    231     foutj,
    232 )

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/mapreduce/gpu/GpuReduc1D_chunks.py:99, in do_chunk_sub(dtype, red_formula, fun_chunked_curr, dimchunk_curr, dimsx, dimsy, indsi, indsj, indsi_chunked, indsj_chunked, acc, tile, i, j, jstart, chunk, nx, ny, arg, fout, xi, yj, yjrel, param_loc)
     71 chktable = table(
     72     chk.nminargs,
     73     dimsx,
   (...)
     81     param_loc,
     82 )
     83 foutj = c_variable(pointer(dtype), "foutj")
     85 return f"""
     86             // Starting chunk_sub routine
     87             {fout_tmp_chunk.declare()}
     88             if ({i.id} < {nx.id}) {{
     89                 {load_chunks_routine_i}
     90             }} 
     91             if (j < ny) {{ // we load yj from device global memory only if j<ny
     92                 {load_chunks_routine_j}
     93             }}
     94             __syncthreads();
     95             if ({i.id} < {nx.id}) {{ // we compute only if needed
     96                 {dtype} *yjrel = {yj.id}; // Loop on the columns of the current block.
     97                 for (int jrel = 0; (jrel < blockDim.x) && (jrel < {ny.id} - jstart); jrel++, yjrel += {chk.dimy}) {{
     98                     {dtype} *foutj = {fout.id} + jrel*{chk.fun_chunked.dim};
---> 99                     {fun_chunked_curr(fout_tmp_chunk, chktable)}
    100                     {chk.fun_chunked.acc_chunk(foutj, fout_tmp_chunk)}
    101                 }}
    102             }} 
    103             __syncthreads();
    104             // Finished chunk_sub routine
    105         """

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/formulas/Operation.py:82, in Operation.__call__(self, out, table)
     80         string += f"{arg.declare()}\n"
     81         # Now we evaluate the child operation and append the result into string
---> 82         string += child(arg, table)
     83     args.append(arg)
     84 # Finally, evaluation of the operation itself

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/formulas/Operation.py:82, in Operation.__call__(self, out, table)
     80         string += f"{arg.declare()}\n"
     81         # Now we evaluate the child operation and append the result into string
---> 82         string += child(arg, table)
     83     args.append(arg)
     84 # Finally, evaluation of the operation itself

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/formulas/Operation.py:85, in Operation.__call__(self, out, table)
     83     args.append(arg)
     84 # Finally, evaluation of the operation itself
---> 85 string += self.Op(out, table, *args)
     87 # some debugging helper :
     88 if debug_ops_at_exec:

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/formulas/VectorizedScalarOp.py:26, in VectorizedScalarOp.Op(self, out, table, *args)
     23 def Op(self, out, table, *args):
     24     # Atomic evaluation of the operation : it consists in a simple
     25     # for loop around the call to the correponding scalar operation
---> 26     return VectApply(self.ScalarOp, out, *args)

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/utils/code_gen_utils.py:382, in VectApply(fun, out, *args)
    380         dims.append(arg.dim)
    381     else:
--> 382         KeOps_Error("args must be c_variable or c_array instances")
    383 dimloop = max(dims)
    384 if not set(dims) in ({dimloop}, {1, dimloop}):

File ~/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/utils/misc_utils.py:28, in KeOps_Error(message, show_line_number)
     26     frameinfo = getframeinfo(currentframe().f_back)
     27     message += f" (error at line {frameinfo.lineno} in file {frameinfo.filename})"
---> 28 raise ValueError(message)

ValueError: [KeOps] Error : args must be c_variable or c_array instances (error at line 382 in file /home/user/anaconda3/envs/tsr/lib/python3.9/site-packages/keopscore/utils/code_gen_utils.py)
jeanfeydy commented 2 years ago

Hi @Mithrillion,

Thanks for your detailed bug report: this is very unexpected indeed! The error is related to the "chunks" optimization that has been implemented by @joanglaunes: I'm sure that he will be able to fix it after the summer vacations.

Until then, and in any case, I wouldn't advise you to use KeOps with such high-dimensional features. As detailed in our NeurIPS paper, a good rule of thumb is that KeOps stops being that useful in spaces of dimension D > 50 or 100. Since you are working with features in a space of dimension 24,705 (!), I would suggest that you pick one of the following 3 strategies:

  1. Use a dimensionality reduction technique (e.g. PCA or UMap) to create feature vectors of dimension 16-32 and then use KeOps. This would also allow you to have a better control about your algorithm - the curse of dimensionality is very real. Please also note that in your (toy?) example, you only have 205 samples so you should be able to embed your points perfectly in a space of dimension 204 with PCA - all the other features are redundant.

  2. Use a vanilla PyTorch implementation to compute your matrix of squared distances and perform an argmin reduction. You should use a polar identity to compute your distance matrix as (A**2).sum(1).view(N,1) + (B**2).sum(1).view(1,M) - 2 * A @ B.T to leverage the fast matrix-matrix multiply kernels. It is also likely that you could get away with float16 numerical precision to speed things up further and leverage your Tensor cores. See this page in the PyTorch doc for more info.

  3. If you are dealing with lots of samples (> 10k) and really need all of the 24k input features, use a dedicated library such as FAISS. The ANN-benchmarks website is a great resource.

What do you think? Best regards, Jean

Mithrillion commented 2 years ago

@jeanfeydy Thanks for the response! Glad the cause of this strange problem is found. And thanks for the advice on implementation. This indeed is more of a toy example. In practice, I noticed similar issues for much lower dimensions (a few specific values between 64-512 for a particular input), and I usually work around that by changing the input dimension. The "huge" input dimension I used here was discovered when I tried to compare naive flattened Euclidean distance kNN with distance over a much more sensible feature set, using the same implementation, therefore the weird batch size to dimension size ratio. The original data is actually a time series with some redundant dimensions.

But great advice still! I do use PCA and UMAP frequently myself, and the batch size for my data is roughly in the range where KeOps is faster than index-based kNN methods. KeOps works well for me because I only need a sparse kNN matrix (the full pairwise matrix is way too large). It manages memory and GPU utilisation better than any other tools I have.

Thanks again!

jeanfeydy commented 2 years ago

Hi @Mithrillion, I see: thanks again for the detailed explanation of your use case, such a feedback is very useful to us. We'll close the issue once the bug in the "chunks" optimization is solved then, probably in September. Best regards, Jean

joanglaunes commented 1 year ago

Hello @Mithrillion and @jeanfeydy , About this bug, I finally fixed it. It was indeed the "chunks" computation mode : in the example the dimension is 405 61 = 24705 which equals 38664+1. It means that in the cuda kernel the variables are cut into 386 segments of dimension 64 each - to avoid loading the whole variables in the local memory - plus one remaining segment of dimension 1, and this last chunk of dimension 1 was causing the problem.