Hi @Mithrillion,
Thanks for your detailed bug report: this is very unexpected indeed! The error is related to the "chunks" optimization that has been implemented by @joanglaunes: I'm sure that he will be able to fix it after the summer vacations.
Until then, and in any case, I wouldn't advise you to use KeOps with such high-dimensional features. As detailed in our NeurIPS paper, a good rule of thumb is that KeOps stops being that useful in spaces of dimension D > 50 or 100. Since you are working with features in a space of dimension 24,705 (!), I would suggest that you pick one of the following 3 strategies:
Use a dimensionality reduction technique (e.g. PCA or UMAP) to create feature vectors of dimension 16-32 and then use KeOps. This would also give you better control over your algorithm - the curse of dimensionality is very real. Please also note that in your (toy?) example, you only have 204 samples, so you should be able to embed your points perfectly in a space of dimension 204 with PCA - all the other features are redundant.
Use a vanilla PyTorch implementation to compute your matrix of squared distances and perform an argmin reduction. You should use the polarization identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2<a, b> to compute your distance matrix as (A**2).sum(1).view(N, 1) + (B**2).sum(1).view(1, M) - 2 * A @ B.T, which leverages the fast matrix-matrix multiply kernels. It is also likely that you could get away with float16 numerical precision to speed things up further and leverage your Tensor Cores; see this page in the PyTorch doc for more info. (A sketch of this approach is given just after this list.)
If you are dealing with lots of samples (> 10k) and really need all of the 24k input features, use a dedicated library such as FAISS. The ANN-benchmarks website is a great resource.
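To make strategy 2 concrete, here is a minimal sketch; the function name, the choice of k, and the float16 usage example are illustrative assumptions, not part of the original suggestion:

```python
import torch

def knn_bruteforce(A, B, k=10):
    # Squared distances via the polarization identity:
    #   ||a - b||^2 = ||a||^2 + ||b||^2 - 2 <a, b>,
    # so that a single matrix-matrix product does the heavy lifting.
    N, M = A.shape[0], B.shape[0]
    sq_A = (A ** 2).sum(1).view(N, 1)       # (N, 1) squared norms of A
    sq_B = (B ** 2).sum(1).view(1, M)       # (1, M) squared norms of B
    D2 = sq_A + sq_B - 2 * A @ B.T          # (N, M) squared distance matrix
    # topk with largest=False returns the k smallest distances and their indices:
    return D2.topk(k, dim=1, largest=False)

# Usage: float16 inputs engage the Tensor Cores on recent GPUs.
# (For very large D, consider accumulating the norms in float32 to avoid fp16 overflow.)
A = torch.randn(204, 405 * 61, device="cuda", dtype=torch.float16)
B = torch.randn(204, 405 * 61, device="cuda", dtype=torch.float16)
dists, idx = knn_bruteforce(A, B, k=5)
```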
What do you think? Best regards, Jean
@jeanfeydy Thanks for the response! Glad the cause of this strange problem has been found. And thanks for the advice on implementation. This is indeed more of a toy example. In practice, I have noticed similar issues at much lower dimensions (a few specific values between 64 and 512 for a particular input), and I usually work around them by changing the input dimension. The "huge" input dimension I used here came up when I tried to compare naive flattened Euclidean-distance kNN with distance over a much more sensible feature set, using the same implementation - hence the weird batch-size-to-dimension ratio. The original data is actually a time series with some redundant dimensions.
But great advice still! I do use PCA and UMAP frequently myself, and the batch size for my data is roughly in the range where KeOps is faster than index-based kNN methods. KeOps works well for me because I only need a sparse kNN matrix (the full pairwise matrix is way too large), and it manages memory and GPU utilisation better than any other tool I have.
Thanks again!
Hi @Mithrillion, I see: thanks again for the detailed explanation of your use case; such feedback is very useful to us. We'll close the issue once the bug in the "chunks" optimization is solved - probably in September. Best regards, Jean
Hello @Mithrillion and @jeanfeydy, About this bug: I finally fixed it. It was indeed the "chunks" computation mode: in the example, the dimension is 405 * 61 = 24705, which equals 386 * 64 + 1. It means that in the CUDA kernel the variables are cut into 386 segments of dimension 64 each - to avoid loading the whole variables into local memory - plus one remaining segment of dimension 1, and this last chunk of dimension 1 was causing the problem.
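For reference, the chunk arithmetic described above can be checked in a couple of lines (the segment size of 64 is the one named in the explanation; the variable names are mine):

```python
D = 405 * 61                 # 24705, the problematic feature dimension
n_full, rem = divmod(D, 64)  # 64 = segment size used by the "chunks" mode
print(n_full, rem)           # 386 full segments of dim 64, plus one leftover of dim 1
```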
I have encountered an error that happens only for certain input dimensions in a kNN formula. Here is my test code:
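(The original snippet is not preserved in this transcript; a hypothetical reproduction along the following lines, using pykeops.torch with the (204, 405 * 61) shapes discussed below, triggers the same failure.)

```python
import torch
from pykeops.torch import LazyTensor

N, D = 204, 405 * 61                # D = 24705 is the problematic dimension
x = torch.randn(N, D, device="cuda")
y = torch.randn(N, D, device="cuda")

x_i = LazyTensor(x[:, None, :])     # (N, 1, D) symbolic tensor
y_j = LazyTensor(y[None, :, :])     # (1, N, D) symbolic tensor

D_ij = ((x_i - y_j) ** 2).sum(-1)   # symbolic (N, N) squared distances
idx = D_ij.argKmin(5, dim=1)        # k-NN reduction: raises the ValueError
```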
This returns the error (full trace below)
ValueError: [KeOps] Error : args must be c_variable or c_array instances (error at line 382 in file .../lib/python3.9/site-packages/keopscore/utils/code_gen_utils.py)
The same code runs for almost any other input dimension, such as (204, 405 * 61 + 1), (204, 405 * 62), or (204, 405 * 60). The error persists if I change the formula to argKmin or Kmin only, and it happens with both the numpy and torch bindings. Clearing the KeOps cache does not seem to help either. My PyKeOps version is the 2.1 release, Python version is 3.9, and here is the nvcc message:
The system is running Ubuntu 22.04. The GPU used is an RTX3090.
Edit: error reproducible on Colab: https://colab.research.google.com/drive/1Zu93CDL6KTLOV_skg1uUJ1rcAdRoUJFI?usp=sharing
The full trace is: