getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Conversion from symbolic LazyTensor to standard dense array #126

Open hypnopump opened 3 years ago

hypnopump commented 3 years ago

Are there any plans to support evaluation of a LazyTensor? I want to retrieve the kernel matrix, and it has to be M x M, not M x K.

Also, I think that would increase the library's interoperability with other software (I'm thinking of something like PyTorch's .numpy() conversion).

jeanfeydy commented 3 years ago

Hi @EricAlcaide ,

Indeed, we plan to support a .dense() conversion operator at some point in the future. It would be most useful when combined with a sub-sampling indexing operator, to implement e.g. the Nyström rule efficiently. Note, however, that we see this as a fairly low-priority issue for the moment. In practice, a simple workaround is to write the same computation "twice": once in KeOps to define the LazyTensor, and once in NumPy/PyTorch to get access to the dense array. Writing advanced tutorials and benchmarks, accelerating compilation times and adding support for Tensor cores are our top priorities for this winter / early 2021. I will probably write this .dense() operator around March-April, as part of our plan to support approximate sum reductions.
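To make the "write it twice" workaround concrete, here is a minimal sketch with a Gaussian kernel: the same formula is written once in plain NumPy (to materialize the dense M x M matrix) and once with a KeOps LazyTensor (for fast reductions that never build the matrix). The point values and kernel choice are illustrative, and the KeOps half is guarded in case pykeops is not installed.

```python
import numpy as np

x = np.random.randn(100, 3)  # M = 100 points in dimension 3

# Dense NumPy version: explicitly builds the M x M kernel matrix.
sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K_dense = np.exp(-sq_dists)

# KeOps version of the *same* formula: symbolic, never materialized.
try:
    from pykeops.numpy import LazyTensor

    x_i = LazyTensor(x[:, None, :])  # shape (M, 1, 3)
    x_j = LazyTensor(x[None, :, :])  # shape (1, M, 3)
    K_ij = (-((x_i - x_j) ** 2).sum(-1)).exp()  # symbolic (M, M) kernel

    b = np.random.randn(100, 1)
    out = K_ij @ b  # reduction: computed on the fly, no M x M array built
except ImportError:
    pass  # pykeops unavailable; the dense version above still works
```

The dense copy is what you would slice or inspect, while the LazyTensor copy is what you would feed to large-scale reductions.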

Out of curiosity, what kind of use case do you have in mind for this feature? User feedback is key to prioritizing issues, and the asymmetric nature of online conferences this year really does not help us to understand what types of computations would be most useful to the community :-)

Best regards, Jean

hypnopump commented 3 years ago

So, I'm trying to implement an Isolation Kernel in NumPy/PyTorch, and I was wondering whether KeOps could help me get better speeds. The source code can be found here: https://gist.github.com/EricAlcaide/4a342b91d80b5260f8f7d3c0bca6a150 (I was trying to speed up the calculation of the kernel matrix, but any other performance improvement is welcome).

alexisthual commented 3 years ago

Hi, thanks for the great work @jeanfeydy ! I am also interested in this feature. Moreover, I think it could be useful to some users to slice a LazyTensor before calling .dense().

In case you are interested, I describe my use case below. After defining a LazyTensor for the transport plan pi from the biased dual potentials, I want to assess some properties of the underlying matching. For instance, let G be an N x N dense matrix of distances between all pairs of points (for a distance which is not the one I used to compute pi; it could be a geodesic distance on a mesh, for instance). I am interested in computing (pi * G).sum(dim=1), which would yield the average distance from any source point to all the target points it has been matched with. So far, I haven't found a way to turn G from a 2D NumPy array into a LazyTensor with a shape such that pi * G is what I want, hence my wish to turn the LazyTensor pi into a dense NumPy array and perform this simple computation on CPU.

I hope this makes sense!
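For reference, once pi is available as a dense array, the computation described above is a one-liner in NumPy. This is a minimal sketch with random placeholder values for pi and G (the normalization by pi.sum(axis=1) is what turns the weighted sum into an average):

```python
import numpy as np

N = 50
pi = np.random.rand(N, N)
pi /= pi.sum()            # a transport plan: non-negative, sums to 1
G = np.random.rand(N, N)  # dense matrix of (e.g. geodesic) distances

# Average distance from each source point to the target points it is
# matched with, weighted by the transport plan.
avg_dist = (pi * G).sum(axis=1) / pi.sum(axis=1)
```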

jeanfeydy commented 3 years ago

Hi @alexisthual ,

I agree! Since last December, our position on the subject has evolved quite a bit, and we now agree that this is a key feature to implement in the next few releases. However, just like other similar features (e.g. support for Nystroem rules or Tensor cores), we have postponed its development until the release of our new compilation engine: @joanglaunes and @bcharlier have been working extensively on recasting the numerical foundations of KeOps this summer to smash compilation times and reduce the complexity of our C++ codebase. Hopefully, we will be able to merge their work in the next few months and move back to adding new features soon.

With respect to your problem: the best option currently may be to use the dual potentials to build the transport plan, tile by tile, using vanilla PyTorch. As you know, the equation reads something like:

pi_ij = a_i * b_j * exp((f_i + g_j - C(x_i, y_j)) / eps) = (a_i * exp(f_i / eps)) * K_ij * (b_j * exp(g_j / eps))

Building the transport plan (possibly tile by tile) on the CPU as a scaling of the kernel matrix should be very doable. I would not recommend doing this within the Sinkhorn loop, for the sake of speed and numerical accuracy... But in your case, as a "final performance metric" that does not rely on log(pi_ij), it seems very reasonable.
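The tile-by-tile construction suggested above can be sketched in plain NumPy as follows. The potentials f, g, the weights a, b, the squared-distance cost and eps are all illustrative placeholders standing in for the outputs of an actual Sinkhorn solver:

```python
import numpy as np

N, tile = 100, 25
x = np.random.randn(N, 2)   # source points
y = np.random.randn(N, 2)   # target points
a = np.full(N, 1.0 / N)     # source weights
b = np.full(N, 1.0 / N)     # target weights
f = np.random.randn(N)      # dual potential on x (placeholder values)
g = np.random.randn(N)      # dual potential on y (placeholder values)
eps = 0.1                   # entropic regularization strength

pi = np.empty((N, N))
for i0 in range(0, N, tile):
    sl = slice(i0, i0 + tile)
    # Squared-distance cost restricted to this block of rows:
    C = ((x[sl, None, :] - y[None, :, :]) ** 2).sum(-1)
    # pi_ij = a_i * b_j * exp((f_i + g_j - C_ij) / eps), row block by row block:
    pi[sl] = a[sl, None] * b[None, :] * np.exp((f[sl, None] + g[None, :] - C) / eps)
```

Only one tile x N block of the cost is held in memory at a time, which keeps the peak footprint under control for large N.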

What do you think?

Best regards, Jean

alexisthual commented 3 years ago

Oh yes! That should very much do it, thanks for the tip :slightly_smiling_face: