getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Differentiation wrt. a single vector dimension? Application to the computation of the Laplacian #283

Open AdrienWohrer opened 1 year ago

AdrienWohrer commented 1 year ago

Hi, while I keep discovering KeOps, I now have a question regarding the symbolic implementation of the Laplacian. Assume I want to compute \sum_j \Delta K(x_i - y_j), where K is a standard Gaussian function and \Delta is the Laplacian operator.

I found two ways of coding this into KeOps. First version, with a hard-coded formula:

# Assumes: from pykeops.torch import Vi, Vj (or the pykeops.numpy equivalents)
def GaussLapKernel(sigma, D):
    x, y = Vi(0, D), Vj(1, D)
    D2 = x.sqdist(y)
    K = (-D2 / (2 * sigma**2)).exp()
    # Laplacian of the Gaussian: K * (|x-y|^2 - D*sigma^2) / sigma^4, summed over j
    return (K * (D2 - D * sigma**2) / sigma**4).sum_reduction(axis=1)

which works fine (after your recent fix to my previous issue).
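For concreteness, a minimal call might look like this (a sketch assuming the PyTorch backend; tensor names and sizes are purely illustrative):

import torch
from pykeops.torch import Vi, Vj   # needed before defining GaussLapKernel above

M, N, D, sigma = 1000, 2000, 3, 0.5
x = torch.randn(M, D)
y = torch.randn(N, D)
lap_red = GaussLapKernel(sigma, D)   # build the symbolic reduction once
out = lap_red(x, y)                  # shape (M, 1): sum_j Delta K(x_i - y_j)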

However, I also considered the following implementation, which uses KeOps' symbolic differentiation capabilities and would make it possible to replace the Gaussian kernel K with any other kernel whenever needed:

def GaussK(sigma):
    def K(z):
        return (-(z**2).sum(-1)/(2*sigma**2)).exp()
    return K

def LapKernelRed(K, D):
    x, y = Vi(0, D), Vj(1, D)
    K1 = K(x - y).grad(x, 1)                 # symbolic gradient of K wrt x
    Klap = K1.elem(0).grad(x, 1).elem(0)     # d(gradK[0])/dx[0]
    for i in range(1, D):
        Klap = Klap + K1.elem(i).grad(x, 1).elem(i)   # add d(gradK[i])/dx[i]
    return Klap.sum_reduction(axis=1)

and thus call LapKernelRed(GaussK(sigma),D). This idea worked fine to produce reductions based on the kernel K itself, and on its gradient as well (basically, performing the reduction directly on K1 in the code above). Execution time is as good as, or even slightly better than, with the hard-coded formulas. However, for the Laplacian, this "generic" implementation is slower than the hard-coded formula, by a large margin when the data get large. I assume this is because, in my "generic" computation, I am forced to compute full derivatives wrt. x (so basically, all terms of the Hessian), even though I am actually only interested in the derivative of gradK[i] wrt x[i]. Is there any way to do this in KeOps? At least, the obvious attempt K1.elem(i).grad(x.elem(i),1) did not work.
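For reference, "performing the reduction directly on K1" amounts to something like the following sketch (the function name is purely illustrative):

def GradKernelRed(K, D):
    x, y = Vi(0, D), Vj(1, D)
    K1 = K(x - y).grad(x, 1)          # symbolic gradient of K wrt x (gradin = 1)
    return K1.sum_reduction(axis=1)   # sum_j grad_x K(x_i - y_j), one row per x_i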

joanglaunes commented 1 year ago

Hello @AdrienWohrer, your guess is correct: the generic implementation is slow because it involves computing all terms of the Hessian. Also, by summing over i in the loop, the resulting formula should involve computing this same Hessian D times. This is probably simplified by C++ compiler optimizations, but we cannot know for sure. In previous versions of KeOps (prior to 2.0) we had an "AutoFactorize" operation that could automatically simplify such formulas by detecting identical sub-formulas, but it was so demanding for the compiler that we decided to disable it. With the new Python engine, we could definitely add this automatic simplification; it should be completely fine now.

But actually the real answer for getting the Laplacian would be to implement a trace operation in KeOps. In your code, you could then get the Laplacian by doing something like H=K1.grad(x,v); Klap=trace(H,v), where v would be an additional symbolic variable, and trace(H,v) would correspond to the trace of H when considered as a linear function of v. This trace operation would act at the symbolic level, so the resulting formula would not contain the Hessian. Coding such a trace operator is not straightforward, because it cannot be implemented as a regular mathematical function; we must derive the corresponding chain-rule logic, as is the case for grad. But it should not be that hard.
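For reference, the identity behind this is simply

$$\Delta_x K \;=\; \operatorname{tr}\big(\nabla_x^2 K\big) \;=\; \sum_{i=1}^{D} \frac{\partial^2 K}{\partial x_i^2},$$

so taking the trace symbolically, with the Hessian viewed as a linear function of v, yields the Laplacian without the generated formula ever containing the full D x D Hessian.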

AdrienWohrer commented 1 year ago

Hi, thanks for the detailed explanation. For the moment, I will thus stick to the hard-coded formula, which is fast and fine for my intended usage. Anyway, in the particular case of the Gaussian kernel, the trace operation of the Laplacian brings out the squared norm D2 = |x-y|^2, which has the computational advantage of having already been computed above. My guess is that automatic differentiation -- even smart and optimized -- could not detect that simplification, so it would always be somewhat slower than the hard-coded formula. (Well, I'm just guessing!) Thanks again for your great work.
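To make this explicit in the Gaussian case (with D2 = |x-y|^2 as in the code above):

$$K = \exp\!\Big(-\frac{|x-y|^2}{2\sigma^2}\Big), \qquad \nabla_x K = -\frac{x-y}{\sigma^2}\,K, \qquad \Delta_x K = \frac{|x-y|^2 - D\,\sigma^2}{\sigma^4}\,K,$$

so the same squared norm appears both inside the exponential and in the prefactor, which is exactly the reuse exploited by the hard-coded formula.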

joanglaunes commented 1 year ago

In fact there is a little subtlety: in the hard-coded formula, the fact that you define D2 = |x-y|^2 and then use it twice to define Klap does not imply that it will be computed only once when calling the reduction... Currently, in the formula that is given to the internal engine for generating the corresponding CUDA code, the sequence of operations for computing D2 in fact appears twice. Explicitly telling KeOps to compute D2 only once, saving the result to a temporary variable, is related to the "factorize" operation I was referring to in my previous answer. It is a feature we should reintroduce in KeOps.

But for such rather simple formulas, we have noticed that the compiler itself, when compiling our generated C++ code, detects pieces of code that perform the same calculations several times, and does the factorizing for us. I really think this is the case for your hard-coded formula, so that the execution time should already be optimal. And finally, with a generic implementation of the Laplacian, if we manage to do it properly, it should also be possible to detect the simplification automatically, so that the generic implementation will be optimal as well.

AdrienWohrer commented 1 year ago

Ok, now it's really clear. Thanks !

joanglaunes commented 8 months ago

Hello @AdrienWohrer, sorry for not having kept you updated. I worked on this a few months ago, and we released v2.2 of KeOps last week, which includes an implementation of the Laplacian and other operations such as gradient and divergence, in the way I was explaining in my previous comments: I implemented a symbolic trace operation, which makes it possible to get the formula of the Laplacian while avoiding the computation of the full Hessian. So, coming back to your initial message, you can now define a Laplacian kernel this way:

def GaussLapKernel(sigma, D):
    x, y = Vi(0, D), Vj(1, D)
    D2 = x.sqdist(y)
    K = (-D2 / (2 * sigma**2)).exp()
    Klap = K.laplacian(x)              # symbolic Laplacian wrt x (new in v2.2)
    return Klap.sum_reduction(axis=1)

This is much faster than using the first generic implementation with the Hessian, but still around twice as slow as the hard-coded formula, because KeOps does not manage to obtain a simplified version of the analytical expression, due to the 1/(2*sigma**2) constant which slightly complicates it. So finally I found a way to make this generic computation as fast as the hard-coded formula, by keeping the sigma constant out of the computation of the Laplacian, as follows:

import math

def GaussLapKernel(sigma, D):
    x, y = Vi(0, D), Vj(1, D)
    D2 = x.sqdist(y)
    K = (-D2).exp()                        # unit-scale Gaussian exp(-|x-y|^2)
    Klap = K.laplacian(x)
    Klapsum = Klap.sum_reduction(axis=1)
    c = 1 / math.sqrt(2) / sigma
    # rescale inputs and output to recover the Laplacian of exp(-|x-y|^2/(2*sigma^2))
    return lambda x, y: c**2 * Klapsum(x * c, y * c)

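To see why the rescaling is correct: writing c = 1/(sqrt(2)*sigma) and \tilde{K}(u) = exp(-|u|^2), we have

$$\exp\!\Big(-\frac{|x-y|^2}{2\sigma^2}\Big) = \tilde{K}\big(c\,(x-y)\big), \qquad \Delta_x\,\tilde{K}\big(c\,(x-y)\big) = c^2\,\big(\Delta \tilde{K}\big)\big(c\,(x-y)\big),$$

which is exactly the c**2 * Klapsum(x*c, y*c) returned by the lambda above.
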
AdrienWohrer commented 8 months ago

Thanks @joanglaunes and KeOps team, this looks great. I will try to switch to KeOps v2.2 some time soon.