getkeops / keops

KErnel OPerationS, on CPUs and GPUs, with autodiff and without memory overflows
https://www.kernel-operations.io
MIT License

Feature Request: SoftDTW and variants thereof #348

Open Mrjoeybux opened 10 months ago

Mrjoeybux commented 10 months ago

I'm looking into kernel methods for time series and would like to use an exponentiated SoftDTW as a kernel function. Unfortunately, I don't see how to implement the required dynamic programming with the current KeOps syntax. This would make a great addition to the KeOps API, as optimal alignments are used for many data types (strings, graphs, trees).

See the following references:

Marco Cuturi and Mathieu Blondel. Soft-DTW: a differentiable loss function for time-series. In Proceedings of the International Conference on Machine Learning, 894–903. JMLR.org, 2017.

Mathieu Blondel, Arthur Mensch, and Jean-Philippe Vert. Differentiable divergences between time series. 2020.
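
For reference, the recursion defined in these papers: writing d(x_i, y_j) for the ground cost between time steps (the squared Euclidean distance in the simplest case), Soft-DTW replaces the hard minimum of classical DTW by a soft minimum,

    r(i,j) = d(x_i, y_j) + softmin_gamma( r(i-1,j-1), r(i-1,j), r(i,j-1) ),
    softmin_gamma(a_1, ..., a_k) = -gamma * log( sum_l exp(-a_l / gamma) ),

with boundary conditions r(0,0) = 0 and r(i,0) = r(0,j) = +inf otherwise, and SoftDTW_gamma(x, y) = r(n, m).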

joanglaunes commented 10 months ago

Hi @Mrjoeybux, I agree this would be a nice addition to KeOps. It is slightly challenging because the parallel scheme for DTW differs from the classical block reduction schemes that we have implemented so far, but I think it is still doable, and at some point we could propose DTW and SoftDTW reductions alongside the sum or log-sum-exp reductions that we currently have.

Now, my previous answer assumed you were talking about computing SoftDTW between two (large) time series. Reading your question again, I understand that you might instead want to use a SoftDTW kernel to perform higher-level reductions, where your observations are (hopefully small...) time series. If that is what you meant, then I think it is already doable with the current operations of KeOps, so we can define a SoftDTW kernel quite easily.
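
Concretely, the kind of computation this second reading enables is a reduction of the form

    a_i = sum_j exp( -SoftDTW_gamma(x_i, y_j) ),   i = 1, ..., M,

where each x_i and y_j is itself a (short) time series, and the M-by-N kernel matrix never needs to be stored explicitly, only reduced over.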

Mrjoeybux commented 9 months ago

Thanks for getting back to me. As you say, I am interested in the case in which my observations are small (< 50) time series and I'd like to use a SoftDTW kernel to perform higher level reductions. Do you have an example of how this could be implemented using the current KeOps syntax?

joanglaunes commented 9 months ago

Hello @Mrjoeybux, I have done several experiments; actually, I was too optimistic! I was initially thinking about writing the SoftDTW algorithm as a function acting on LazyTensor objects, like this code:

def SoftDTW_kernel(x, y, gamma):
    # Soft-DTW recursion written directly on LazyTensor slices:
    # r[i][j] is the soft alignment cost between x[..., :i+1] and y[..., :j+1].
    n, m = x.shape[2], y.shape[2]
    r = [[None] * m for _ in range(n)]
    for j in range(m):
        for i in range(n):
            dij = (x[:, :, i] - y[:, :, j]) ** 2
            if i == 0 and j == 0:
                r[i][j] = dij
            elif i == 0:
                r[i][j] = dij + r[i][j-1]
            elif j == 0:
                r[i][j] = dij + r[i-1][j]
            else:
                # soft-min over the three predecessors: -gamma * log( sum exp(-r/gamma) )
                r[i][j] = dij - gamma * sum( (-a / gamma).exp() for a in (r[i-1][j-1], r[i-1][j], r[i][j-1]) ).log()
    return r[n-1][m-1]

This is in theory valid and gives what is expected... but it is completely impractical with the current implementation of KeOps as soon as the sizes of the input time series exceed just 2 or 3! This is because, when we build LazyTensor objects, we rely on simple string concatenation to obtain the kernel formula as a string and pass it to the internal KeOps compilation engine. Here the expression of the formula grows exponentially with the input size, so it is just not possible. With several tricks I have managed to build the internal KeOps object dynamically, so I could get this type of implementation to work for sizes up to around 20, but then other parts of the code slow down the compilation process, so this type of construction cannot be used currently.
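
To make the blow-up concrete, here is a rough sketch (plain Python, not the actual KeOps formula syntax; Square and SoftMin are just placeholder names) of what happens if the recursion is inlined into a single formula string, the way LazyTensor builds its formulas. The length of the string grows exponentially with the sizes n and m:

def formula_string(n, m):
    # inline the SoftDTW recursion into one big expression string
    r = {}
    for j in range(m):
        for i in range(n):
            d = f"Square(x[{i}]-y[{j}])"
            if i == 0 and j == 0:
                r[i, j] = d
            elif i == 0:
                r[i, j] = f"({d}+{r[i, j-1]})"
            elif j == 0:
                r[i, j] = f"({d}+{r[i-1, j]})"
            else:
                prev = (r[i-1, j-1], r[i-1, j], r[i, j-1])
                r[i, j] = f"({d}+SoftMin({','.join(prev)}))"
    return r[n-1, m-1]

for k in (2, 3, 4, 5, 6, 7):
    print(k, len(formula_string(k, k)))  # formula length explodes with k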

So instead, I have started to implement a custom SoftDTW operation in KeOps (which could be included in the API when it is ready). Below is a full test of this implementation. A few remarks about it:

import time

import math
import torch
from pykeops.torch import LazyTensor, Genred
from keopscore.formulas import *
from functools import reduce

# M,N are number of samples xi, yj, 
# D is size of each sample
M, N, D = 1000, 1000, 50

device_id = "cuda:0" if torch.cuda.is_available() else "cpu"

do_warmup = True

x = torch.rand(M, D, device=device_id)
y = torch.rand(N, D, device=device_id)
gamma = torch.tensor(0.1, device=device_id)

##################################
# SoftDTW operation in pytorch
##################################

def softmin(args, gamma):
    # numerically stable soft-min:
    # softmin_gamma(a) = min(a) - gamma * log( sum_i exp((min(a) - a_i) / gamma) )
    minargs = reduce(lambda x,y:torch.min(x,y), args)
    if gamma>0:
        minargs -= gamma * sum(((minargs-arg)/gamma).exp() for arg in args).log()
    return minargs

def SoftDTW_torch(x, y, gamma):
    # Dense Soft-DTW between all pairs (x_i, y_j); rjm1 is a rolling buffer that
    # holds the previous column of the DP table and is progressively overwritten
    # with the current one.
    n, m = x.shape[1], y.shape[1]
    x, y = x[:,None,:], y[None,:,:]
    rjm1 = [torch.tensor(torch.inf, device=device_id) for _ in range(n+1)]
    rjm1[0] = torch.tensor(0., device=device_id)
    torchinf = torch.tensor(torch.inf, device=device_id)
    for j in range(1,m+1):
        rim1j = torchinf  # boundary value r(0, j) = +inf
        for i in range(1,n+1):
            rij = (x[:,:,i-1]-y[:,:,j-1])**2 + softmin((rjm1[i], rjm1[i-1], rim1j), gamma)
            rjm1[i-1] = rim1j
            rim1j = rij
        rjm1[i] = rij
    return rij

##################################
# SoftDTW operation in keops
##################################

from keopscore.formulas.Operation import Operation
from keopscore.utils.misc_utils import KeOps_Error
from keopscore.utils.code_gen_utils import c_variable, pointer, c_array, c_for_loop, c_zero_float

class SoftDTW(Operation):
    string_id = "SoftDTW"
    def __init__(self, x, y, gamma, params=()):
        # x is vector of size n, y is vector of size m, gamma is scalar,
        # output is scalar
        if gamma.dim != 1:
            KeOps_Error("input gamma should be scalar")
        super().__init__(x, y, gamma)
        self.n = x.dim
        self.m = y.dim
        self.dim = 1

    def Op(self, out, table, x, y, gamma):
        dtype = x.dtype
        n,m = self.n, self.m
        code = f"""
            #define MIN2(a,b) fminf(a,b) //(((a)<(b))?(a):(b))
            #define MIN3(a,b,c) MIN2(MIN2(a,b),c)

            {dtype} rjm1[{n}], rim1j, rij, min;

            // j=0, i=0
            rij = {x}[0] - {y}[0];
            rij *= rij;
            rim1j = rij;

            // j=0, i=1...n-1
            #pragma unroll
            for (int i=1; i<{n}; i++)
            {{
                rij = {x}[i] - {y}[0];
                rij *= rij;
                rij += rim1j;
                rjm1[i-1] = rim1j;
                rim1j = rij;
            }}
            rjm1[{n}-1] = rij;

            #pragma unroll
            for (int j=1; j<{m}; j++)
            {{
                // j=1...m-1, i=0
                rij = {x}[0] - {y}[j];
                rij *= rij;
                rij += rjm1[0];
                rim1j = rij;

                #pragma unroll
                for (int i=1; i<{n}; i++)
                {{
                    // j=1...m-1, i=1...n-1
                    rij = {x}[i] - {y}[j];
                    rij *= rij;
                    min = MIN3(rjm1[i-1],rjm1[i],rim1j);
                    rij += min - {gamma}[0] * log( exp((min-rjm1[i-1])/{gamma}[0]) + exp((min-rim1j)/{gamma}[0]) + exp((min-rjm1[i])/{gamma}[0]) );
                    rjm1[i-1] = rim1j;
                    rim1j = rij;
                }}
                rjm1[{n}-1] = rij;
            }}
            {out}[0] = rij;
                """

        return code

    def DiffT(self, v, gradin):
        KeOps_Error("autograd for SoftDTW operation not yet implemented.")
        pass

# Make the new operation visible to the KeOps formula parser, so that the name
# "SoftDTW" used in the Genred formula string below resolves to this class.
import builtins
builtins.SoftDTW = SoftDTW

#########################################
# reduction function with torch and keops
#########################################

def fun_torch(x, y, gamma):
    # brute force: dense M x N matrix of SoftDTW values, then Gaussian-type kernel
    # and sum over j for each sample x_i
    Sxy = SoftDTW_torch(x,y,gamma)
    Kxy = (-Sxy).exp()
    return Kxy.sum(dim=1)

def fun_keops(x, y, gamma):
    # same reduction with KeOps: the kernel matrix stays symbolic
    n,m = x.shape[1], y.shape[1]
    formula = "Exp(-SoftDTW(x,y,gamma))"
    aliases = [f"x=Vi({n})", f"y=Vj({m})", "gamma=Pm(1)"]
    Kxy = Genred(formula, aliases, reduction_op="Sum", axis=1)
    return Kxy(x,y,gamma.view((1,1)))

##################################
# test
##################################

funs = (fun_torch, fun_keops)
out = []
for fun in funs:
    print("**************************")
    print("Testing " + fun.__name__)
    if do_warmup:
        fun(x[:100,:], y[:100,:], gamma)
        fun(x[:100,:], y[:100,:], gamma)
    start = time.time()
    out.append(fun(x, y, gamma).squeeze())
    end = time.time()
    print("time for " + fun.__name__ + ":", end - start)

print("******")

if len(out) > 1:
    for k in range(1,len(out)):
        print(f"relative error {funs[k].__name__} vs {funs[0].__name__}:", (torch.norm(out[0] - out[k]) / torch.norm(out[0])).item())

Mrjoeybux commented 9 months ago

Hi @joanglaunes, I have just tested your approach and it passes the tests. For now, I will go ahead and integrate this into my code. Do you have any idea when this will be implemented in the API?

Regarding your remarks:

Thanks.

Mrjoeybux commented 9 months ago

Hi @joanglaunes, upon further inspection, the above code computes a sum reduction along the columns. Ideally, what I want is a symbolic kernel matrix. I have tried to integrate it into the LazyTensor API with the following method, but the results don't agree with the PyTorch implementation:

# method added to pykeops' LazyTensor class (in lazy_tensor.py):
def softdtw(self, other, gamma):
    return self.ternary(
        other,
        gamma,
        "SoftDTW",
        dimres=1, 
        dimcheck=None,
    )

Any ideas?

joanglaunes commented 9 months ago

Hello @Mrjoeybux, the softdtw method that you wrote seems ok: adding it to lazy_tensor.py and then adding these lines to my test code above:

def fun_lazytensor(x, y, gamma):
    x = LazyTensor(x[:,None,:])
    y = LazyTensor(y[None,:,:])
    sdtw = x.softdtw(y,gamma)
    K = (-sdtw).exp()
    return K.sum(axis=1)

...

funs = (fun_torch, fun_keops, fun_lazytensor)

...

just gives the same results. What was not correct in your case?

About the inclusion in the API: I may try to do it by the end of this week; at least I will have some time to work on it...

Mrjoeybux commented 9 months ago

Oh my bad, I forgot to exponentiate the result...

Everything works well, thanks for your help @joanglaunes!

P.S. I'm happy to make a pull request to include this in the API?

joanglaunes commented 8 months ago

Hello @Mrjoeybux, we just released v2.2 of pykeops, which includes the SoftDTW operation in the API, with the implementation of the gradient as well. I called the operation softdtw_sqdist to highlight the fact that it only handles the squared Euclidean distance so far. Hopefully we can include more general forms of the operation, as well as the original DTW operation, later. So now you can perform reductions over a symbolic matrix of SoftDTW dissimilarities with a few lines of code:

import torch
from pykeops.torch import LazyTensor

M, N, D = 1000, 1500, 50
device_id = "cuda:0" if torch.cuda.is_available() else "cpu"

x = torch.rand(M, 1, D, requires_grad=True, device=device_id)
y = torch.rand(1, N, D, requires_grad=True, device=device_id)
gamma = torch.tensor(0.1, device=device_id)

x_ = LazyTensor(x)
y_ = LazyTensor(y)

# apply SoftDTW operation on LazyTensor:
Sdtw = x_.softdtw_sqdist(y_, gamma)
# apply gaussian kernel over it:
K = (- Sdtw).exp() 

# compute reduction
a = K.sum(dim=1)

# compute gradient wrt x,y
g = torch.autograd.grad((a ** 2).sum(), [x,y])