Oh, I just saw the code doesn't exist... https://github.com/dderiso/gdtw :cry:
Thanks for the reference Eric, I had not seen it before (I had never looked at DTW until a few weeks ago). Their approach is probably more principled than the `transportcost`, which is a completely ad hoc addition, but which seems to do what I wanted :angel:
The barycenter algorithm in this package comes from the authors of the repo I forked and is the simplest DTW barycenter algorithm. It's known to produce weird barycenters (but it's reasonably fast). I'll have a look at the article you linked and see if some of their stuff would fit nicely in here :)
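Roughly, the simple barycenter iteration looks like this (just a sketch of the general DBA-style averaging for equal-length 1-D series, not the actual code in this package; it assumes `dtw(a, b)` returns `(cost, i1, i2)` with the aligned index sequences):

```julia
using DynamicAxisWarping

# Sketch only: simplest DBA-style barycenter iteration, not the package implementation.
# Assumes `dtw(a, b)` returns (cost, i1, i2) where i1/i2 are the aligned index sequences.
function naive_barycenter(series::Vector{<:AbstractVector}; iters = 5)
    bc = copy(series[1])                  # initialize with one of the series
    for _ in 1:iters
        sums   = zeros(length(bc))
        counts = zeros(Int, length(bc))
        for s in series
            _, i1, i2 = dtw(bc, s)        # warping path between barycenter and s
            for (ib, is) in zip(i1, i2)
                sums[ib]   += s[is]       # collect every point matched to index ib
                counts[ib] += 1
            end
        end
        bc = sums ./ max.(counts, 1)      # new barycenter = mean of matched points
    end
    bc
end
```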
I've only recently gotten into DTW too.
I wanted to do a comparison to see if a 50x speedup was a good sign, so I did some simple benchmarks:
```julia
julia> using PyCall, DynamicAxisWarping, BenchmarkTools, Distances

julia> const FASTDTW = pyimport_conda("fastdtw", "fastdtw")
PyObject <module 'fastdtw' from '/home/eric/.julia/conda/3/lib/python3.7/site-packages/fastdtw/__init__.py'>

julia> x = rand(20, 100);

julia> y = rand(20, 100);

julia> r = 100
100

julia> xp = permutedims(x);

julia> yp = permutedims(y);

julia> @btime DynamicAxisWarping.dtw($x, $y, $(SqEuclidean()));
  88.584 μs (11 allocations: 85.33 KiB)

julia> @btime DynamicAxisWarping.fastdtw($x, $y, $(SqEuclidean()), $r);
  88.693 μs (11 allocations: 85.33 KiB)

julia> @btime DynamicAxisWarping.dtw_cost($x, $y, $(SqEuclidean()), $r, s1 = $(zeros(2r+1)), s2 = $(zeros(2r+1)));
  128.094 μs (0 allocations: 0 bytes)

julia> @btime FASTDTW.fastdtw($xp, $yp; radius = $r, dist = $(SqEuclidean()));
  139.339 ms (861612 allocations: 36.06 MiB)

julia> r = 20
20

julia> @btime DynamicAxisWarping.fastdtw($x, $y, $(SqEuclidean()), $r);
  280.930 μs (213 allocations: 179.83 KiB)

julia> @btime DynamicAxisWarping.dtw_cost($x, $y, $(SqEuclidean()), $r, s1 = $(zeros(2r+1)), s2 = $(zeros(2r+1)));
  47.091 μs (0 allocations: 0 bytes)

julia> @btime FASTDTW.fastdtw($xp, $yp; radius = $r, dist = $(SqEuclidean()));
  256.915 ms (1104904 allocations: 46.42 MiB)
```
A couple of things stood out to me: first, we're seeing 3 orders of magnitude of perf improvement in DynamicAxisWarping.jl over the python `fastdtw`! But there's also a ton of allocations there. I wonder if I'm not calling it correctly? (The `permutedims` are to align with the row-majorness of numpy, and are needed to get the same answers.) My understanding is that PyCall shares memory between Julia arrays and numpy arrays, so it should be pretty efficient, I'd think.

Second, I see worse perf in `DynamicAxisWarping.fastdtw` and `FASTDTW.fastdtw` with a smaller radius. Is that an algorithmic problem? It seems surprising, since I thought the naive algorithm should scale like `n*r` (see the sketch at the end of this comment).

Third, `DynamicAxisWarping.dtw_cost` is slower than `DynamicAxisWarping.dtw` at `r = 100`; that seemed strange to me too. (Also, `DynamicAxisWarping.dtw` doesn't accept a radius argument?)
Anyway, if the 3 orders of magnitude of perf gains over the python package are correct, then only a 50x speedup means maybe their implementation is actually quite slow... However, the algorithm could still be fast when implemented well.
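For reference, this is roughly the band-limited recursion I have in mind when I say it should scale like `n*r`: a Sakoe–Chiba-style sketch for 1-D series of equal length, not this package's implementation (it still allocates a full matrix, only the work is O(n*r)):

```julia
# Sketch only: squared-Euclidean DTW restricted to a band of radius r around the diagonal.
function band_dtw(x::AbstractVector, y::AbstractVector, r::Int)
    n = length(x)
    D = fill(Inf, n, n)                       # out-of-band cells stay at Inf
    for i in 1:n, j in max(1, i - r):min(n, i + r)
        c = abs2(x[i] - y[j])
        prev = (i == 1 && j == 1) ? 0.0 :
               min(i > 1 ? D[i-1, j] : Inf,
                   j > 1 ? D[i, j-1] : Inf,
                   i > 1 && j > 1 ? D[i-1, j-1] : Inf)
        D[i, j] = c + prev
    end
    D[n, n]                                   # total alignment cost
end
```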
Oh, thanks for the benchmarks, nice to see we're doing well :) A few things come to mind:

- Regarding the `fastdtw` implementation, it's possible it can be improved.
- `dtw_cost` should be faster than `dtw` if the radius is small in relation to the length of the series. If the radius is comparable to the length of the series, I'm not surprised if `dtw` is as fast or faster, because it uses SIMD to calculate all distances in a batch very efficiently (for >1 dim like your example, BLAS is used). If you need a very large radius, you might actually need a sliding distance à la `dtwnn`.
- `dtw` is much more flexible than `dtw_cost` and accepts arbitrary bounds on the warping path. To construct simple radius bounds, try

```julia
imin, imax = radiuslimits(20, 100, 100); plot([imin imax])
dtw(x, y, dist, imin, imax) # cost equivalent to dtw_cost(x, y, dist, 20)
```
Note also that some implementations normalize the input series before calculating the distance; by default, we do not do that here.
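If you want to match that behavior when comparing numbers, something along these lines would do it (a quick sketch, not part of the package):

```julia
using Statistics

# z-normalize each series before computing DTW (per dimension for matrices); sketch only.
# e.g. dtw(znorm(x), znorm(y), SqEuclidean())
znorm(x::AbstractVector) = (x .- mean(x)) ./ (std(x) + eps())
znorm(x::AbstractMatrix) = (x .- mean(x, dims = 2)) ./ (std(x, dims = 2) .+ eps())
```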
Ah, thanks very much for the clarifications. I didn't realize that difference with `dtw`, but that makes sense; I see it uses the pairwise methods defined in Distances. And great job on the performance btw!

Btw, should https://github.com/baggepinnen/DynamicAxisWarping.jl/blob/403060431f53bd8e2377855ac80d31042d074658/src/dtw.jl#L28 have a transpose instead of an adjoint? Probably complex-valued data isn't super common, but you never know what'll be in an `AbstractVector` :).
Thanks :)

> have a transpose instead of an adjoint?

Good catch! I'm not often a complex kind-of guy, but we should definitely do the right thing here.

> Probably complex-valued data isn't super common

True, but I like to think about it like this: if you do have an exotic number type, you are probably going to look at a Julia implementation, because what options do you really have? :smile:
True :). I've run into this exact issue before as well (on Convex.jl, which was updated for complex-number support during a GSoC a couple of years ago but missed an adjoint somewhere, leading to lots of confusing bugs). Looks like there are some extra adjoints in Distances.jl too; I'll try to file an issue or PR next week.
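Just for the record, a tiny illustration of why the distinction matters (my own example, nothing package-specific):

```julia
julia> v = [1.0 + 2.0im];

julia> v' * v            # adjoint conjugates: |1 + 2im|^2 = 5
5.0 + 0.0im

julia> transpose(v) * v  # transpose does not: (1 + 2im)^2 = -3 + 4im
-3.0 + 4.0im
```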
By the way, I've been using the following locally for fast zero-allocation squared-Euclidean DTW, modified from this package and Distances.jl, for computing DTW distances between a bunch of time series that are all the same size. It's a bit faster than `dtw` while still using BLAS. Putting caches in the distance object feels like a nice pattern, since you can just create as many distance objects as you have threads to make it multithreaded. I'm not sure it's worth putting in DynamicAxisWarping though, since you might not want to have to make a new distance object for each length of time series. (I've only tested it on matrices also.) But I figured I'd put it here in case it was useful.
```julia
using LinearAlgebra, Distances

lastlength(x) = size(x, ndims(x))

# allocation-free version of `out = sum(abs2, in, dims=1)`
function sum_abs2_cols!(out, in)
    @inbounds for j = 1:size(in, 2)
        outj = zero(eltype(out))
        for i = 1:size(in, 1)
            outj += abs2(in[i, j])
        end
        out[j] = outj
    end
    nothing
end

# in-place pairwise squared-Euclidean distances between the columns of `a` and `b`,
# using the expansion |a-b|^2 = |a|^2 + |b|^2 - 2*a'b so the cross term goes through BLAS
function pairwise_sq_euclidean!(r::AbstractMatrix, sa2::AbstractVector, sb2::AbstractVector,
                                a::AbstractMatrix, b::AbstractMatrix)
    mul!(r, transpose(a), b)
    sum_abs2_cols!(sa2, a)
    sum_abs2_cols!(sb2, b)
    for j = 1:size(r, 2)
        sb = sb2[j]
        @simd for i = 1:size(r, 1)
            @inbounds r[i, j] = max(sa2[i] + sb - 2 * r[i, j], 0)
        end
    end
    nothing
end

function dtw_cost_matrix!(D::AbstractMatrix{T}, sa2::AbstractVector{T}, sb2::AbstractVector{T},
                          seq1::AbstractArray{T}, seq2::AbstractArray{T}) where T
    m = lastlength(seq2)
    n = lastlength(seq1)
    # Build the local cost matrix
    pairwise_sq_euclidean!(D, sa2, sb2, seq2, seq1)
    # pairwise!(D, dist, seq2, seq1, dims=2)
    @assert size(D) == (m, n)
    # Initialize first column and first row
    @inbounds for r = 2:m
        D[r, 1] += D[r-1, 1]
    end
    @inbounds for c = 2:n
        D[1, c] += D[1, c-1]
    end
    # Complete the cost matrix
    @inbounds for c = 2:n
        for r = 2:m
            D[r, c] += min(D[r-1, c], D[r-1, c-1], D[r, c-1])
        end
    end
    nothing
end

# Holds caches
struct DTW{T} <: Distances.SemiMetric
    D::Matrix{T}
    sa2::Vector{T}
    sb2::Vector{T}
end
DTW{T}(n) where {T} = DTW{T}(zeros(T, n, n), zeros(T, n), zeros(T, n))
DTW(n) = DTW{Float32}(n)

function (dist::DTW)(a, b)
    dtw_cost_matrix!(dist.D, dist.sa2, dist.sb2, a, b)
    dist.D[end, end]
end
```
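And a sketch of the multithreaded usage I had in mind, building on the struct above (one distance object per thread so the caches don't clash; `series`, the `:static` scheduling, and the whole function are just for illustration):

```julia
using Base.Threads

# Illustrative only: all-pairs DTW costs over equal-size series, one cache per thread.
function pairwise_dtw(series::Vector{Matrix{Float32}})
    n = lastlength(series[1])
    dists = [DTW(n) for _ in 1:nthreads()]   # one cache per thread
    K = length(series)
    out = zeros(Float32, K, K)
    @threads :static for i in 1:K            # :static keeps threadid() fixed per iteration
        d = dists[threadid()]
        for j in i+1:K
            out[i, j] = out[j, i] = d(series[i], series[j])
        end
    end
    out
end
```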
This code appears to be some 10% faster when run single threaded, but I can imagine it gets quite a bit faster if run with multiple threads since the allocations go away. I'm curious though, what application are you using this on that does not warrant any restrictions at all on the warping path?
Good question, I'm not really sure the application (comparing spikes in EEG signals) warrants no restrictions at all, but I want to try out different methods and figured I'd start without restrictions. I think some regularization would make sense (which is why I'm a bit interested in this GDTW paper).
Ah, I recently read some paper where they did more or less that. I think they manually aligned the peaks of the heartbeats first and then used some radius-limited or diamond-limited DTW distance. I guess it doesn't make much sense comparing two series with different numbers of heartbeats? It feels like restricting the warping path, at least enough so that two beats cannot be matched to one etc., could make sense and speed it up quite significantly.
So heartbeats is probably ECG (electrocardiogram); I'm looking at EEGs (electroencephalogram), so brain instead of heart, and unfortunately no heartbeats to align. But the segments of recording I'm looking at currently are quite short (like my benchmark example) and centered on features, so it's more about matching up features, and I'm not sure restrictions on the path will give a big speedup (by those benchmarks above). But such restrictions might help get better matches. I am also thinking about underlying distances other than Euclidean; I feel like a (time-domain) optimal transport distance with the right cost matrix might be more meaningful.
Ah my bad, I often confuse the two ^^ I have experimented a bit with combining DTW and OT, here's an example and here's another one in the form of a notebook.
One issue I have is with signals that are not really measures: I have to perform some normalization before applying OT. If it's just normalizing the mass, I can live with that, or apply unbalanced transport, but if the signals vary around 0, one would have to offset the signals to be positive and then possibly normalize, and I feel like this is a much bigger hack. My application is spectrograms, which are positive, but not when looked at in log-power. For such signals, I have to decide on a "floor", offset the spectrogram by it, and truncate to 0 below it.
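Concretely, the hack looks something like this (a sketch with an arbitrary `floor` parameter, nothing principled):

```julia
# Turn a log-power spectrum/signal into something measure-like: offset by a chosen
# floor, truncate negative values to zero, and normalize the mass to 1. Sketch only.
function to_measure(x::AbstractVector; floor = minimum(x))
    y = max.(x .- floor, 0)
    s = sum(y)
    s == 0 ? fill(1 / length(y), length(y)) : y ./ s
end
```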
I don't know much about EEG, but I can imagine you would face similar issues here?
Thanks for the links! Yes, I have that issue too, they are zero-centered. So maybe OT isn't really the right approach...
Hi @baggepinnen,

I came across this paper today, https://arxiv.org/abs/1905.12893, and thought it was quite nice; it seemed like a sensible framework for introducing regularization similar to your `transportcost` parameter. I'm also not quite sure how their "time centered mean" corresponds to your barycenters. They also claim 50x speed-ups compared to the python FastDTW package, but I'm not sure it would be faster than your code here. I thought I'd mention it here in case it was interesting to you as well.

Cheers, Eric