Oh, I just saw the code doesn't exist... https://github.com/dderiso/gdtw :cry:
Thanks for the reference Eric, I had not seen it before (I had never looked at DTW until a few weeks ago). Their approach is probably more principled than the `transportcost`, which is a completely ad hoc addition, but which seems to do what I wanted :angel:
The barycenter algorithm in this package comes from the authors of the repo I forked and is the simplest DTW barycenter algorithm. It's known to produce weird barycenters (but it's reasonably fast). I'll have a look at the article you linked and see if some of their stuff would fit nicely in here :)
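Roughly, the simple barycenter iteration looks like this (just a sketch of the general DBA-style averaging for equal-length 1-D series, not the actual code in this package; it assumes `dtw(a, b)` returns `(cost, i1, i2)` with the aligned index sequences):

```julia
using DynamicAxisWarping

# Sketch only: simplest DBA-style barycenter iteration, not the package implementation.
# Assumes `dtw(a, b)` returns (cost, i1, i2) where i1/i2 are the aligned index sequences.
function naive_barycenter(series::Vector{<:AbstractVector}; iters = 5)
    bc = copy(series[1])                  # initialize with one of the series
    for _ in 1:iters
        sums   = zeros(length(bc))
        counts = zeros(Int, length(bc))
        for s in series
            _, i1, i2 = dtw(bc, s)        # warping path between barycenter and s
            for (ib, is) in zip(i1, i2)
                sums[ib]   += s[is]       # collect every point matched to index ib
                counts[ib] += 1
            end
        end
        bc = sums ./ max.(counts, 1)      # new barycenter = mean of matched points
    end
    bc
end
```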
I've only recently gotten into DTW too.
I wanted to do a comparison to see if a 50x speedup was a good sign, so I did some simple benchmarks:
```julia
julia> using PyCall, DynamicAxisWarping, BenchmarkTools, Distances

julia> const FASTDTW = pyimport_conda("fastdtw", "fastdtw")
PyObject <module 'fastdtw' from '/home/eric/.julia/conda/3/lib/python3.7/site-packages/fastdtw/__init__.py'>

julia> x = rand(20, 100);

julia> y = rand(20, 100);

julia> r = 100
100

julia> xp = permutedims(x);

julia> yp = permutedims(y);

julia> @btime DynamicAxisWarping.dtw($x, $y, $(SqEuclidean()));
  88.584 μs (11 allocations: 85.33 KiB)

julia> @btime DynamicAxisWarping.fastdtw($x, $y, $(SqEuclidean()), $r);
  88.693 μs (11 allocations: 85.33 KiB)

julia> @btime DynamicAxisWarping.dtw_cost($x, $y, $(SqEuclidean()), $r, s1 = $(zeros(2r+1)), s2 = $(zeros(2r+1)));
  128.094 μs (0 allocations: 0 bytes)

julia> @btime FASTDTW.fastdtw($xp, $yp; radius = $r, dist = $(SqEuclidean()));
  139.339 ms (861612 allocations: 36.06 MiB)

julia> r = 20
20

julia> @btime DynamicAxisWarping.fastdtw($x, $y, $(SqEuclidean()), $r);
  280.930 μs (213 allocations: 179.83 KiB)

julia> @btime DynamicAxisWarping.dtw_cost($x, $y, $(SqEuclidean()), $r, s1 = $(zeros(2r+1)), s2 = $(zeros(2r+1)));
  47.091 μs (0 allocations: 0 bytes)

julia> @btime FASTDTW.fastdtw($xp, $yp; radius = $r, dist = $(SqEuclidean()));
  256.915 ms (1104904 allocations: 46.42 MiB)
```
A couple of things stood out to me: first, we're seeing 3 orders of magnitude of perf improvement in DynamicAxisWarping.jl over the python `fastdtw`! But there's also a ton of allocations there. I wonder if I'm not calling it correctly? (The `permutedims` are to align with the row-majorness of numpy, and are needed to get the same answers.) My understanding is that PyCall shares memory between Julia arrays and numpy arrays, so it should be pretty efficient, I'd think.

Second, I see worse perf in `DynamicAxisWarping.fastdtw` and `FASTDTW.fastdtw` with a smaller radius. Is that an algorithmic problem? It seems surprising, since I thought the naive algorithm should scale like `n*r` (see the sketch at the end of this comment).

Third, `DynamicAxisWarping.dtw_cost` is slower than `DynamicAxisWarping.dtw` at `r = 100`; that seemed strange to me too. (Also, `DynamicAxisWarping.dtw` doesn't accept a radius argument?)
Anyway, if the 3 orders of magnitude of perf gains over the python package are correct, then only a 50x speedup means maybe their implementation is actually quite slow... However, the algorithm could still be fast when implemented well.
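For reference, this is roughly the band-limited recursion I have in mind when I say it should scale like `n*r`: a Sakoe–Chiba-style sketch for 1-D series of equal length, not this package's implementation (it still allocates a full matrix, only the work is O(n*r)):

```julia
# Sketch only: squared-Euclidean DTW restricted to a band of radius r around the diagonal.
function band_dtw(x::AbstractVector, y::AbstractVector, r::Int)
    n = length(x)
    D = fill(Inf, n, n)                       # out-of-band cells stay at Inf
    for i in 1:n, j in max(1, i - r):min(n, i + r)
        c = abs2(x[i] - y[j])
        prev = (i == 1 && j == 1) ? 0.0 :
               min(i > 1 ? D[i-1, j] : Inf,
                   j > 1 ? D[i, j-1] : Inf,
                   i > 1 && j > 1 ? D[i-1, j-1] : Inf)
        D[i, j] = c + prev
    end
    D[n, n]                                   # total alignment cost
end
```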
Oh, thanks for the benchmarks, nice to see we're doing well :) A few things come to mind:

- Regarding the `fastdtw` implementation, it's possible it can be improved.
- `dtw_cost` should be faster than `dtw` if the radius is small in relation to the length of the series. If the radius is comparable to the length of the series, I'm not surprised if `dtw` is as fast or faster, because it uses SIMD to calculate all distances in a batch very efficiently (for >1 dim like your example, BLAS is used). If you need a very large radius, you might actually need a sliding distance à la `dtwnn`.
- `dtw` is much more flexible than `dtw_cost` and accepts arbitrary bounds on the warping path. To construct simple radius bounds, try

```julia
imin, imax = radiuslimits(20, 100, 100); plot([imin imax])
dtw(x, y, dist, imin, imax) # cost equivalent to dtw_cost(x, y, dist, 20)
```
Note also that some implementations normalize the input series before calculating the distance; by default, we do not do that here.
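If you want to match that behavior when comparing numbers, something along these lines would do it (a quick sketch, not part of the package):

```julia
using Statistics

# z-normalize each series before computing DTW (per dimension for matrices); sketch only.
# e.g. dtw(znorm(x), znorm(y), SqEuclidean())
znorm(x::AbstractVector) = (x .- mean(x)) ./ (std(x) + eps())
znorm(x::AbstractMatrix) = (x .- mean(x, dims = 2)) ./ (std(x, dims = 2) .+ eps())
```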
Ah, thanks very much for the clarifications. I didn't realize that difference with `dtw`, but that makes sense; I see it uses the pairwise methods defined in Distances. And great job on the performance btw!

Btw, should https://github.com/baggepinnen/DynamicAxisWarping.jl/blob/403060431f53bd8e2377855ac80d31042d074658/src/dtw.jl#L28 have a transpose instead of an adjoint? Probably complex-valued data isn't super common, but you never know what'll be in an `AbstractVector` :).
Thanks :)

> have a transpose instead of an adjoint?

Good catch! I'm not often a complex kind-of guy, but we should definitely do the right thing here.

> Probably complex-valued data isn't super common

True, but I like to think about it like this: if you do have an exotic number type, you are probably going to look at a Julia implementation, because what options do you really have? :smile:
True :). I've run into this exact issue before as well (on Convex.jl, which was updated for complex-number support during a GSoC a couple of years ago but missed an adjoint somewhere, leading to lots of confusing bugs). Looks like there are some extra adjoints in Distances.jl too; I'll try to file an issue or PR next week.
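Just for the record, a tiny illustration of why the distinction matters (my own example, nothing package-specific):

```julia
julia> v = [1.0 + 2.0im];

julia> v' * v            # adjoint conjugates: |1 + 2im|^2 = 5
5.0 + 0.0im

julia> transpose(v) * v  # transpose does not: (1 + 2im)^2 = -3 + 4im
-3.0 + 4.0im
```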
By the way, I've been using the following locally for fast zero-allocation squared-Euclidean DTW, modified from this package and Distances.jl, for computing DTW distances between a bunch of time series that are all the same size. It's a bit faster than `dtw` while still using BLAS. Putting caches in the distance object feels like a nice pattern, since you can just create as many distance objects as you have threads to make it multithreaded. I'm not sure it's worth putting in DynamicAxisWarping though, since you might not want to have to make a new distance object for each length of time series. (I've only tested it on matrices also.) But I figured I'd put it here in case it was useful.
```julia
using LinearAlgebra, Distances

lastlength(x) = size(x, ndims(x))

# allocation-free version of `out = sum(abs2, in, dims=1)`
function sum_abs2_cols!(out, in)
    @inbounds for j = 1:size(in, 2)
        outj = zero(eltype(out))
        for i = 1:size(in, 1)
            outj += abs2(in[i, j])
        end
        out[j] = outj
    end
    nothing
end

# in-place pairwise squared-Euclidean distances between the columns of `a` and `b`,
# using the expansion |a-b|^2 = |a|^2 + |b|^2 - 2*a'b so the cross term goes through BLAS
function pairwise_sq_euclidean!(r::AbstractMatrix, sa2::AbstractVector, sb2::AbstractVector,
                                a::AbstractMatrix, b::AbstractMatrix)
    mul!(r, transpose(a), b)
    sum_abs2_cols!(sa2, a)
    sum_abs2_cols!(sb2, b)
    for j = 1:size(r, 2)
        sb = sb2[j]
        @simd for i = 1:size(r, 1)
            @inbounds r[i, j] = max(sa2[i] + sb - 2 * r[i, j], 0)
        end
    end
    nothing
end

function dtw_cost_matrix!(D::AbstractMatrix{T}, sa2::AbstractVector{T}, sb2::AbstractVector{T},
                          seq1::AbstractArray{T}, seq2::AbstractArray{T}) where T
    m = lastlength(seq2)
    n = lastlength(seq1)
    # Build the local cost matrix
    pairwise_sq_euclidean!(D, sa2, sb2, seq2, seq1)
    # pairwise!(D, dist, seq2, seq1, dims=2)
    @assert size(D) == (m, n)
    # Initialize first column and first row
    @inbounds for r = 2:m
        D[r, 1] += D[r-1, 1]
    end
    @inbounds for c = 2:n
        D[1, c] += D[1, c-1]
    end
    # Complete the cost matrix
    @inbounds for c = 2:n
        for r = 2:m
            D[r, c] += min(D[r-1, c], D[r-1, c-1], D[r, c-1])
        end
    end
    nothing
end

# Holds caches
struct DTW{T} <: Distances.SemiMetric
    D::Matrix{T}
    sa2::Vector{T}
    sb2::Vector{T}
end
DTW{T}(n) where {T} = DTW{T}(zeros(T, n, n), zeros(T, n), zeros(T, n))
DTW(n) = DTW{Float32}(n)

function (dist::DTW)(a, b)
    dtw_cost_matrix!(dist.D, dist.sa2, dist.sb2, a, b)
    dist.D[end, end]
end
```
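And a sketch of the multithreaded usage I had in mind, building on the struct above (one distance object per thread so the caches don't clash; `series`, the `:static` scheduling, and the whole function are just for illustration):

```julia
using Base.Threads

# Illustrative only: all-pairs DTW costs over equal-size series, one cache per thread.
function pairwise_dtw(series::Vector{Matrix{Float32}})
    n = lastlength(series[1])
    dists = [DTW(n) for _ in 1:nthreads()]   # one cache per thread
    K = length(series)
    out = zeros(Float32, K, K)
    @threads :static for i in 1:K            # :static keeps threadid() fixed per iteration
        d = dists[threadid()]
        for j in i+1:K
            out[i, j] = out[j, i] = d(series[i], series[j])
        end
    end
    out
end
```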
This code appears to be some 10% faster when run single threaded, but I can imagine it gets quite a bit faster if run with multiple threads since the allocations go away. I'm curious though, what application are you using this on that does not warrant any restrictions at all on the warping path?
Good question, I'm not really sure the application (comparing spikes in EEG signals) warrants no restrictions at all, but I want to try out different methods and figured I'd start without restrictions. I think some regularization would make sense (which is why I'm a bit interested in this GDTW paper).
Ah, I recently read some paper where they did more or less that. I think they manually aligned the peaks of the heartbeats first and then used some radius-limited or diamond-limited DTW distance. I guess it doesn't make much sense comparing two series with different numbers of heartbeats? It feels like restricting the warping path, at least enough so that two beats cannot be matched to one etc., could make sense and speed it up quite significantly.
So heartbeats is probably ECG (electrocardiogram); I'm looking at EEGs (electroencephalogram), so brain instead of heart, and unfortunately no heartbeats to align. But the segments of recording I'm looking at currently are quite short (like my benchmark example) and centered on features, so it's more about matching up features, and I'm not sure restrictions on the path will give a big speedup (by those benchmarks above). But such restrictions might help get better matches. I am also thinking about underlying distances other than Euclidean; I feel like a (time-domain) optimal transport distance with the right cost matrix might be more meaningful.
Ah my bad, I often confuse the two ^^ I have experimented a bit with combining DTW and OT, here's an example and here's another one in the form of a notebook.
One issue I have is with signals that are not really measures: I have to perform some normalization before applying OT. If it's just normalizing the mass, I can live with that, or apply unbalanced transport, but if the signals vary around 0, one would have to offset the signals to be positive and then possibly normalize, and I feel like this is a much bigger hack. My application is spectrograms, which are positive, but not when looked at in log-power. For such signals, I have to decide on a "floor", offset the spectrogram by it, and truncate to 0 below it.
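Concretely, the hack looks something like this (a sketch with an arbitrary `floor` parameter, nothing principled):

```julia
# Turn a log-power spectrum/signal into something measure-like: offset by a chosen
# floor, truncate negative values to zero, and normalize the mass to 1. Sketch only.
function to_measure(x::AbstractVector; floor = minimum(x))
    y = max.(x .- floor, 0)
    s = sum(y)
    s == 0 ? fill(1 / length(y), length(y)) : y ./ s
end
```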
I don't know much about EEG, but I can imagine you would face similar issues here?
Thanks for the links! Yes, I have that issue too, they are zero-centered. So maybe OT isn't really the right approach...
Hi @baggepinnen,

I came across this paper today, https://arxiv.org/abs/1905.12893, and thought it was quite nice; it seemed like a sensible framework for introducing regularization similar to your `transportcost` parameter. I'm also not quite sure how their "time centered mean" corresponds to your barycenters. They also claim 50x speed-ups compared to the python FastDTW package, but I'm not sure it would be faster than your code here. I thought I'd mention it here in case it was interesting to you as well.

Cheers, Eric