aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License

[ENH] Make RDST use online shapelet distance #894

Open baraline opened 10 months ago

baraline commented 10 months ago

Describe the feature or idea you want to propose

Similar to the online shapelet distance function in the shapelet transform, the dilated shapelet transform could use an "online" computation of the distance vector to avoid recomputing most parts of the distances.

Describe your proposed solution

The proposed solution may take inspiration from the sliding mean and std computation already implemented for dilated subsequences here.
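For illustration, a rolling-sum computation of sliding means and stds over dilated subsequences might look like the sketch below. This is a minimal sketch only, not the aeon implementation the link refers to; the function name and signature are hypothetical, with l the subsequence length and d the dilation.

import numpy as np

def dilated_sliding_mean_std(X, l, d):
    # Sketch: mean/std of every dilated subsequence
    # X[i], X[i+d], ..., X[i+(l-1)*d], updated in O(1) per position.
    n = X.shape[0]
    n_sub = n - (l - 1) * d  # number of valid start positions
    means = np.empty(n_sub)
    stds = np.empty(n_sub)
    # Start positions i and i + d share all but two values, so keep
    # one pair of rolling sums per phase (i mod d).
    for phase in range(min(d, n_sub)):
        idx = phase + d * np.arange(l)
        s = X[idx].sum()
        s2 = (X[idx] ** 2).sum()
        for i in range(phase, n_sub, d):
            means[i] = s / l
            stds[i] = np.sqrt(max(s2 / l - means[i] ** 2, 0.0))
            if i + d < n_sub:
                # Slide the window by d: drop X[i], add X[i + l*d].
                s += X[i + l * d] - X[i]
                s2 += X[i + l * d] ** 2 - X[i] ** 2
    return means, stds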

Describe alternatives you've considered, if relevant

No response

Additional context

No response

TonyBagnall commented 9 months ago

I've just been looking at this, and I'm not sure a) if the extra computation is worth the saving and b) whether the way we are doing this currently is the most efficient.

I'm doing some work with STC at the moment, so I will report back here when I get to the early abandoning.

baraline commented 9 months ago

Hi Tony,

Just to be sure: the online shapelet distance is the algorithmic equivalent of a rolling mean and std computation (i.e. keeping a rolling sum), so that for each new comparison only the last element needs to be computed instead of the whole comparison, right?

I don't see how doing this could result in a less efficient solution (maybe I'm missing some obvious detail, though). Could we discuss this at the next aeon dev meeting?

EDIT: never mind, I just sketched it out and it is a bit less obvious than I was thinking!
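To make the subtlety concrete, here is a minimal numpy check (my sketch, not code from the thread) of the standard decomposition dist(i) = ||X[i:i+l]||^2 - 2<X[i:i+l], S> + ||S||^2. The window norm admits a rolling-sum update, but the cross term changes entirely at each shift, which is exactly what the convolution approach further down computes in one pass.

import numpy as np

rng = np.random.default_rng(0)
X, S = rng.random(50), rng.random(8)
l = S.shape[0]
i = 3  # any valid window start

# Squared distance between window X[i:i+l] and S ...
d_direct = np.sum((X[i:i + l] - S) ** 2)
# ... equals window norm - 2 * cross term + shapelet norm.
d_expanded = np.sum(X[i:i + l] ** 2) - 2 * np.dot(X[i:i + l], S) + np.sum(S ** 2)
assert np.isclose(d_direct, d_expanded)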

baraline commented 9 months ago

I think the first thing we should study is how we want to compute the distance; then we can design the early abandoning and pruning optimizations. In the setting where we have a shapelet S and a time series X, we compute the distance vector between the two (we won't really need to store it to extract statistics from it). Looking at a quick benchmark, we have two cases, small datasets and "big" datasets:

[Figure ex_benchmark: average time to compute the distance vector between X and a shapelet (length 10% of X) for the three approaches, X sizes 100 to 100000, log scale]

For small ones, using the (squared) Euclidean distance from the distance module gives the best results; for bigger series the convolution approach is faster (i.e. the Mueen approach for normalized series also works for non-normalized ones).

From what I understand of the online shapelet distance code, you keep rolling sums for the means and stds, and look for best matches (min) in two directions (left/right) from a given position. You also sort the shapelet indices in the hope of triggering the early abandoning as early as possible.

As you mentioned, the first thing we need to figure out is whether all those tricks are worth it compared to the "non-optimized" approaches (convolutions or simple distance). Note that in the non-normalized case, the convolution could directly be used as a lower bound for pruning, but it may not be worth it, as the cost of the remaining computations is negligible.
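For reference, a minimal sketch of the early abandoning trick being discussed: accumulate the squared differences and stop as soon as the partial sum can no longer beat the best distance found so far. The function name and the np.inf return convention are illustrative choices, not aeon's actual code.

import numpy as np
from numba import njit

@njit()
def early_abandon_squared_distance(W, S, best_so_far):
    # Squared Euclidean distance between a window W and shapelet S,
    # abandoned once the partial sum exceeds best_so_far.
    d = 0.0
    for j in range(S.shape[0]):
        d += (W[j] - S[j]) ** 2
        if d >= best_so_far:
            return np.inf  # abandoned: cannot improve on the best match
    return d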

I'll try to come up with a benchmark of the possible approaches in the future. I'll be glad to hear if you change anything in the early abandoning / pruning tricks.
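Related to the point above about not needing to store the distance vector: a naive sketch (no early abandoning or pruning) showing that the RDST features, assuming these are min, argmin and the number of positions below the threshold lambda, can be extracted on the fly. The name `online_rdst_features` and the `threshold` parameter standing in for lambda are hypothetical.

import numpy as np
from numba import njit

@njit()
def online_rdst_features(X, S, threshold):
    # Stream over the distance vector, keeping only the running
    # min, its position, and the count of values below threshold.
    n, l = X.shape[0], S.shape[0]
    best, best_idx, count = np.inf, -1, 0
    for i in range(n - l + 1):
        d = 0.0
        for j in range(l):
            d += (X[i + j] - S[j]) ** 2
        if d < best:
            best, best_idx = d, i
        if d < threshold:
            count += 1
    return best, best_idx, count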

Code to reproduce the graph:

# Benchmark of three ways to compute the distance vector between a
# time series X and a shapelet S. Run in IPython/Jupyter: the timing
# loop below uses the %timeit magic.
import numpy as np
from numba import njit
from aeon.distances import squared_distance
from scipy.spatial import distance
from scipy.signal import convolve

#########################################################

@njit()
def shapelet_distance_profile(X, S):
    # Naive approach: one squared Euclidean distance call per window.
    n = X.shape[0]
    l = S.shape[0]
    dist_profile = np.zeros(n - l + 1)
    for i in range(n - l + 1):
        dist_profile[i] = squared_distance(X[i:i + l], S)
    return dist_profile

#########################################################

@njit(fastmath=True)
def _convolve_shapelet_distance_profile(XS, X, S):
    # Expand dist[i] = ||X[i:i+l] - S||^2
    #               = ||X[i:i+l]||^2 - 2 <X[i:i+l], S> + ||S||^2.
    # XS holds the cross terms <X[i:i+l], S> from the convolution;
    # the window norms are maintained with a rolling sum.
    l = S.shape[0]
    XS = -2 * XS
    XS += np.sum(S ** 2)

    _sum2 = 0.0
    for j in range(l):
        _sum2 += X[j] ** 2

    XS[0] += _sum2
    # The rolling update is sequential, hence a plain range loop.
    for i in range(1, XS.shape[0]):
        _sum2 += X[i + (l - 1)] ** 2 - X[i - 1] ** 2
        XS[i] += _sum2
    return XS

def convolve_shapelet_distance_profile(X, S):
    # Convolving X with the reversed shapelet gives the cross
    # correlation <X[i:i+l], S> at every valid position.
    XS = convolve(np.flipud(S), X, mode='valid')
    return _convolve_shapelet_distance_profile(XS, X, S)

#########################################################

def rolling_window_stride_trick(X, window):
    # View of all sliding windows of X, without copying the data.
    a = np.asarray(X)
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def cdist_strides_distance(X, S):
    # Pairwise squared Euclidean distances between the windows of X
    # and S (S yields a single "window": itself).
    l = S.shape[0]
    X_strides = rolling_window_stride_trick(X, l)
    S_strides = rolling_window_stride_trick(S, l)
    return distance.cdist(X_strides, S_strides, 'sqeuclidean').flatten()

#########################################################

# Sanity check: the three implementations agree on a small example.
S = np.random.rand(10)
X = np.random.rand(100)

d1 = shapelet_distance_profile(X, S)
d2 = cdist_strides_distance(X, S)
d3 = convolve_shapelet_distance_profile(X, S)

assert np.allclose(d1, d2)
assert np.allclose(d1, d3)
assert np.allclose(d2, d3)
times1 = []
times2 = []
times3 = []
n_range = [100, 500, 1000, 5000, 10000, 50000, 100000]
for n in n_range:
    # Shapelet length is 10% of the series length.
    S = np.random.rand(int(n * 0.1))
    X = np.random.rand(n)
    r1 = %timeit -o shapelet_distance_profile(X, S)
    r2 = %timeit -o cdist_strides_distance(X, S)
    r3 = %timeit -o convolve_shapelet_distance_profile(X, S)
    times1.append(r1.average)
    times2.append(r2.average)
    times3.append(r3.average)

from matplotlib import pyplot as plt

plt.figure(figsize=(10, 5))
plt.yscale("log")
plt.plot(times1, label='shapelet distance')
plt.plot(times2, label='cdist strides distance')
plt.plot(times3, label='convolve distance')
plt.xticks(ticks=range(len(n_range)), labels=n_range)
plt.xlabel("X size")
plt.ylabel("Average execution time (s)")
plt.title("Time to compute distance vector between time series and shapelet (10% of X size)")
plt.legend()
plt.show()