GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Metrics: Soft dynamic-time-warping divergence for FDataGrid #412

Open Clej opened 2 years ago

Clej commented 2 years ago

Content

I'm proposing the soft dynamic-time-warping (SDTW) divergence as a metric between FDataGrid objects. The domain dimension (time) must be one; the codomain dimension can be greater than one. I use it in my own work, and I noted that #363 asked for a classical DTW implementation.

This metric is a smooth version of DTW, which is defined through a min operator and depends on a pre-defined alignment cost, i.e., a discrepancy measure between all pairs (x(t_i), y(t_j)), e.g., the squared Euclidean cost; roughly speaking, a Gram matrix. In SDTW the min operator is replaced by the soft-min (a negative log-sum-exp-like function), whose smoothness is controlled by gamma. The SDTW divergence is positive, symmetric, and indiscernible (SDTW(X, Y) = 0 <=> X = Y), but does not satisfy the triangle inequality (this depends on the alignment cost). See "Differentiable Divergences Between Time Series", Blondel et al., AISTATS 2021. It can therefore be used like a distance for clustering, k-NN classification, etc.
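To illustrate, a minimal sketch of the soft-min operator in plain NumPy/SciPy (not the PR's code; `softmin` is a hypothetical helper name):

```python
import numpy as np
from scipy.special import logsumexp

def softmin(values, gamma):
    """Soft minimum: -gamma * log(sum_i exp(-values_i / gamma)).

    Tends to min(values) as gamma -> 0; larger gamma gives a smoother,
    more blurred minimum.
    """
    return -gamma * logsumexp(-np.asarray(values) / gamma)
```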

The main advantages of the SDTW divergence are:

- it is differentiable everywhere, with smoothness controlled by gamma, unlike classical DTW;
- unlike plain soft-DTW, it is positive, symmetric, and indiscernible, so it can safely be used as a distance for downstream tasks.

Implementation

Since SDTW is defined through the soft-min operator, computing the SDTW divergence between two FDataGrid objects X and Y requires solving a minimization problem, which we do with a dynamic-programming recursion in O(mn) time, where m and n are the numbers of grid points of X and Y, respectively. I implemented the recursion in Cython in sdtw_fast.pyx, and added a setup.py in ~/misc/metrics to generate the .c extension.
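As a rough illustration, the recursion could be sketched in plain Python as below (not the Cython code; following Blondel et al., the divergence then combines three such values as SDTW(X, Y) - (SDTW(X, X) + SDTW(Y, Y)) / 2):

```python
import numpy as np

def sdtw_value(cost, gamma):
    """Soft-DTW value via the O(m n) dynamic-programming recursion.

    cost: (m, n) matrix of alignment costs between the grid values of
    the two curves, e.g. half squared Euclidean distances.
    """
    m, n = cost.shape
    r = np.full((m + 1, n + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # soft-min over the three predecessor cells, computed stably
            v = -np.array([r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]]) / gamma
            vmax = v.max()
            r[i, j] = cost[i - 1, j - 1] - gamma * (
                vmax + np.log(np.exp(v - vmax).sum())
            )
    return r[m, n]
```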

Improvements

Usage:

import numpy as np
from skfda.datasets import make_sinusoidal_process
from skfda.misc.metrics import sdtwDivergence
from sklearn.gaussian_process.kernels import DotProduct, RBF

n = 50
n_grid = 10**3
x_curves = make_sinusoidal_process(
    n_samples=n,
    n_features=n_grid,
    start=0.0,
    stop=50.0,
    period=30.0,
    phase_mean=3.0,
    amplitude_mean=5.0,
    amplitude_std=1.0,
    random_state=11
)

# default alignment cost is the half squared Euclidean distance
sdtw_hsqe = sdtwDivergence(gamma=1.0, cost=None)
sdtw_hsqe(x_curves[0], x_curves[1])

# which is equivalent to:
# DotProduct with sigma_0=0 is the linear kernel k(x, y) = <x, y>
cost_hsqe = DotProduct(sigma_0=0)

def cost_lin(X, Y):
    # 0.5*||x||^2 + 0.5*||y||^2 - <x, y>: the half squared Euclidean cost
    mat = -1 * cost_hsqe(X, Y)
    mat += 0.5 * cost_hsqe.diag(X)[:, np.newaxis]
    mat += 0.5 * cost_hsqe.diag(Y)
    return mat

sdtw_hsqe = sdtwDivergence(gamma=1.0, cost=cost_lin)
sdtw_hsqe(x_curves[0], x_curves[1])

# the cost can be a non-linear kernel:
rbf = RBF(length_scale=1.0)
def cost_rbf(X, Y):
    mat = -1 * rbf(X, Y)
    mat += 0.5 * rbf.diag(X)[:, np.newaxis]
    mat += 0.5 * rbf.diag(Y)
    return mat

sdtw_rbf = sdtwDivergence(gamma=1.0, cost=cost_rbf)
sdtw_rbf(x_curves[0], x_curves[1])

Any other suggestions are welcome.

codecov[bot] commented 2 years ago

Codecov Report

Base: 85.53% // Head: 85.06% // Decreases project coverage by 0.47% :warning:

Coverage data is based on head (f43c73b) compared to base (e69154b). Patch coverage: 26.66% of modified lines in the pull request are covered.

Additional details and impacted files:

```diff
@@           Coverage Diff            @@
##           develop     #412      +/-   ##
===========================================
- Coverage    85.53%   85.06%    -0.47%
===========================================
  Files          141      143        +2
  Lines        11280    11370       +90
===========================================
+ Hits          9648     9672       +24
- Misses        1632     1698       +66
```

| Impacted Files | Coverage Δ |
|---|---|
| skfda/misc/_math.py | 84.37% <ø> (ø) |
| skfda/misc/metrics/_sdtw_distances.py | 24.63% <24.63%> (ø) |
| skfda/misc/metrics/_sdtw_numba.py | 33.33% <33.33%> (ø) |


Clej commented 2 years ago

I rewrote the code in Python using numba's @jit decorators with fastmath=True and cache=True. Computation times are comparable to my Cython version.
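For reference, a decorated kernel along those lines might look like the following sketch (hypothetical, not copied from the PR; the actual implementation lives in skfda/misc/metrics/_sdtw_numba.py):

```python
import numpy as np
from numba import jit

@jit(nopython=True, fastmath=True, cache=True)
def _sdtw_kernel(cost, gamma):
    # Same O(m n) recursion as above, written so numba can compile it.
    # fastmath lets the compiler assume finite arithmetic, so boundary
    # cells are padded with a large finite value rather than np.inf.
    big = 1e300
    m, n = cost.shape
    r = np.empty((m + 1, n + 1))
    r[0, 0] = 0.0
    for i in range(1, m + 1):
        r[i, 0] = big
    for j in range(1, n + 1):
        r[0, j] = big
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a = -r[i - 1, j] / gamma
            b = -r[i, j - 1] / gamma
            c = -r[i - 1, j - 1] / gamma
            vmax = max(max(a, b), c)
            r[i, j] = cost[i - 1, j - 1] - gamma * (
                vmax
                + np.log(np.exp(a - vmax) + np.exp(b - vmax) + np.exp(c - vmax))
            )
    return r[m, n]
```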

Clej commented 2 years ago

What is missing are the tests: any suggestions? Would checking the expected properties of the divergence, such as symmetry, positivity, and indiscernibility, be enough? @vnmabus
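Something along these lines, perhaps (a sketch assuming the sdtwDivergence API from the usage snippet above):

```python
import pytest
from skfda.datasets import make_sinusoidal_process
from skfda.misc.metrics import sdtwDivergence


@pytest.fixture
def curves():
    return make_sinusoidal_process(n_samples=2, random_state=0)


def test_symmetry(curves):
    div = sdtwDivergence(gamma=1.0, cost=None)
    assert div(curves[0], curves[1]) == pytest.approx(div(curves[1], curves[0]))


def test_positivity(curves):
    div = sdtwDivergence(gamma=1.0, cost=None)
    assert div(curves[0], curves[1]) >= 0


def test_indiscernibility(curves):
    # the divergence of a curve with itself should vanish
    div = sdtwDivergence(gamma=1.0, cost=None)
    assert div(curves[0], curves[0]) == pytest.approx(0.0, abs=1e-10)
```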