GAA-UAM / scikit-fda

Functional Data Analysis Python package
https://fda.readthedocs.io
BSD 3-Clause "New" or "Revised" License
298 stars 53 forks source link

How does one add DTW to metrics? #363

Open a-berg opened 3 years ago

a-berg commented 3 years ago

First of all let me thank you for this package, and say: great work!

Is your feature request related to a problem? Please describe. I am working with timeseries data. For now I have applied K-means with L2 norm with okay-ish results, however I wanted to try other distance functions to see if these results could be improved. This distance is already implemented in the package dtw-python. After some hours I haven't been able to make FDA K-means work with dtw, I have been trying to wrap the function but I am at a loss without knowing more about the inner workings of your library.

Describe the solution you'd like Some documentation on how to adapt pre-existing distance functions to your API. I am willing to contribute code if some guidance is provided.

Describe alternatives you've considered I have considered switching to scikit-learn. After all, DTW is meant for timeseries data so the functional approach that skfda provides isn't as critical as when applying Euclidean distance. I however think that in order to automate the process and do hyperparameter optimization, it's better if I only use skfda.

Additional context

vnmabus commented 3 years ago

First, I will explain things as they are in the current development branch, as we will release it as a new version in a few days. Remember that with pip you can install a package from a Github branch, even if it has not yet been released in PyPI.

A metric is a Python callable with the following definition (in your case MetricElementType should be FDataGrid, as DTW cannot be implemented for FDataBasis as far as I know):


class Metric(Protocol[MetricElementType]):
    """Protocol for a metric between two elements of a metric space."""

    @abstractmethod
    def __call__(
        self,
        __e1: MetricElementType,
        __e2: MetricElementType,
    ) -> np.ndarray:
        pass

It receives two FDataGrid objects with the same length, and returns a 1d array where the i-th element has the distance between the i-th elements of both FDataGrid.

You can either create a function with the prototype of __call__, or create a class (and maybe inherit from Metric[FDataGrid] so that Mypy helps you with the typing). The class is better if you want to pass parameters to the metric (such as p for a Lp metric).

In order to implement the function, you probably will need to access the grid_points attribute of the involved FDataGrid instances, as well as the actual function values in the data_matrix attribute. Those are explained in https://fda.readthedocs.io/en/latest/auto_tutorial/plot_getting_data.html#the-fdatagrid-class .

I am not familiar with dtw-python. I assume that it works only with univariate scalar-valued functions. In that case you probably will need to use gridpoints[0] and data_matrix[..., 0] in order to pick the grid points for the first (and only) dimension, and the values of the first (and only) coordinate of the function return.

That would be enough to use it. If you want to optimize also the pairwise computation of that metric, it is also possible to override the default pair-by-pair computation, as it is done for the L2 case here:

https://github.com/GAA-UAM/scikit-fda/blob/cf71fad434d49d06b34308705dd78e48ce1c96c6/skfda/misc/metrics/_lp_distances.py#L98-L136

I also wanted to say that if it is easier for you, scikit-fda is designed to be as compatible as possible with scikit-learn, and we reuse its hyperparameter tuning utilities. You can even mix both in the same Pipeline, see here for examples:

https://fda.readthedocs.io/en/latest/auto_tutorial/plot_skfda_sklearn.html

Note that as dtw-python has a GPL license, I cannot use it in our project, which has an incompatible BSD3 license. Thus, I cannot accept a PR that uses that library. If you implement it on your own or you use a more permissive library I could accept that implementation.