Similarity or Distance - Higher is closer or lower is closer?

Lucew commented 1 month ago

First of all, thanks for creating this awesome package. I want to use it to evaluate the measures for finding related time series within industrial/building time series datasets.

For that, I have the following question: Is there a place where I can quickly look up whether the output of a measure is a similarity or a distance? In other words, whether two time series are more related when the value is higher (similarity) or when it is lower (distance).

The simplest example would be the difference between the correlation and the Euclidean distance of two time series. For related time series, one would expect the correlation to be high but the Euclidean distance to be low. When, for example, searching for the k most related time series across the different measures implemented here, this information is of great interest.
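To make the distinction concrete, here is a minimal sketch in plain Python (it does not use pyspi). The helper `top_k_related` and its `higher_is_closer` flag are hypothetical names for the piece of metadata asked about above; the toy series are made up for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation: higher means more related (a similarity)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance: lower means more related (a distance)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def top_k_related(scores, k, higher_is_closer):
    """Return the indices of the k most related candidates.

    `higher_is_closer` is a hypothetical flag, not existing pyspi metadata.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_closer)
    return order[:k]


# Toy data: one query series and three candidates.
rng = random.Random(0)
t = [2 * math.pi * i / 199 for i in range(200)]
query = [math.sin(v) for v in t]
candidates = [
    [math.sin(v) + rng.gauss(0, 0.05) for v in t],  # closely related
    [rng.gauss(0, 1) for _ in t],                   # unrelated noise
    [math.sin(v) + rng.gauss(0, 0.5) for v in t],   # loosely related
]

correlations = [pearson(query, c) for c in candidates]
distances = [euclidean(query, c) for c in candidates]

# Same ranking task, opposite sort direction depending on the measure type.
top_by_correlation = top_k_related(correlations, k=2, higher_is_closer=True)
top_by_distance = top_k_related(distances, k=2, higher_is_closer=False)
```

Without the flag, sorting both score lists in the same direction would return the least related candidates for one of the two measures.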

I'm sorry if I missed that in the publication or the paper. Kindly refer me if there is information on that anywhere.

fkiraly commented 1 month ago

@Lucew, related to this issue: https://github.com/DynamicsAndNeuralSystems/pyspi/issues/72

Do you have code that loops over all the distances or similarities? (If we moved to the tagged record structure, we could add a tag for the type of metric.)

benfulcher commented 1 month ago

Hi @Lucew, thanks for the question. We've never seen an application that required this to be so clearly differentiated, but I agree it would be useful to label as metadata. No one is actively maintaining this package at the moment, and I'll be on leave for a while, but I will add this to the list of new functionalities. For now, you will have to work this out yourself by understanding each algorithm (or, more broadly, by looking for correlations with known correlation metrics or known distance metrics; in many cases, many features will be strongly correlated or anticorrelated).
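A rough sketch of the correlation-based route, in plain Python without pyspi: compare the values of an unlabelled measure with those of a measure whose orientation is known, over the same set of time-series pairs, and read the sign of the rank correlation. The function names, the 0.5 threshold, and the toy data are all assumptions for illustration, and the result is only a heuristic hint.

```python
import random


def ranks(values):
    """Ordinal ranks (ties broken by position), enough for a rough sign check."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman rank correlation computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mean = (len(x) - 1) / 2  # both rank lists are permutations of 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var


def guess_orientation(unknown, reference_similarity, threshold=0.5):
    """Guess whether `unknown` behaves like a similarity or a distance.

    `reference_similarity` holds values of a measure known to be a similarity,
    evaluated on the same pairs. The threshold is an arbitrary assumption.
    """
    rho = spearman(unknown, reference_similarity)
    if rho >= threshold:
        return "similarity", rho
    if rho <= -threshold:
        return "distance", rho
    return "unclear", rho


# Toy values for 60 pairs of time series (made up for illustration).
rng = random.Random(1)
reference = [rng.random() for _ in range(60)]  # e.g. absolute correlation
distance_like = [1 - r + rng.gauss(0, 0.02) for r in reference]
similarity_like = [r ** 2 for r in reference]

label_distance, rho_distance = guess_orientation(distance_like, reference)
label_similarity, rho_similarity = guess_orientation(similarity_like, reference)
```

Measures that land in the "unclear" band (for example, ones that respond to nonlinear or lagged structure) would still need to be classified by reading the algorithm.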

Lucew commented 1 month ago

@fkiraly Yes, that is precisely how I approach it. After reading your post, I did something similar to what you mentioned in #72. To allow for time-outs on individual metrics and to distribute the calculations for a single dataset across the metrics, I automatically split the YAML file for the "fast" configuration into multiple files and then run pyspi many times in separate processes via a management script. This allows me to compute single SPIs and to parallelize the package for one dataset, which is necessary because my datasets have very high dimensionality (230-400 time series, in one case even 1600).

Config Walker

```python3
import yaml


def iterate_config(config_path: str):
    # load the config we want to use for the pyspi run
    with open(config_path, 'r') as file:
        config = yaml.load(file, Loader=yaml.FullLoader)

    # go through the config and write single configs out of there
    for spi_type, spi_names in config.items():
        for spi_name, spi_entry in spi_names.items():

            # keep everything except the list of configs
            keep_dict = {key: val for key, val in spi_entry.items() if key != 'configs'}

            # check whether we have configs, then iterate over those
            spi_configs = spi_entry.get('configs', None)
            if spi_configs is None:
                yield {spi_type: {spi_name: {**keep_dict, 'configs': None}}}
            else:
                for single_config in spi_configs:
                    # this needs to be a list, otherwise it will fail!
                    # build a fresh dict per yield so collected results stay independent
                    yield {spi_type: {spi_name: {**keep_dict, 'configs': [single_config]}}}
```
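The management side (one process per single-SPI config, each with a time-out) could look roughly like the sketch below. It uses only the standard library; the stand-in commands replace the real worker call, and the script and file names mentioned in the comment are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_with_timeout(command, timeout_s):
    """Run one command in its own process; report 'ok', 'failed' or 'timeout'."""
    try:
        completed = subprocess.run(command, timeout=timeout_s, capture_output=True)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising
        return "timeout"
    return "ok" if completed.returncode == 0 else "failed"


def run_all(commands, timeout_s, max_workers=4):
    """Run the commands concurrently and return their statuses in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: run_with_timeout(cmd, timeout_s), commands))


# Stand-in commands so the sketch runs anywhere. In practice each entry would be
# something like [sys.executable, "run_single_spi.py", "configs/spi_0.yaml"],
# where both the script and the file name are hypothetical.
demo_commands = [
    [sys.executable, "-c", "pass"],                         # finishes immediately
    [sys.executable, "-c", "import time; time.sleep(60)"],  # exceeds the time-out
    [sys.executable, "-c", "raise SystemExit(3)"],          # non-zero exit code
]
statuses = run_all(demo_commands, timeout_s=3.0)
```

Threads are sufficient here because each one only waits on a child process; the heavy computation happens in the children.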

Having public-facing APIs for every SPI would have made this more direct.

@benfulcher Thanks for the quick reply. I'm fully aware that my application is outside of the intended usage. This collection of SPIs is just so useful for me. In my domain, linear correlation is most often used to discover relations between time series, so having a collection of other measures that lets me check whether some are transferable is just grand for my research. I will work out this information independently then, most likely by combining the approaches you proposed. If there is time to integrate this, I'm sure others will also profit later. As far as I understand, your list of papers already includes an arXiv paper that ran into a similar problem, where they simply inverted the most obvious distances by multiplying them by -1.

Once I'm done, I could provide a preliminary list of which SPIs are distances and which are similarities in this issue, if there is time.

fkiraly commented 3 weeks ago

@benfulcher, if no one is actively maintaining the package, would you consider a maintenance collaboration, with a refactor towards #72?

DynamicsAndNeuralSystems / pyspi

Similarity or Distance - Higher is closer or lower is closer? #73