Climate-Data-Science / Climate-Similarity-Metrics

Which similarity metrics are the most helpful to understand climate?

Compare similarity of QBO to other geolocations with multiple metrics #15

Closed: pawelbielski closed this issue 4 years ago

pawelbielski commented 4 years ago

We need a flexible solution to plot and compare different similarity metrics. Create the following functionality in a separate Python file plots.py. The goal is to have a function that allows for flexible comparisons of maps generated with different similarity metrics and different data_maps or reference data series.
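A minimal sketch of what such a function could look like (the names `plot_similarity_comparison` and `compute_similarity_map` and the `(time, lat, lon)` data layout are illustrative assumptions, not the spec):

```python
import numpy as np
import matplotlib.pyplot as plt

def compute_similarity_map(data_map, reference_series, metric):
    """Apply metric(cell_series, reference_series) to every grid cell.

    data_map is assumed to have shape (time, lat, lon)."""
    _, n_lat, n_lon = data_map.shape
    sim_map = np.empty((n_lat, n_lon))
    for i in range(n_lat):
        for j in range(n_lon):
            sim_map[i, j] = metric(data_map[:, i, j], reference_series)
    return sim_map

def plot_similarity_comparison(data_maps, reference_series, metrics):
    """Plot one row per data map and one column per similarity metric."""
    fig, axes = plt.subplots(len(data_maps), len(metrics), squeeze=False)
    for i, data_map in enumerate(data_maps):
        for j, (name, metric) in enumerate(metrics.items()):
            axes[i][j].imshow(compute_similarity_map(data_map, reference_series, metric))
            axes[i][j].set_title(name)
    plt.show()
```

It could then be called with a dict of metric functions, e.g. `{"Pearson": pearson_fn, "MI": mutual_information_fn}` (names hypothetical).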

pierretoussing commented 4 years ago

@pawelbielski I'm struggling with the colorbar for different metrics across months: the values of the different metrics are not in the same range (hardcoding the colorbar is not an option), and the ranges for the same metric also vary between months (for example, the range for MI in January is not the same as in April). The only "correct" option is to draw a colorbar for every subplot, but this does not look good.
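One way to make per-subplot colorbars less intrusive is matplotlib's axes_grid1 toolkit; a minimal sketch with dummy data (the metric names and ranges are just placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

# Dummy maps with deliberately different value ranges.
maps = {"MI": np.random.rand(20, 40) * 3.0,
        "Pearson": np.random.rand(20, 40) * 2.0 - 1.0}

fig, axes = plt.subplots(1, len(maps))
for ax, (name, sim_map) in zip(axes, maps.items()):
    im = ax.imshow(sim_map)
    ax.set_title(name)
    # Attach a slim colorbar to each subplot so every metric keeps
    # its own scale without dominating the layout.
    cax = make_axes_locatable(ax).append_axes("right", size="5%", pad=0.05)
    fig.colorbar(im, cax=cax)
plt.show()
```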

pawelbielski commented 4 years ago

@pierretoussing Comments on 7_compare_similarity_metrics.ipynb:

Different ranges of values (or possibly inverted values) of various similarity metrics are indeed challenging. There are a few options to deal with that:

pierretoussing commented 4 years ago
  1. The cosine similarity was not correct: scipy.spatial.distance.cosine calculates the cosine distance, which is actually 1 - cosine similarity. I fixed it (see the sketch after this list). The range for the cosine similarity is now [-1, 1], which makes more sense.

  2. Transfer entropy quantifies the information transfer between an information source and a destination (here, two time series). The nearer the value is to 0, the less information needs to be transferred: the source and destination are more similar. In this context, a small value means more similarity.
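For reference, the fix in point 1 boils down to this (a minimal sketch; the wrapper name is illustrative):

```python
import numpy as np
from scipy.spatial.distance import cosine

def cosine_similarity(a, b):
    """scipy.spatial.distance.cosine returns the cosine *distance*,
    i.e. 1 - cosine similarity, so subtract it from 1."""
    return 1.0 - cosine(a, b)

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, -a))  # -1.0: exactly opposite directions
```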

Maybe I should add the value ranges and their meanings for the different similarity metrics to the respective docs?

pawelbielski commented 4 years ago

@pierretoussing I think it is important for interpretability to make sure the values are comparable. It could therefore make sense to plot 1 - transfer_entropy instead of transfer_entropy directly, just for the sake of interpretability.

Also, what do you think about the ideas I listed above for tackling the different value ranges of the similarity metrics? Which of them has the most potential in your opinion? Do you have other ideas, or did you maybe find something new?

pierretoussing commented 4 years ago

I think the best way to make them comparable is to scale them all to [0, 1] with 1 meaning the most similar and 0 meaning the most dissimilar.

This would mean that some metrics, like Transfer Entropy, have to be inverted, and other metrics, like Manhattan Distance, have to be compressed to the correct interval.
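A rough sketch of that scaling (min-max scaling is just one possible way to compress values into [0, 1]; the function names are illustrative):

```python
import numpy as np

def scale_to_unit_interval(values):
    """Min-max scale a map of raw metric values to [0, 1] (assumes vmax > vmin)."""
    vmin, vmax = np.min(values), np.max(values)
    return (values - vmin) / (vmax - vmin)

def to_similarity(values, higher_is_more_similar=True):
    """Map metric values to [0, 1] with 1 = most similar.

    Distance-like metrics (Transfer Entropy, Manhattan Distance, ...)
    are inverted after scaling so that 1 always means most similar."""
    scaled = scale_to_unit_interval(values)
    return scaled if higher_is_more_similar else 1.0 - scaled
```

For a distance-like map this would be called as, e.g., `to_similarity(manhattan_map, higher_is_more_similar=False)`.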

pierretoussing commented 4 years ago

@pawelbielski Should this be done directly in similarity_measures.py, or only for the plotting in plots.py?

pawelbielski commented 4 years ago

The scale [0, 1] does not include information about the sign. How about [-1, 1]? I suggest having the scaling functionality as an exchangeable module; hardcoding it into the similarity functions in similarity_measures.py would restrict future comparisons with non-scaled values. Maybe include it as a scaling function in calculations.py?

Also, I think the scaling (both [0, 1] and [-1, 1]) does not solve the problem of different shapes of curves on the scatter plot. This causes almost the whole map to be violet with some parts being yellow. A possible solution is described in #16.
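A sketch of how the exchangeable scaling module could look (the names are illustrative, not the actual calculations.py API):

```python
import numpy as np

def scale_01(values):
    """Min-max scale raw metric values to [0, 1]."""
    vmin, vmax = np.min(values), np.max(values)
    return (values - vmin) / (vmax - vmin)

def scale_sym(values):
    """Scale to [-1, 1], preserving the sign information in the plot."""
    return 2.0 * scale_01(values) - 1.0

def scaled_similarity_map(sim_map, scaling=None):
    """Apply an arbitrary, swappable scaling function to a similarity map.

    scaling=None keeps the raw values, so comparisons with
    non-scaled maps remain possible."""
    return sim_map if scaling is None else scaling(sim_map)
```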

pierretoussing commented 4 years ago

Ok @pawelbielski, I implemented a function binning_values_to_quantiles that converts a map of values into the respective bin numbers the values belong to. In this scheme, 0.3 means the value belongs to the 20%-30% bin, which contains the 20%-30% smallest values of the map. All bins have the same size.
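A sketch of how such quantile binning can be implemented (an illustration of the behaviour described above, not necessarily the actual implementation in the repo):

```python
import numpy as np
from scipy.stats import rankdata

def binning_values_to_quantiles(values, n_bins=10):
    """Replace every value by the upper edge of the equal-sized
    quantile bin it falls into: 0.3 marks the 20%-30% bin, i.e. the
    values among the 20%-30% smallest of the map."""
    flat = np.asarray(values, dtype=float).ravel()
    ranks = rankdata(flat)                          # ranks 1..n, ties averaged
    bins = np.ceil(ranks / flat.size * n_bins) / n_bins
    return bins.reshape(np.shape(values))
```

With ten bins the smallest possible binned value is 0.1, which is where the [0.01, 1] and [0.2, 2] ranges below come from.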

So now all values lie in [0, 1], and some metrics like Transfer Entropy still have to be inverted (will be done).

But how we can "compare" the respective values remains unclear. One possibility would be to multiply the value for one metric by the value for the other metric, which yields a number in [0.01, 1], with 1 meaning the two similarity metrics totally agree.

Another option would be to sum the two values and get a number in [0.2, 2], with 2 meaning the two similarity metrics totally agree.

I think there are of course a lot of other possibilities. In order to keep it modular, I would suggest writing a function like this:
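One possible shape for such a function (a sketch; the name combine_similarity_maps and the default operator are illustrative assumptions):

```python
import numpy as np

def combine_similarity_maps(map_a, map_b, combination=np.multiply):
    """Combine two binned similarity maps with an exchangeable operator.

    With values binned to [0.1, 1], np.multiply yields agreement
    scores in [0.01, 1]; np.add yields scores in [0.2, 2]."""
    return combination(map_a, map_b)
```

Summing instead of multiplying would then just be `combine_similarity_maps(map_a, map_b, combination=np.add)`.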

And then plot the result.

pierretoussing commented 4 years ago

The issue of making different similarity metrics comparable is treated in #18.