Closed pawelbielski closed 4 years ago
@pawelbielski I struggle with the colorbar for different metrics across different months, because the values of the different metrics are not in the same range (hardcoding the colorbar is not an option) and the ranges for the same metric across different months also vary (for example, the range for MI in January is not the same as the range for MI in April). The only "correct" option is to draw a colorbar for every subplot, but this does not look good.
@pierretoussing
Comments on 7_compare_similarity_metrics.ipynb:
[x] I don't understand the value range for cosine similarity. It should be in the range 0-1, where 1 means most similar. The plots show values above 1, and the similarity between QBO and the equatorial region is near zero.
[x] Transfer Entropy also looks inverted. Note that we expect high similarity between QBO and the equatorial region. Or does 0 mean the highest similarity in transfer entropy?
[x] Mutual information plot looks fine.
Different ranges of values (or possibly inverted values) of various similarity metrics are indeed challenging. There are a few options to deal with that:
We can leave them as they are for now; they will scale automatically between min and max as you plot them. This should already be useful to us, e.g. for plotting the scatterplots and maps of similar values.
Later you can think of normalizing/scaling them meaningfully. The easiest way is of course to normalize them linearly between min and max (this is what the plots do automatically if you don't specify anything).
Another idea could be to use a nonlinear transformation that maps the values to a given distribution. For example: given that Mutual Information returns mostly small values and only a few big ones, we could transform them so that the top 10% of values contain the same number of data points as the bottom 10%. I am sure there are already similar approaches on the Internet; if not, we will create such a transformation ourselves.
We could also ignore the scaling and focus directly on meaningful plotting of the correlations between different similarity metrics. Here we would probably also have to scale the values at some point, but at least we would have a more meaningful visual result that we could get feedback on.
In general, comparing metrics that have different ranges seems to be a common problem. You should try to find something in the formal or informal (blogs, tutorials) literature. If we cannot easily find a solution that fits our needs, this could shape the story of your thesis, e.g. by proposing such an approach and validating its usefulness.
The cosine similarity was not correct: scipy.spatial.distance.cosine calculates the cosine distance, which is actually 1 - cosine similarity. I fixed it. The range for the cosine similarity is now [-1, 1], which makes more sense.
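For reference, the fix can be sketched as follows (the wrapper name is illustrative, not necessarily the one used in similarity_measures.py):

```python
import numpy as np
from scipy.spatial.distance import cosine


def cosine_similarity(a, b):
    # scipy's `cosine` returns the cosine *distance*, i.e. 1 - similarity,
    # so we convert it back; the result is in [-1, 1], 1 = most similar.
    return 1 - cosine(a, b)


a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))   # -> 1.0 (identical series)
print(cosine_similarity(a, -a))  # -> -1.0 (opposite series)
```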
Transfer entropy quantifies the information transfer between an information source and a destination (here, two time series). So the closer the value is to 0, the less information needs to be transferred: the source and destination are more similar. In this context, a low value means more similarity.
Maybe I should add the value ranges and their meanings for the different similarity metrics to the respective docs?
@pierretoussing I think it is important for interpretability to make sure the values are comparable. That means it could make sense to plot 1 - transfer_entropy instead of transfer_entropy directly, just for the sake of interpretability.
Also, what do you think about the ideas to tackle different ranges of values of similarity metrics I listed above? Which of them has the most potential in your opinion? Do you have other ideas, or did you maybe find something new?
I think the best way to make them comparable is to scale them all to [0, 1] with 1 meaning the most similar and 0 meaning the most dissimilar.
This would mean that some metrics like Transfer Entropy have to be inverted and other metrics like Manhattan Distance have to be compressed to the correct interval.
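A minimal sketch of such a scaling step, assuming a single helper that min-max scales to [0, 1] and optionally inverts dissimilarity-type metrics (the function name and signature are hypothetical, not taken from the repository):

```python
import numpy as np


def scale_to_unit_interval(values, invert=False):
    """Min-max scale an array of metric values to [0, 1].

    With invert=True, dissimilarity metrics (e.g. Transfer Entropy or
    Manhattan Distance) are flipped so that 1 always means most similar.
    """
    values = np.asarray(values, dtype=float)
    vmin, vmax = values.min(), values.max()
    if vmax == vmin:
        # constant map: avoid division by zero
        scaled = np.zeros_like(values)
    else:
        scaled = (values - vmin) / (vmax - vmin)
    return 1 - scaled if invert else scaled


# e.g. Manhattan distances: smallest distance -> highest similarity
print(scale_to_unit_interval([0.0, 5.0, 10.0], invert=True))  # -> [1.0, 0.5, 0.0]
```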
@pawelbielski
Should this be done directly in similarity_measures.py, or only for the plotting in plots.py?
The scale [0, 1] does not include information about a sign. How about [-1, 1]? I suggest having the scaling functionality as an exchangeable module. Hardcoding it into the similarity functions in similarity_measures.py would restrict future comparisons with non-scaled values. Maybe include it as a scaling function in calculations.py?
Also, I think the scaling (both [0, 1] and [-1, 1]) does not solve the problem of different shapes of curves on the scatter plot. This causes almost the whole map to be violet, with some parts being yellow. A possible solution is described in #16.
Ok @pawelbielski, I implemented a function binning_values_to_quantiles that converts a map of values into the respective bin numbers the values belong to. In this case, 0.3 means the value belongs to the 20%-30% bin, which contains the 20%-30% smallest values of the map. All the bins have the same size.
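A possible implementation of such rank-based binning, as a sketch (not necessarily identical to the code in the repository):

```python
import numpy as np


def binning_values_to_quantiles(values, n_bins=10):
    """Map each value to the upper edge of its quantile bin.

    With n_bins=10, the smallest 10% of values map to 0.1, the next
    10% to 0.2, ..., the largest 10% to 1.0. All bins contain (roughly)
    the same number of data points, whatever the value distribution.
    """
    values = np.asarray(values, dtype=float)
    flat = values.ravel()
    # rank each value, then convert ranks to bin numbers 1..n_bins
    ranks = flat.argsort().argsort()
    bins = np.floor(ranks * n_bins / len(flat)).astype(int) + 1
    return (bins / n_bins).reshape(values.shape)


print(binning_values_to_quantiles(np.arange(10)))
# -> [0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]
```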
So now all values are between [0, 1] and some metrics like Transfer Entropy still have to be inverted (will be done).
But how we can "compare" the respective values still remains unclear. One possibility would be to multiply the value for one metric with the value for the other, which returns a number in [0.01, 1], with 1 meaning the two similarity metrics totally agree.
Another option would be to sum the two values and get a number in [0.2, 2], with 2 meaning the two similarity metrics totally agree.
I think there are of course a lot of other possibilities. In order to keep it modular, I would suggest writing a function like combine_similarity_metrics(), which takes a data map, a reference series, a list of metrics to compute the similarity between the data map and the reference series, and a combination function (like adding, multiplying, ...) to combine two similarity values (binned to [0.1, 1]) into one, and then plots the result.
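A sketch of that interface, minus the plotting step; the signature is an assumption, and it bundles a minimal rank-based binning helper so the example is self-contained:

```python
import numpy as np


def binning_values_to_quantiles(values, n_bins=10):
    # minimal rank-based binning: smallest values -> 0.1, largest -> 1.0
    ranks = values.argsort().argsort()
    return (np.floor(ranks * n_bins / len(values)).astype(int) + 1) / n_bins


def combine_similarity_metrics(data_map, reference_series, metrics,
                               combination=np.multiply):
    """Compute each metric between every series in data_map and the
    reference series, bin each resulting map to [0.1, 1] quantiles,
    and fold the binned maps together with the combination function.

    data_map: array of shape (n_points, series_length)
    metrics:  callables f(series, reference_series) -> float,
              where larger means more similar (inverted beforehand if needed)
    """
    binned_maps = []
    for metric in metrics:
        values = np.array([metric(series, reference_series)
                           for series in data_map])
        binned_maps.append(binning_values_to_quantiles(values))
    combined = binned_maps[0]
    for binned in binned_maps[1:]:
        combined = combination(combined, binned)
    return combined
```

With the default np.multiply and two metrics binned to [0.1, 1], the result lands in [0.01, 1] as described above; passing np.add instead gives [0.2, 2].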
The issue of making different similarity metrics comparable is treated in #18
We need a flexible solution to plot and compare different similarity metrics. Create the following functionality in the separate Python file plots.py. The goal is to have a function that allows for flexible comparisons of maps generated with different similarity metrics and different data_maps or reference data series:
[x] Create a function that takes: data_map, reference data series (e.g. qbo), a list of similarity metrics and plots results for all similarity metrics in separate columns
[x] Integrate different modes of visualization: for the whole period, for every month separately, for winter months only.
[x] Present the functionality in a separate jupyter notebook