bouweandela closed this 5 months ago
I have assigned myself to this one. My intention is to start developing a serious statistical module for ESMValTool, and this is a good starting point.
Regarding this issue/enhancement, here is some information that I hope will be useful.
It might be worth checking the R-Forge libraries, for instance those implementing the Wilcox robust statistics functions (https://rdrr.io/rforge/WRS/man/) or those in robustbase (https://rdrr.io/rforge/robustbase/man/). Some of these functions are already implemented in scipy, but not all of them. The rpy2 package is worth keeping in mind, either for reusing the R implementations or for cross-checking results.
In general, the Pearson correlation is not robust, and it makes assumptions about the joint distribution similar to those of linear regression. However, there are modest refinements that address at least the sensitivity to outliers, such as the percentage bend correlation coefficient (https://link.springer.com/article/10.1007/BF02294395) or the Winsorized correlation (which is computed from Winsorized rather than raw values).
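To make the outlier point concrete, here is a minimal sketch of a Winsorized correlation built from scipy.stats.mstats.winsorize and scipy.stats.pearsonr. The helper name winsorized_correlation and the 10% clipping fraction are assumptions chosen for illustration, not something taken from ESMValTool or from the WRS package, and the percentage bend coefficient is not covered here.

```python
import numpy as np
from scipy.stats import pearsonr
from scipy.stats.mstats import winsorize


def winsorized_correlation(x, y, fraction=0.1):
    """Pearson correlation of x and y after Winsorizing each one separately.

    `fraction` is the share of points clipped in each tail
    (hypothetical default, chosen only for this example).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # winsorize returns a masked array; convert back to a plain ndarray.
    x_w = np.asarray(winsorize(x, limits=(fraction, fraction)))
    y_w = np.asarray(winsorize(y, limits=(fraction, fraction)))
    r, _ = pearsonr(x_w, y_w)
    return float(r)


# A perfectly linear relation spoiled by a single gross outlier.
x = np.arange(20, dtype=float)
y = x.copy()
y[-1] = -1000.0

raw_r, _ = pearsonr(x, y)
robust_r = winsorized_correlation(x, y)
print(f"Pearson: {raw_r:.3f}, Winsorized: {robust_r:.3f}")
```

In this toy case the single outlier is enough to flip the sign of the plain Pearson coefficient, while the Winsorized version stays positive. Each variable is clipped on its own, so the outlying pair is damped rather than removed.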
As for the k-sample methods mentioned above (Anderson-Darling, Kruskal-Wallis, etc.), the ksamples package documents them, but using it requires some familiarity with rank-based tests. Rank correlation measures are another possibility.
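For readers less familiar with rank-based methods, the sketch below shows the scipy counterparts of two of the ideas mentioned here: the Kruskal-Wallis k-sample test (scipy.stats.kruskal) and two rank correlation measures (scipy.stats.spearmanr and scipy.stats.kendalltau). The data are synthetic and the variable names are invented for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Three synthetic samples; only the third one is shifted.
group_a = rng.normal(0.0, 1.0, size=50)
group_b = rng.normal(0.0, 1.0, size=50)
group_c = rng.normal(3.0, 1.0, size=50)

# Kruskal-Wallis works on the ranks of the pooled data, so it does not
# assume normality of the samples.
kw = stats.kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis H = {kw.statistic:.2f}, p = {kw.pvalue:.3g}")

# Rank correlations only use the ordering of the values, so a monotonic
# but strongly non-linear relation still counts as perfect association.
x = np.linspace(1.0, 10.0, 30)
y = x ** 3
rho, rho_p = stats.spearmanr(x, y)
tau, tau_p = stats.kendalltau(x, y)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Because both families depend only on ranks, a handful of extreme values cannot dominate the result the way they can with the Pearson coefficient.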
We will also support R diagnostics in the near future, see https://github.com/ESMValGroup/ESMValTool/pull/631, so no need to use rpy2.
Feel free to re-open if anyone has plans to do this.
So we do not forget the discussion in #596, this code https://github.com/ESMValGroup/ESMValTool/blob/60a89f7828025c599615bcb5932b1917a40fb333/esmvaltool/diag_scripts/examples/correlate.py#L48
should probably be updated so that it uses either scipy.stats.mstats.ks_twosamp or scipy.stats.ks_2samp, or alternatively scipy.stats.anderson_ksamp, as some people seem a bit critical of the KS test.
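As a rough idea of what such an update could look like, the sketch below runs both scipy.stats.ks_2samp and scipy.stats.anderson_ksamp on two flattened arrays. The helper compare_samples and its return format are hypothetical; how the arrays would be obtained inside correlate.py is not shown and would need to follow that script.

```python
import warnings

import numpy as np
from scipy import stats


def compare_samples(a, b):
    """Two-sample KS and Anderson-Darling statistics for two arrays.

    Hypothetical helper for illustration; not part of ESMValTool.
    """
    a = np.ravel(np.asarray(a, dtype=float))
    b = np.ravel(np.asarray(b, dtype=float))
    ks = stats.ks_2samp(a, b)
    with warnings.catch_warnings():
        # anderson_ksamp warns when its p-value estimate is capped or floored.
        warnings.simplefilter("ignore")
        ad = stats.anderson_ksamp([a, b])
    return {
        "ks_statistic": float(ks.statistic),
        "ks_pvalue": float(ks.pvalue),
        "ad_statistic": float(ad.statistic),
        "ad_critical_values": np.asarray(ad.critical_values),
    }


rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=200)
shifted = rng.normal(1.5, 1.0, size=200)
result = compare_samples(reference, shifted)
print(result["ks_statistic"], result["ks_pvalue"], result["ad_statistic"])
```

The KS statistic is the largest gap between the two empirical distribution functions, whereas the Anderson-Darling statistic weights the tails more heavily, which is one reason it is sometimes preferred. For masked input, scipy.stats.mstats.ks_twosamp would be the variant to look at instead of ks_2samp.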