asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

Warnings with non-symmetric distance #38

Closed wirginiad closed 5 years ago

wirginiad commented 5 years ago

Hi, I am trying to use the CDM distance from TSclust ("CDMdistance") and to compare CVIs for different numbers of clusters. After running

proxy::dist(data, method = "CDMdis")
p1 <- tsclust(data, type = "hierarchical", k = 2:5, distance = "CDMdis", control = hierarchical_control(method = "ward.D"))

I get this warning:

Distance matrix is not symmetric, and hierarchical clustering assumes it is (it ignores the upper triangular).

After

sapply(p1, cvi, type = "internal")

the indices are returned, but with:

Warning messages:
1: In FUN(X[[i]], ...) : Internal CVIs: series' cross-distance matrix is NOT symmetric, which can be problematic for: Sil D COP

I guess I am doing something wrong. I'd be grateful if you could help me.

asardaes commented 5 years ago

Some distances are not symmetric, and others are only symmetric under certain circumstances. The documentation of the CDM states:

While this dissimilarity is asymptotically symmetric, for short series the differences between diss.CDM(x,y) and diss.CDM(y,x) may be noticeable.

Many functions assume distances are symmetric, including proxy::dist when you only pass x:

library(TSclust)
library(dtwclust)

set.seed(319L)
series <- lapply(1L:4L, function(.) { rnorm(10L, 10, 10) })

proxy::pr_DB$set_entry(FUN=diss.CDM, names="CDMdis", distance=TRUE, loop=TRUE)

dm <- proxy::dist(series, method="CDMdis")
# TRUE: with only x, a "dist" object (lower triangle only) is returned, and as.matrix() mirrors it
base::isSymmetric(base::as.matrix(dm))

dm <- proxy::dist(series, series, method="CDMdis")
# FALSE: with both x and y, every ordered pair is computed
base::isSymmetric(base::as.matrix(dm))
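
To see why passing only x can hide an asymmetry, here is a minimal base R sketch. It does not reproduce proxy's internals. The function toy_diss and the data toy_series are made up for illustration; the point is only that a loop which evaluates each unordered pair once and mirrors the result is symmetric by construction, whatever the underlying measure does.

```r
# Hypothetical asymmetric dissimilarity (illustration only, not diss.CDM):
# total amount by which x exceeds y, so toy_diss(x, y) != toy_diss(y, x) in general.
toy_diss <- function(x, y) sum(pmax(x - y, 0))

set.seed(1L)
toy_series <- lapply(1:4, function(i) rnorm(10L))
n <- length(toy_series)

# Full matrix: every ordered pair (i, j) is evaluated.
full <- matrix(0, n, n)
for (i in seq_len(n)) {
    for (j in seq_len(n)) {
        full[i, j] <- toy_diss(toy_series[[i]], toy_series[[j]])
    }
}

# Half matrix: only pairs with i > j are evaluated, then mirrored.
half <- matrix(0, n, n)
for (i in 2:n) {
    for (j in seq_len(i - 1L)) {
        half[i, j] <- toy_diss(toy_series[[i]], toy_series[[j]])
        half[j, i] <- half[i, j]
    }
}

isSymmetric(full)  # the asymmetry of toy_diss shows up here
isSymmetric(half)  # symmetric by construction, the asymmetry is hidden
```

Both matrices agree on the lower triangle; they differ only in whether the upper triangle was actually computed.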

Hierarchical clustering and some CVIs also assume symmetry. For example, hclust takes a "dist" structure as input, which is essentially the lower triangle of the distance matrix plus some extra attributes:

# TRUE
all(as.dist(as.matrix(dm)) == dm[lower.tri(dm)])
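
The effect on clustering can be sketched in base R with an invented asymmetric matrix. The values in m below are hypothetical and chosen only so that the two triangles disagree, and method = "average" is an arbitrary choice for the illustration. Clustering the lower triangle and clustering the upper triangle (via a transpose) generally produce different dendrograms, which is the information hclust never sees.

```r
# Hypothetical asymmetric 4 x 4 dissimilarity matrix (values invented for illustration).
m <- matrix(c(0, 1, 4, 5,
              9, 0, 2, 6,
              8, 7, 0, 3,
              1, 2, 9, 0), nrow = 4L, byrow = TRUE)

# as.dist() keeps only the lower triangle; the upper triangle is dropped.
d_lower <- as.dist(m)
# Transposing first keeps what used to be the upper triangle instead.
d_upper <- as.dist(t(m))

hc_lower <- hclust(d_lower, method = "average")
hc_upper <- hclust(d_upper, method = "average")

# Merge heights of the two dendrograms, for comparison.
hc_lower$height
hc_upper$height
```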

So when the distance is not symmetric, these functions discard part of the information (the upper triangle). Maybe the difference between the two triangles is small, due to numerical precision or the like, and you can ignore it, but you need to be aware of it. Hence the warnings. If those differences shouldn't be ignored, then maybe that distance is not suitable for your data in this case.
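
One way to judge whether the differences are small enough to ignore is to measure them directly. The sketch below uses a hypothetical helper, asymmetry_summary, which is not part of dtwclust, proxy or TSclust, and an invented matrix dm_toy standing in for a full cross-distance matrix. Averaging the two triangles at the end is only one possible convention and is an assumption of this sketch, not something recommended in this thread.

```r
# Hypothetical helper (not part of dtwclust, proxy or TSclust): summarises how
# far a square dissimilarity matrix is from being symmetric.
asymmetry_summary <- function(dm) {
    m <- as.matrix(dm)
    stopifnot(nrow(m) == ncol(m))
    lt <- lower.tri(m)
    diffs <- abs(m - t(m))[lt]               # |d(i, j) - d(j, i)| for each pair
    scale <- pmax(abs(m), abs(t(m)))[lt]     # larger of the two values per pair
    rel <- ifelse(scale > 0, diffs / scale, 0)
    c(max_abs = max(diffs), max_rel = max(rel))
}

# Invented matrix standing in for a full cross-distance matrix.
dm_toy <- matrix(c(0,    1.00, 2.0,
                   1.02, 0,    3.0,
                   2.0,  2.7,  0), nrow = 3L, byrow = TRUE)

asymmetry_summary(dm_toy)

# One possible convention (an assumption here): average the two directions
# before handing the matrix to functions that expect symmetry.
dm_sym <- (dm_toy + t(dm_toy)) / 2
isSymmetric(dm_sym)
```

If the relative difference is large, symmetrizing changes the measure itself, so the caution above about whether the distance suits the data still applies.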

asardaes commented 5 years ago

Also note that the distances included in dtwclust have custom proxy loops, so they don't assume symmetry based on whether only x or both x and y were provided. For example, SBD is always symmetric, but it is never safe to assume that lb_keogh or lb_improved are symmetric, so something like proxy::dist(series, method="lb_keogh", window.size=1L) will always calculate the whole matrix, not just the lower triangular.

wirginiad commented 5 years ago

Thank you for your help, and sorry for the misleading title (I started with an error, which I fixed). Some research suggested the CDM measure as the most suitable one for macroeconomic data, and I use it as a kind of robustness check. I didn't know that some CVIs require symmetry. I guess I must delve into CVIs more deeply.