asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

Getting Inf as a result when using dtw_basic #27

Closed FBdata closed 6 years ago

FBdata commented 6 years ago

Hello,

I want to cluster time series of different lengths, and this R package is an amazing way to do it! Thanks for it.

However, I am having difficulty finding a method to evaluate the optimal number of clusters for my partitional clustering. When I cluster more classical data with a standard distance metric, I usually obtain it through the elbow method, the silhouette, and so on. But I can't find how to do it in my current case.

This is my current case:

pc <- tsclust(list_imp_lag, type = "partitional", k = c(3:15), distance = "dtw_basic", centroid = "pam", seed = 3247L, trace = TRUE, args = tsclust_args(dist = list(window.size = 20L)))

where list_imp_lag is a list of 241 numeric vectors (extract below):

List of 241
 $ : num [1:720] 99650 1860 0 0 0 ...
 $ : num [1:254] 2830 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:687] 28510 75121 0 0 0 ...
 $ : num [1:75] 5757 30288 0 0 0 ...
 $ : num [1:720] 20437 14563 9451 0 0 ...
 $ : num [1:84] 3430 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:696] 3495 3157 0 0 0 ...
 $ : num [1:30] 13046 0 0 0 0 ...
 $ : num [1:38] 71305 848300 477887 0 0 ...
 $ : num [1:404] 179465 168423 144280 117150 5215 ...
 $ : num [1:119] 2694 0 0 0 0 ...
 $ : num [1:402] 32805 0 0 0 0 ...
 $ : num [1:51] 6979 31930 23705 22625 31117 ...
 $ : num [1:30] 24453 22145 16658 13891 12101 ...

My distance matrix is obviously not symmetric in that situation.

I tried to use the cvi function but got errors:

sapply(pc, cvi, type = "valid")
Error in silhouette.default(a@cluster, dmatrix = distmat) :
  object 'sildist' not found
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Internal CVIs: series' cross-distance matrix is NOT symmetric, which can be problematic for: Sil D COP
2: In FUN(X[[i]], ...) :
  Internal CVIs: centroids' cross-distance matrix is NOT symmetric, which can be problematic for: DB DB*

It would be very helpful if someone can help me with that problem.

Thanks in advance!

asardaes commented 6 years ago

It seems like something is not working correctly in your R session. The silhouette function calculates the index of the same name by calling compiled code, .C(sildist, ...), and it's not finding the (compiled) sildist function. Try reinstalling the cluster package (which provides the silhouette function) or loading it explicitly with library.

FBdata commented 6 years ago

I reinstalled the cluster package and started a new R session.

The new error is:

sapply(pc, cvi, type = "valid")
Error in silhouette.default(a@cluster, dmatrix = distmat) :
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Internal CVIs: series' cross-distance matrix is NOT symmetric, which can be problematic for: Sil D COP
2: In FUN(X[[i]], ...) :
  Internal CVIs: centroids' cross-distance matrix is NOT symmetric, which can be problematic for: DB DB*

which is the same error I get if I call the silhouette function directly:

silhouette(pc[[4]]@cluster, pc[[4]]@distmat)
Error in silhouette.default(pc[[4]]@cluster, d) :
  NA/NaN/Inf in foreign function call (arg 1)

asardaes commented 6 years ago

If you use the function directly, you have to specify that the distance matrix should go in the dmatrix parameter. Either way, I guess you are getting NA/NaN/Inf in either @cluster or @distmat; can you check? If that's the case, then maybe your series aren't playing nicely with DTW. Maybe try other distances, or leave window.size = NULL initially?
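As a minimal sketch of the two points above (pass the matrix through dmatrix, and check it for non-finite values first), the following uses hypothetical toy data rather than the series from this issue; `pts`, `cl`, and `dm` are illustrative stand-ins for the real objects such as pc[[4]]@cluster and pc[[4]]@distmat.

```r
library(cluster)  # provides silhouette()

# Toy data (hypothetical): two well-separated groups of 2-D points
set.seed(1)
pts <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
             matrix(rnorm(10, mean = 10), ncol = 2))
cl <- rep(1:2, each = 5)    # integer cluster codes, one per observation
dm <- as.matrix(dist(pts))  # full symmetric distance matrix

# Guard: the compiled silhouette code rejects NA/NaN/Inf, so check first
all_finite <- all(is.finite(dm))

# Pass the matrix by name; the second positional argument is `dist`,
# which expects a "dist" object rather than a full matrix
sil <- silhouette(cl, dmatrix = dm)
avg_width <- mean(sil[, "sil_width"])
```

If `all_finite` is FALSE, the silhouette call would be expected to fail with the NA/NaN/Inf error shown earlier, so the distance matrix is the first thing to inspect.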

asardaes commented 6 years ago

BTW, maybe also see the answer to this question.

FBdata commented 6 years ago

Thanks for the quick help! Indeed, I hadn't noticed it before, but I have some Inf values in @distmat. I used DTW because I thought it was the only distance that handles vectors of unequal length. Am I wrong? I'll try to find another one. Thanks also for the Stack Overflow link. I had already seen it and it didn't really help with my specific problem, but I'll check it again.

I'll let you know if I run into any problems or if I solve the issue.

asardaes commented 6 years ago

The link is just in case you want to use the elbow method.

Do you have Inf in your input series? I don't think that is detected internally as an error by this package.

FBdata commented 6 years ago

No, I don't have Inf in my input series!

I don't understand why dtw is not working well in my case.

asardaes commented 6 years ago

FYI, SBD and GAK also support series with different lengths.
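To illustrate why a shape-based distance has no trouble with unequal lengths, here is a rough base-R sketch of SBD as usually defined (one minus the maximum normalized cross-correlation over all shifts). It is a naive loop written for this illustration, not dtwclust's implementation, and the helper name `sbd_sketch` is hypothetical; series are commonly z-normalized first, which is skipped here.

```r
# Naive shape-based distance: 1 - max over shifts of the cross-correlation,
# normalized by the product of the two series' Euclidean norms.
sbd_sketch <- function(x, y) {
  n <- length(x)
  m <- length(y)
  shifts <- seq(-(m - 1), n - 1)  # every overlap of y slid along x
  cc <- vapply(shifts, function(s) {
    i <- seq(max(1, 1 + s), min(n, m + s))  # indices of x overlapping y
    sum(x[i] * y[i - s])
  }, numeric(1))
  1 - max(cc) / sqrt(sum(x^2) * sum(y^2))
}

long_series  <- c(0, 0, 1, 2, 1, 0)  # length 6
short_series <- c(1, 2, 1)           # length 3, same shape without padding

d_same_shape <- sbd_sketch(long_series, short_series)
d_other      <- sbd_sketch(long_series, c(1, -1, 1))
```

Because every shift is scored, the length difference never makes an alignment impossible, so the result stays finite; here `d_same_shape` comes out at zero (up to rounding) while `d_other` is clearly larger.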

I'm still not sure why you would get an Inf as a result. Can you run

dm <- proxy::dist(list_imp_lag, list_imp_lag, method="dtw", window.type="slantedband", window.size=20L)

and check if dm has any Inf values? If it does, then it is indeed a problem with your series' properties, otherwise there might be a bug in dtw_basic, but I'd need your series to check.
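One way to do that check, sketched on a hypothetical 3 x 3 matrix instead of the real 241 x 241 one, is to list which pairs are non-finite and how much their lengths differ. The names `series`, `dm`, and `report` are illustrative; an object returned by proxy::dist may need as.matrix() first, which is an assumption not verified here.

```r
# Hypothetical stand-ins: three series of different lengths and a
# cross-distance matrix in which one pair could not be aligned
set.seed(42)
series <- list(a = rnorm(40), b = rnorm(45), c = rnorm(410))
dm <- matrix(c(0,   3, Inf,
               3,   0, 7,
               Inf, 7, 0),
             nrow = 3, byrow = TRUE,
             dimnames = list(names(series), names(series)))

has_bad <- any(!is.finite(dm))                # TRUE if any NA/NaN/Inf

bad  <- which(!is.finite(dm), arr.ind = TRUE) # row/col index of each bad cell
lens <- lengths(series)
report <- data.frame(
  row         = rownames(dm)[bad[, "row"]],
  col         = colnames(dm)[bad[, "col"]],
  length_diff = unname(abs(lens[bad[, "row"]] - lens[bad[, "col"]]))
)
```

If the non-finite cells cluster around pairs with a large `length_diff`, that points to the window constraint rather than to the data values themselves.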

FBdata commented 6 years ago

Oh, I didn't know about SBD and GAK; I need to look into what these distances are for. Anyway, I tried using them instead of "dtw_basic" and it works perfectly for clustering and for evaluating the results for each value of k:

sapply(pc, cvi, type = "valid")

gives me all the clustering results for k from 2 to 15 👍


So, thanks a lot for that! I'll read up on the theory behind SBD and GAK.

In the meantime, I'm trying to run your line. At first it didn't work; proxy::dist failed with "Error in dtw(distance.only = TRUE, ...) : No warping path exists that is allowed by costraints". But I replaced list_imp_lag with a data frame, and now it's running (it takes very long). I'll let you know when I have results.

asardaes commented 6 years ago

That error is probably why you're getting Inf with dtw_basic; dtw does more checks, but it's also slower (as you have seen). Changing to a data frame might not be the right workaround. Try leaving window.size=NULL:

dm <- proxy::dist(list_imp_lag, list_imp_lag, method="dtw")

If that also fails, then the length difference between your series might be too large for DTW.

asardaes commented 6 years ago

I just managed to reproduce the problem by using 2 series with very different lengths (40 and 410) and a small window size (5). You're going to have to play around with the value of window.size to get appropriate results.
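The effect can be sketched with a small base-R DTW. This is an illustration only: it assumes a plain Sakoe-Chiba band (cell (i, j) is allowed when |i - j| <= w), which is not necessarily the window shape that dtw_basic or dtw's slantedband uses, and the function name `dtw_window` is hypothetical.

```r
# DTW distance with an absolute-difference local cost and a Sakoe-Chiba
# band of half-width w. Cells outside the band keep their initial Inf.
dtw_window <- function(x, y, w = Inf) {
  n <- length(x)
  m <- length(y)
  D <- matrix(Inf, n + 1, m + 1)  # padded cumulative cost matrix
  D[1, 1] <- 0
  for (i in seq_len(n)) {
    j_lo <- max(1, i - w)
    j_hi <- min(m, i + w)
    if (j_lo > j_hi) next         # whole row lies outside the band
    for (j in j_lo:j_hi) {
      cost <- abs(x[i] - y[j])
      D[i + 1, j + 1] <- cost + min(D[i, j], D[i, j + 1], D[i + 1, j])
    }
  }
  D[n + 1, m + 1]                 # Inf if the end cell is unreachable
}

# Same lengths as in the reproduction above: 40 and 410
x <- sin(seq(0, 2 * pi, length.out = 40))
y <- sin(seq(0, 2 * pi, length.out = 410))

d_narrow   <- dtw_window(x, y, w = 5)    # end cell (40, 410) is outside the band
d_wide     <- dtw_window(x, y, w = 370)  # band just covers the length difference
d_unbanded <- dtw_window(x, y)           # no window at all
```

Under this band definition the window must be at least as wide as the length difference for any warping path to reach the final cell, which is why `d_narrow` is Inf while the other two are finite.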