FBdata closed this issue 6 years ago
It seems like something is not working correctly in your R session. The silhouette function calculates the index of the same name by calling compiled code, .C(sildist, ...), and it's not finding the (compiled) sildist function. Try reinstalling the cluster package (which provides the silhouette function) or loading it explicitly with library.
I reinstalled the cluster package and started a new R session.
The new error is:
sapply(pc, cvi, type = "valid")
Error in silhouette.default(a@cluster, dmatrix = distmat) :
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  Internal CVIs: series' cross-distance matrix is NOT symmetric, which can be problematic for: Sil D COP
2: In FUN(X[[i]], ...) :
  Internal CVIs: centroids' cross-distance matrix is NOT symmetric, which can be problematic for: DB DB*
which is indeed the same error I get if I call the silhouette function directly:
silhouette(pc[[4]]@cluster, pc[[4]]@distmat)
Error in silhouette.default(pc[[4]]@cluster, d) :
  NA/NaN/Inf in foreign function call (arg 1)
If you use the function directly, you have to specify that the distance matrix should go in the dmatrix parameter. Either way, I guess you are getting NA/NaN/Inf in either @cluster or @distmat, can you check? If that's the case, then maybe your series aren't playing nicely with DTW; maybe try other distances? Or try leaving window.size = NULL initially?
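A minimal sketch of that check, using toy data in place of the real pc[[4]]@cluster and pc[[4]]@distmat (the names x, d, cl and d_bad are made up for illustration). It tests both inputs for non-finite values and then passes the matrix to silhouette by name so that it is matched to dmatrix:

```r
library(cluster)  # provides silhouette()

# Toy data (hypothetical): six points on a line forming two obvious groups
x  <- c(1, 2, 3, 20, 21, 22)
d  <- as.matrix(dist(x))           # symmetric 6 x 6 distance matrix
cl <- c(1L, 1L, 1L, 2L, 2L, 2L)    # cluster labels

# Both inputs must be free of NA/NaN/Inf before calling silhouette
stopifnot(all(is.finite(cl)), all(is.finite(d)))

# Name the argument so the matrix is matched to dmatrix rather than dist
sil <- silhouette(cl, dmatrix = d)
mean_width <- mean(sil[, "sil_width"])

# Simulate the problem: one unreachable pair is enough to break the call.
# which(..., arr.ind = TRUE) reports the row/column of each offending entry.
d_bad <- d
d_bad[1, 4] <- d_bad[4, 1] <- Inf
bad_pairs <- which(!is.finite(d_bad), arr.ind = TRUE)
```

On the real object, the same idea would be to look at which(!is.finite(pc[[4]]@distmat), arr.ind = TRUE) and compare the lengths of the series in the reported pairs.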
BTW, maybe also see the answer to this question.
Thanks for the quick help! Indeed, I didn't see it before, but I have some Inf values in @distmat. I used DTW because I thought it was the only distance that handles vectors of unequal length. Am I wrong? I'll try to find another one. Thanks also for the Stack Overflow link. I had already seen it and it didn't really help with my specific problem, but I'll check it again.
I'll let you know if I have any problem or if I solve the issue.
The link is just in case you want to use the elbow method.
Do you have Inf in your input series? I don't think that is detected internally as an error by this package.
No, I don't have Inf in my input series!
I don't understand why dtw is not working well in my case.
FYI, SBD and GAK also support series with different lengths.
I'm still not sure why you would get an Inf as a result. Can you run
dm <- proxy::dist(list_imp_lag, list_imp_lag, method="dtw", window.type="slantedband", window.size=20L)
and check if dm has any Inf values? If it does, then it is indeed a problem with your series' properties; otherwise there might be a bug in dtw_basic, but I'd need your series to check.
Oh, I didn't know about SBD and GAK; I need to look into what these distances are for. Anyway, I tried using them instead of "dtw_basic", and it works perfectly for clustering and for evaluating the results for each value of k:
sapply(pc, cvi, type = "valid")
gives me all the clustering results for k from 2 to 15 👍
So thanks a lot for that! I'll read up on the theory behind SBD and GAK.
In the meantime, I'm trying to run your line. At first it didn't work; I got the error "Error in dtw(distance.only = TRUE, ...) : No warping path exists that is allowed by costraints". But I replaced list_imp_lag with a data frame, and now it's running (very slowly). I'll let you know when I have results.
That error is probably why you're getting Inf with dtw_basic; dtw does more checks, but it's also slower (as you have seen). Changing to a data frame might not be the right workaround. Try leaving window.size=NULL:
dm <- proxy::dist(list_imp_lag, list_imp_lag, method="dtw")
If that also fails, then the length difference between your series might be too large for DTW.
I just managed to reproduce the problem by using 2 series with very different lengths (40 and 410) and a small window size (5). You're going to have to play around with the value of window.size to get appropriate results.
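To illustrate why a small window can leave no valid warping path, here is a rough base-R sketch of DTW with a simple band constraint |i - j| <= window. This is an assumption made for illustration only: it is not the dtw_basic implementation, the function name dtw_banded is made up, and the package's window (for example the slantedband type used above) may be defined differently, so the exact threshold can differ.

```r
# Illustrative DTW with a band constraint |i - j| <= window.
# NOT the dtw_basic implementation; a simplified sketch only.
dtw_banded <- function(x, y, window = NULL) {
  n <- length(x)
  m <- length(y)
  if (is.null(window)) window <- max(n, m)  # wide enough to be unconstrained
  # cost[i + 1, j + 1] holds the accumulated cost of cell (i, j);
  # row 1 and column 1 form a border standing for "before the series start"
  cost <- matrix(Inf, n + 1, m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    lo <- max(1, i - window)
    hi <- min(m, i + window)
    if (lo > hi) next  # no cell of this row lies inside the band
    for (j in lo:hi) {
      # predecessors: diagonal (i-1, j-1), vertical (i-1, j), horizontal (i, j-1)
      best <- min(cost[i, j], cost[i, j + 1], cost[i + 1, j])
      cost[i + 1, j + 1] <- abs(x[i] - y[j]) + best
    }
  }
  cost[n + 1, m + 1]
}

# Two series with very different lengths, as in the reproduction above
short_series <- sin(seq(0, 2 * pi, length.out = 40))
long_series  <- sin(seq(0, 2 * pi, length.out = 410))

# Inf: the end cell (40, 410) lies outside a band of width 5
d_small <- dtw_banded(short_series, long_series, window = 5)
# Finite: the band is at least as wide as the length difference (370)
d_wide  <- dtw_banded(short_series, long_series, window = 370)
# Finite: no window at all
d_none  <- dtw_banded(short_series, long_series)
```

Under this simplified band, the window has to be at least as large as the difference in lengths for the last cell to be reachable, which is one way to pick a starting value when experimenting with window.size.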
Hello,
I want to cluster time series of different length, and this R package is an amazing way to do it! Thanks for it.
However, I'm having difficulty finding a method to evaluate the optimal number of clusters for my partitional clustering. When I cluster more classical data with a standard distance metric, I usually obtain it through the elbow method, silhouette, etc., but I can't find how to do it in my current case.
This is my current case:
pc <- tsclust(list_imp_lag, type = "partitional", k = c(3:15), distance = "dtw_basic", centroid = "pam", seed = 3247L, trace = TRUE, args = tsclust_args(dist = list(window.size = 20L)))
where list_imp_lag is a list of 241 numeric vectors (extract below):
List of 241
 $ : num [1:720] 99650 1860 0 0 0 ...
 $ : num [1:254] 2830 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:687] 28510 75121 0 0 0 ...
 $ : num [1:75] 5757 30288 0 0 0 ...
 $ : num [1:720] 20437 14563 9451 0 0 ...
 $ : num [1:84] 3430 0 0 0 0 0 0 0 0 0 ...
 $ : num [1:696] 3495 3157 0 0 0 ...
 $ : num [1:30] 13046 0 0 0 0 ...
 $ : num [1:38] 71305 848300 477887 0 0 ...
 $ : num [1:404] 179465 168423 144280 117150 5215 ...
 $ : num [1:119] 2694 0 0 0 0 ...
 $ : num [1:402] 32805 0 0 0 0 ...
 $ : num [1:51] 6979 31930 23705 22625 31117 ...
 $ : num [1:30] 24453 22145 16658 13891 12101 ...
My distance matrix is obviously not symmetric in that situation.
I tried to use the cvi function but got errors:
I would be grateful if someone could help me with this problem.
Thanks in advance!