asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

Silhouette width for TADPole method #33

Closed steipatr closed 6 years ago

steipatr commented 6 years ago

Hi, I am trying to use TADPole (and other methods) for time series clustering. I want to use silhouette width to compare solutions for different cluster counts k. For SBD, GAK, etc. I can easily extract the silhouette width, but for TADPole I get the following message:

A second set of cluster membership indices is required in 'b' for this/these CVI(s).

In the cvi function (which I am using to get the silhouette widths), you give for b:

b - If needed, a vector that can be coerced to integers which indicate the cluster
memberships. The ground truth (if known) should be provided here.

but this makes little sense to me, since providing the ground truth (which is unlikely to be known) somewhat defeats the purpose of using silhouette width to find the best value for k. Could you perhaps clarify what this means? Is it because of TADPole's pruning of distance calculations that the silhouette cannot be calculated? What should I provide here as the second set of membership indices? I would be very grateful for your feedback. I have looked into the source code, but as an R novice, I fear it's a bit beyond my comprehension.

asardaes commented 6 years ago

You don't need the ground truth for the Silhouette index. If you're getting that error, it's for a different reason. How are you calling cvi? Can you provide some code?

asardaes commented 6 years ago

Your question made me think about a couple of things which you should consider, but my previous comment is still applicable.

TADPole clustering is a very particular algorithm. It actually uses 3 distances: LB_Keogh, DTW and Euclidean. If I remember correctly, all internal CVIs use distance calculations for their computation, and it's not entirely obvious which of the 3 distances one should use. Nevertheless, I realized that right now you probably won't get a "correct" Silhouette index for TADPole with the CRAN version of dtwclust, because the distance function that is returned in the @family slot during TADPole clustering (which is used to calculate CVIs) is actually set to dtw_lb. It's been like that for a long time, and I had not realized that it would affect the calculation of CVIs.
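To illustrate why the choice of distance matters here, below is a minimal base-R sketch of the average silhouette width computed from a precomputed distance matrix. It is not dtwclust's implementation; the function name `silhouette_width` and the toy data are assumptions made for illustration, and Euclidean and Manhattan distances stand in for the DTW-related distances mentioned above.

```r
# Average silhouette width from a precomputed distance matrix.
# d:  pairwise distances (a "dist" object or a symmetric matrix)
# cl: vector of cluster memberships, one entry per observation
# Hypothetical helper for illustration; not part of dtwclust.
silhouette_width <- function(d, cl) {
  d <- as.matrix(d)
  ids <- unique(cl)
  stopifnot(nrow(d) == length(cl), length(ids) >= 2L)
  s <- numeric(nrow(d))
  for (i in seq_len(nrow(d))) {
    own <- cl == cl[i]
    own[i] <- FALSE
    if (!any(own)) {
      s[i] <- 0  # common convention for singleton clusters
      next
    }
    # a: mean distance to the other members of the same cluster
    a <- mean(d[i, own])
    # b: smallest mean distance to the members of any other cluster
    b <- min(vapply(setdiff(ids, cl[i]),
                    function(k) mean(d[i, cl == k]),
                    numeric(1)))
    s[i] <- (b - a) / max(a, b)
  }
  mean(s)
}

# Toy data: two groups of five series with ten points each.
set.seed(1)
x <- rbind(matrix(rnorm(5 * 10, mean = 0), nrow = 5),
           matrix(rnorm(5 * 10, mean = 5), nrow = 5))
cl <- rep(1:2, each = 5)

# Same memberships, different distances: the index generally differs.
sil_euclidean <- silhouette_width(dist(x, method = "euclidean"), cl)
sil_manhattan <- silhouette_width(dist(x, method = "manhattan"), cl)
```

The memberships are identical in both calls, so any difference between the two values comes from the distance alone. This is the ambiguity described above for TADPole.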

I will change the code so that it uses DTW for the distance calculations in cvi (for TADPole), even though that is not 100% "correct" as I mentioned. In the meantime, you can adjust your script to account for these things, but I think it's easier if you first show me the code you're using to run TADPole.

steipatr commented 6 years ago

Sure! I attached my RStudio script and the associated time series data file. My code for the silhouette is pretty straightforward, pulled from the vignettes/examples.

I also left in the ggplot2 clustered lines plot, since the visualization may be informative. Because the time series are model-generated, I do know the ground truth clusters, but obviously I'd like to see whether any clustering methods can find the clusters on their own, or at least get close.

dtwclust_silhouette_steipatr.zip

asardaes commented 6 years ago

Internal CVIs are only implemented for TSClusters objects, which are returned by tsclust. It is true that tsclust calls TADPole, but it also does more things to the result. If you want to use cvi, you will have to go through tsclust:

tadpole_result <- tsclust(df_outcomes, type = "tadpole", k = 3L,
                          control = tadpole_control(dc = 10, window.size = 20L))

To account for what I mentioned earlier, you'll have to do the following before calling cvi:

tadpole_result@family@dist <- dtwclust:::ddist2("dtw_basic", tadpole_result@control)

You could change "dtw_basic" to other distances if you prefer, but each one will give you different Silhouette results. "dtw_basic" will be the default in the next release.
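For comparing several values of k, as in the original question, a rough sketch could look like the following. It assumes the CharTraj sample data that ships with dtwclust instead of the attached data set, and the dc and window.size values are placeholders rather than recommendations.

```r
library(dtwclust)

# CharTraj ships with dtwclust; TADPole needs series of equal length.
data(uciCT)
series <- reinterpolate(CharTraj, new.length = max(lengths(CharTraj)))
series <- zscore(series)

ks <- 2:4  # candidate cluster counts (arbitrary choice for this sketch)
sil_by_k <- vapply(ks, function(k) {
  res <- tsclust(series, type = "tadpole", k = k,
                 control = tadpole_control(dc = 1.5, window.size = 20L))
  # On CRAN versions that predate the fix, set res@family@dist as shown
  # above before this call.
  cvi(res, type = "Sil")
}, numeric(1))
names(sil_by_k) <- ks

# Larger silhouette values indicate better-separated clusters.
best_k <- ks[which.max(sil_by_k)]
```

Because the index depends on the distance stored in the @family slot, values are only comparable across runs that use the same distance.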

steipatr commented 6 years ago

Ah, that does it, of course. I should have checked the types of the returned objects and noticed that tsclust returns something different from TADPole. Thank you for your help. I am closing this issue.

asardaes commented 6 years ago

Let's leave it open until I make a new release, because the current CRAN version still uses "dtw_lb".