asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

Choosing optimum number of clusters with cvi #31

Closed paco-ceam closed 6 years ago

paco-ceam commented 6 years ago

Hi Alexis, and thanks for the dtwclust package.

I'm trying to cluster a set of 115 temperature series with dtwclust, but I'm not sure how to choose the optimum number of clusters and clustering method. I have tried partitional clustering (as seen in an example) with a predefined range of cluster numbers.

  # Cluster analysis
  pc_dtw.max <- tsclust(tmax.estiu2, k = 10:20, preproc = zscore,
                        type = "partitional",
                        distance = "dtw_basic", centroid = "dba",
                        trace = TRUE, seed = 100,
                        window.size = 10L,
                        args = tsclust_args(cent = list(trace = TRUE)))

As I am not an expert, I have tried changing some parameters based on the dtwclust documentation, but could not find big differences in the results. Now I'm trying to run cvi for a range of cluster numbers to see which number of clusters is "better".

I tried

  sapply(pc_dtw.max, cvi, type = "internal")

which gives this output:

                [,1]        [,2]         [,3]        [,4]         [,5]        [,6]
  Sil     0.007919614 0.005149032 -0.005419144 0.008966472 0.0001571701 -0.02020825
  SF      0.000000000 0.000000000  0.000000000 0.000000000 0.0000000000  0.00000000
  CH     10.461114581 9.445447091  8.796729501 8.349410182 7.7393179829  7.13712605
  DB      1.996880488 1.687455700  1.570076739 1.710654571 1.8024923232  1.98925128
  DBstar  2.284530403 1.867781923  1.713612863 1.857947612 2.0986865794  2.23919398
  D       0.318891976 0.351217226  0.340582275 0.360524182 0.3591513913  0.38330590
  COP     0.496265373 0.487373747  0.469942617 0.464935361 0.4707556623  0.45977107
               [,7]        [,8]         [,9]       [,10]       [,11]
  Sil    -0.01923315 -0.01936116 -0.009273884 -0.02458494 -0.02990423
  SF      0.00000000  0.00000000  0.000000000  0.00000000  0.00000000
  CH      6.89926316  6.18977503  5.820197842  5.39745880  5.23057420
  DB      1.88083424  1.73262791  1.655580859  1.85372970  1.75415734
  DBstar  2.13514378  1.92340902  1.888174174  2.10086193  2.07282527
  D       0.29840968  0.35715701  0.326376637  0.34778048  0.31852244
  COP     0.46009509  0.45549048  0.451436062  0.44625196  0.44034457

but I can't figure out how to interpret all these indexes. Should I look for the absolute lowest value across all indexes and choose the associated number of clusters? Are negative "Sil" values meaningless? Or should I look for the number of clusters with the lowest values across most indexes?

Thanks and best regards

asardaes commented 6 years ago

Unfortunately, that's an area where no best approach exists and there is no way to tell what's objectively best. You can try some "voting" among the indexes, or choose a subset and base your decision on that, or use a more interactive approach (see the ssdtwclust app). There are simpler ways too; check this answer.

Even choosing a subset of CVIs to work with is something I can't help you with: I'd have to read the associated paper for each one and then decide which might work best for my goal. You might have to do just that (the main references are in the documentation of the cvi function).
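One small thing that helps regardless of which subset you pick: label the columns of the CVI matrix with the k values, since sapply only gives [,1] through [,11]. A minimal sketch, assuming your call used k = 10:20 so column j corresponds to k = j + 9 (the toy matrix below stands in for the real result of sapply(pc_dtw.max, cvi, type = "internal") so the snippet runs on its own):

```r
# In practice: cvis <- sapply(pc_dtw.max, cvi, type = "internal")
# Toy stand-in with the same shape (7 internal CVIs would give 7 rows; 2 shown here):
cvis <- matrix(seq_len(22), nrow = 2,
               dimnames = list(c("Sil", "DB"), NULL))

colnames(cvis) <- paste0("k=", 10:20)  # column j corresponds to k = j + 9
cvis["Sil", "k=12"]                    # now indexable by k directly
```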

asardaes commented 6 years ago

What I mean with voting is something like

cvis <- sapply(pc_dtw.max, cvi, type=c("DB", "DBstar"))
apply(cvis, 1L, which.min)
# ^^ now you have two votes, DB and DB*, each suggesting a certain number of clusters

As a side note: I think that the SF index only works well with distances that are normalized (the distance itself, not the time series).
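Extending the voting idea to all of the internal CVIs is straightforward; the only catch is that Sil, SF, CH and D are maximized while DB, DB* and COP are minimized (see ?cvi). A sketch (the vote_k helper is my own, not part of dtwclust):

```r
# Tally one "vote" per CVI across a matrix like the one returned by
# sapply(pc_dtw.max, cvi, type = "internal").
# ks must match the k values passed to tsclust (10:20 in your call).
vote_k <- function(cvis, ks) {
  to_max <- c("Sil", "SF", "CH", "D")  # higher is better
  to_min <- c("DB", "DBstar", "COP")   # lower is better
  votes <- c(
    apply(cvis[rownames(cvis) %in% to_max, , drop = FALSE], 1L, which.max),
    apply(cvis[rownames(cvis) %in% to_min, , drop = FALSE], 1L, which.min)
  )
  ks[votes]  # one suggested k per index
}

# Usage: table(vote_k(cvis, 10:20)) shows which k collects the most votes
```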

paco-ceam commented 6 years ago

Hi Alexis, I've also read the Stack Overflow answer. I'll read some of the references and try to decide. Thanks.