asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

compare_clusterings pick - config and object differ #41

Closed JenspederM closed 5 years ago

JenspederM commented 5 years ago

Hi,

I have been trying to use dtwclust as a basis to test different clustering algorithms against each other. For this, I have relied on the compare_clusterings() function, but when I use it, I see that the object chosen by the picking function (produced by cvi_evaluators(type = "internal")) does not correspond to the model configuration reported with it.

I have found that the CVIs displayed correspond to those of the attached object; however, the configuration reported alongside it is completely off.

asardaes commented 5 years ago

Hi, without specific data it's hard to tell what could be happening, although I suppose it might be difficult to create a minimal example. The pick function returned by cvi_evaluators does this in the last step:

list(
    object = objs[[best_overall]][[best_by_type[best_overall]]],
    config = results[[best_overall]][best_by_type[best_overall], , drop = FALSE]
)

I.e., both elements use the exact same indices, so if there's a mismatch, it means the order was altered before the function was called. You could try executing debugonce(pick_function) and then running compare_clusterings; you might be able to see whether there is already a mismatch at the beginning (after entering the call to the pick function).
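
For example, something along these lines (just a rough sketch; `series` and `cfgs` are placeholders for your data and configuration):

# Rough sketch: set a one-shot breakpoint on the pick function and re-run
# the comparison; `series` and `cfgs` stand in for your own objects
evaluators <- cvi_evaluators(type = "internal")
debugonce(evaluators$pick)
comparison <- compare_clusterings(series, types = c("p", "h"),
                                  configs = cfgs, seed = 293L,
                                  score.clus = evaluators$score,
                                  pick.clus = evaluators$pick,
                                  return.objects = TRUE)
# Once the browser opens, check whether the order of the rows in `results`
# already disagrees with the order of the entries in `objs`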

asardaes commented 5 years ago

Could you at least share the configuration you are using for compare_clusterings?

JenspederM commented 5 years ago

I use the following configuration in compare_clusterings, with only minor modifications compared to your examples.

# Define overall configuration
cfgs <- compare_clusterings_configs(
  types = c("p", "h", "f", "t"),
  k = 3L:10L,
  controls = list(
    partitional = partitional_control(
      iter.max = 50L,
      nrep = 1L
    ),
    hierarchical = hierarchical_control(
      method = "all"
    ),
    fuzzy = fuzzy_control(
      # notice the vector
      fuzziness = c(2, 2.5)
    ),
    tadpole = tadpole_control(
      # notice the vectors
      dc = seq(5, 10, 0.5),
      window.size = 1L:3L
    )
  ),
  preprocs = pdc_configs(
    type = "preproc",
    # shared
    none = list(),
    zscore = list(center = c(TRUE, FALSE)),
    # only for fuzzy
    fuzzy = list(
      acf_fun = list()
    ),
    tadpole = list(
      zscore = list(center = c(TRUE, FALSE))
    ),
    # specify which should consider the shared ones
    share.config = c("p", "h")
  ),
  distances = pdc_configs(
    type = "distance",
    sbd = list(),
    dtw_basic = list(
      window.size = 1L:3L,
      norm = c("L1", "L2")
    ),
    dtw_lb = list(
      window.size = 1L:3L,
      norm = c("L1", "L2")
    ),
    fuzzy = list(
      L2 = list()
    ),
    share.config = c("p", "h")
  ),
  centroids = pdc_configs(
    type = "centroid",
    partitional = list(
      pam = list(),
      shape = list()
    ),
    # special name 'default'
    hierarchical = list(
      default = list()
    ),
    fuzzy = list(
      fcmdd = list()
    ),
    tadpole = list(
      default = list(),
      shape_extraction = list(znorm = TRUE)
    )
  )
)

# Remove redundant (shape centroid always uses zscore preprocessing)
id_redundant <- cfgs$partitional$preproc == "none" &
    cfgs$partitional$centroid == "shape"
cfgs$partitional <- cfgs$partitional[!id_redundant, ]

# Initiate Scoring & Picking Function
internal_evaluators <- cvi_evaluators(type = "internal")
score_fun <- internal_evaluators$score
pick_fun <- internal_evaluators$pick

# Number of configurations is returned as attribute
num_configs <- sapply(cfgs, attr, which = "num.configs")
cat("\nTotal number of configurations without considering optimizations:",
    sum(num_configs),
    "\n\n")

Due to the high number of configurations, I then run compare_clusterings in parallel as described in your example:


require(doParallel)
registerDoParallel(cl <- makeCluster(detectCores()))

comparison_long <- compare_clusterings(data, types = c("p", "h"),
                                       configs = cfgs,
                                       seed = 293L, trace = FALSE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)

stopCluster(cl)

registerDoSEQ()

Note that in this specific case I only make use of the configurations for partitional and hierarchical clusterings.

I hope this helps. Please let me know if there is anything else I can do.

asardaes commented 5 years ago

I wanted to see the configuration because there might have been a problem during flattening, but I can't find anything specific. I did the following to try to debug (with your configuration):

series <- reinterpolate(CharTraj[1L:20L], 150L)
comparison_long <- compare_clusterings(series, types = c("p", "h"),
                                       configs = cfgs,
                                       seed = 293L, trace = FALSE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)

pick <- pick_fun(comparison_long$results,
                 list(comparison_long$objects.partitional,
                      comparison_long$objects.hierarchical))

obj <- repeat_clustering(series, comparison_long, "config29_1")
cvis <- cvi(obj, type = "internal")

all.equal(comparison_long$pick$object, obj)

And the results in cvis seem to match those in pick$config. Inside repeat_clustering, the parameters for tsclust are extracted from comparison_long$results$partitional, so if the object can be correctly reproduced, I'd assume that means the configuration matches.

The call to all.equal shows that they are not identical, but the differences are just execution times and some attributes (names or whatnot), so it doesn't seem to hint at an issue.

Side note: I strongly recommend you set return.objects to FALSE and then use repeat_clustering, otherwise there will be a lot of deep copies of your input data.
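
Something along these lines (a minimal sketch reusing the names from your script; "config29_1" is just a placeholder id):

# Minimal sketch: don't keep every clustering object, re-create only the one
# of interest afterwards ("config29_1" is just a placeholder id)
comparison_long <- compare_clusterings(series, types = c("p", "h"),
                                       configs = cfgs, seed = 293L,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = FALSE)
obj <- repeat_clustering(series, comparison_long, "config29_1")
cvi(obj, type = "internal")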

JenspederM commented 5 years ago

I have now gone through the algorithms in the scoring and picking functions to try to figure out what triggers the mistake. However, I have not been able to find any anomalies.

I have followed your advice and now use repeat_clustering to generate the output of my comparison, but calling cvi on that output still gives results that are wildly different from those reported in comparison_long$pick.

I'm currently wondering if it could be a mistake that occurs during the parallel computation, i.e. that somehow the scores are attached incorrectly. I see in compare_clusterings that the results are combined with foreach(.combine = "rbind"), so it should not be a problem, but it is otherwise inexplicable to me.
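
For reference, a tiny standalone sketch (not dtwclust-specific) of why I would expect foreach to keep results in task order:

# Tasks finish at random times, but foreach (with the default .inorder = TRUE)
# combines the results in task order, so the combination itself should not
# scramble rows
library(doParallel)
registerDoParallel(cl <- makeCluster(2L))
out <- foreach(i = 1:4, .combine = "rbind") %dopar% {
    Sys.sleep(runif(1))
    data.frame(task = i)
}
stopCluster(cl)
registerDoSEQ()
print(out)  # rows appear as tasks 1, 2, 3, 4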

asardaes commented 5 years ago

I wonder if foreach's .maxcombine could be a problem. What OS are you using?

asardaes commented 5 years ago

And what version of dtwclust are you using? Inside compare_clusterings I don't see any call to foreach that uses .combine = rbind.

JenspederM commented 5 years ago

I have tried running the configurations on both macOS High Sierra and Ubuntu 18.10. In both cases, the returned object did not correspond to the configuration, and, furthermore, running cvi on the object from repeat_clustering did not return CVIs comparable to those reported in comparison_long$results.

I apologize, I read it wrong; I meant to say .combine = list.

asardaes commented 5 years ago

The few tests I did for .maxcombine didn't show anything weird.

Are you allowed to share the data? And ideally the script you have so far?

asardaes commented 5 years ago

@JenspederM after looking more closely at the code, I think I found a possible problem in the way pre-processed series were handled. Since your configurations (and most of the ones I use in my tests) only use zscore some of the time, I'm assuming z-normalization didn't have any effect on the results with the data I have, which is why I never noticed. If you can install the latest version from GitHub, please let me know whether it fixes the problem for you.
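
For example (assuming you have the remotes package; devtools::install_github works too):

# Install the development version from GitHub
install.packages("remotes")
remotes::install_github("asardaes/dtwclust")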