Closed: JenspederM closed this issue 5 years ago.
Hi, without specific data it's hard to tell what could be happening, although I suppose it might be difficult to create a minimal example. The pick function returned by cvi_evaluators does this in the last step:
list(
    object = objs[[best_overall]][[best_by_type[best_overall]]],
    config = results[[best_overall]][best_by_type[best_overall], , drop = FALSE]
)
I.e., both elements use the exact same indices, so if there's a mismatch, it means the order was altered before the function was called. You could try executing debugonce(pick_function) and then running compare_clusterings; you might be able to see if there's a mismatch there at the beginning (after entering the call to the pick function).
Could you at least share the configuration you are using for compare_clusterings?
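A minimal sketch of that debugging workflow (the variable names are illustrative, and `series` and `cfgs` are assumed to be defined as elsewhere in this thread):

```r
# Sketch only: step into the pick function to inspect its inputs.
library(dtwclust)

evaluators <- cvi_evaluators(type = "internal")
debugonce(evaluators$pick)  # the debugger stops on the next call to the pick function
comparison <- compare_clusterings(series, types = c("p", "h"),
                                  configs = cfgs, seed = 293L,
                                  score.clus = evaluators$score,
                                  pick.clus = evaluators$pick,
                                  return.objects = TRUE)
# Inside the debugger, check that the row order of the results data frames
# still matches the order of the clustering objects before any indexing happens.
```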
I use the following configuration in compare_clusterings, with only minor modifications compared to your examples.
# Define overall configuration
cfgs <- compare_clusterings_configs(
    types = c("p", "h", "f", "t"),
    k = 3L:10L,
    controls = list(
        partitional = partitional_control(
            iter.max = 50L,
            nrep = 1L
        ),
        hierarchical = hierarchical_control(
            method = "all"
        ),
        fuzzy = fuzzy_control(
            # notice the vector
            fuzziness = c(2, 2.5)
        ),
        tadpole = tadpole_control(
            # notice the vectors
            dc = seq(5, 10, 0.5),
            window.size = 1L:3L
        )
    ),
    preprocs = pdc_configs(
        type = "preproc",
        # shared
        none = list(),
        zscore = list(center = c(TRUE, FALSE)),
        # only for fuzzy
        fuzzy = list(
            acf_fun = list()
        ),
        tadpole = list(
            zscore = list(center = c(TRUE, FALSE))
        ),
        # specify which should consider the shared ones
        share.config = c("p", "h")
    ),
    distances = pdc_configs(
        type = "distance",
        sbd = list(),
        dtw_basic = list(
            window.size = 1L:3L,
            norm = c("L1", "L2")
        ),
        dtw_lb = list(
            window.size = 1L:3L,
            norm = c("L1", "L2")
        ),
        fuzzy = list(
            L2 = list()
        ),
        share.config = c("p", "h")
    ),
    centroids = pdc_configs(
        type = "centroid",
        partitional = list(
            pam = list(),
            shape = list()
        ),
        # special name 'default'
        hierarchical = list(
            default = list()
        ),
        fuzzy = list(
            fcmdd = list()
        ),
        tadpole = list(
            default = list(),
            shape_extraction = list(znorm = TRUE)
        )
    )
)

# Remove redundant configurations (shape centroid always uses zscore preprocessing)
id_redundant <- cfgs$partitional$preproc == "none" &
    cfgs$partitional$centroid == "shape"
cfgs$partitional <- cfgs$partitional[!id_redundant, ]

# Initiate scoring & picking functions
internal_evaluators <- cvi_evaluators(type = "internal")
score_fun <- internal_evaluators$score
pick_fun <- internal_evaluators$pick

# Number of configurations is returned as an attribute
num_configs <- sapply(cfgs, attr, which = "num.configs")
cat("\nTotal number of configurations without considering optimizations:",
    sum(num_configs),
    "\n\n")
Due to the high number of configurations, I then run compare_clusterings in parallel as described in your example:
require(doParallel)
registerDoParallel(cl <- makeCluster(detectCores()))
comparison_long <- compare_clusterings(data, types = c("p", "h"),
                                       configs = cfgs,
                                       seed = 293L, trace = FALSE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)
stopCluster(cl)
registerDoSEQ()
Note that in this specific case I only make use of the configurations for hierarchical and partitional clustering.
I hope this helps. Please let me know if there is anything else I can do.
I wanted to know the configuration because maybe there was a problem during flattening, but I can't find a specific problem. I did this to try and debug (with your configuration):
series <- reinterpolate(CharTraj[1L:20L], 150L)
comparison_long <- compare_clusterings(series, types = c("p", "h"),
                                       configs = cfgs,
                                       seed = 293L, trace = FALSE,
                                       score.clus = score_fun,
                                       pick.clus = pick_fun,
                                       return.objects = TRUE)
pick <- pick_fun(comparison_long$results,
                 list(comparison_long$objects.partitional,
                      comparison_long$objects.hierarchical))
obj <- repeat_clustering(series, comparison_long, "config29_1")
cvis <- cvi(obj, type = "internal")
all.equal(comparison_long$pick$object, obj)
And the results in cvis seem to match those in pick$config. Inside repeat_clustering, the parameters for tsclust are extracted from comparison_long$results$partitional, so if the object can be correctly reproduced, I'd assume that means the configuration matches.
The call to all.equal shows that they are not identical, but the differences are just execution times and some attributes (names and the like), so it doesn't seem to hint at an issue.
Side note: I strongly recommend you set return.objects to FALSE and then use repeat_clustering, otherwise there will be a lot of deep copies of your input data.
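A sketch of that suggested workflow, using the same configuration id mentioned elsewhere in this thread as an illustrative placeholder:

```r
# Sketch: avoid deep copies by not returning the clustering objects,
# then reproduce only the winning clustering afterwards.
library(dtwclust)

comparison <- compare_clusterings(series, types = c("p", "h"),
                                  configs = cfgs, seed = 293L,
                                  score.clus = score_fun,
                                  pick.clus = pick_fun,
                                  return.objects = FALSE)
# Re-create only the chosen clustering from its id
# ("config29_1" is the id used as an example in this thread)
best <- repeat_clustering(series, comparison, "config29_1")
```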
I have now gone through the algorithms in the scoring and picking functions to try to figure out what triggers the error. However, I have not been able to find any anomalies.
I have followed your advice, and now I use repeat_clustering to generate the output of my comparison, but I still receive wildly different results when calling cvi on the output than those reported in comparison_long$pick.
I'm currently wondering if it could be an error that occurs during the parallel computation, i.e., that the scores are somehow attached incorrectly. I see in compare_clusterings that the results are computed with foreach(.combine = "rbind"), so that should not be a problem, but it is otherwise inexplicable to me.
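For what it's worth, foreach combines results in iteration order by default (.inorder = TRUE), even when a parallel backend is registered, so the combination step by itself should preserve the pairing of results and configurations. A tiny illustration:

```r
library(foreach)

# With the default .inorder = TRUE, results are combined in iteration
# order regardless of which worker finishes first.
res <- foreach(i = 1:4, .combine = "rbind") %do% data.frame(id = i, sq = i^2)
print(res$id)  # 1 2 3 4, in order
```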
I wonder if foreach's .maxcombine could be a problem. What OS are you using? And what version of dtwclust are you using? Inside compare_clusterings I don't see any call to foreach that uses .combine = rbind.
I have tried to run the configurations on both Mac OS High Sierra and Ubuntu 18.10. In both cases, the returned object did not correspond to the configuration, and furthermore, running cvi on the object from repeat_clustering did not return CVIs comparable to those reported in comparison_long$results.
I apologize, I read it wrong; I meant to say .combine = list.
The few tests I did for .maxcombine didn't show anything weird.
Are you allowed to share the data? And ideally the script you have so far?
@JenspederM after looking more closely at the code, I think I found a possible problem in the way pre-processed series were handled. Since your configurations (and most of the ones I use in my tests) only use zscore sometimes, I'm assuming z-normalization didn't have any effect on the results with the data I have, and therefore I never noticed. If you can install the latest version from GitHub, please let me know if it fixes the problem for you.
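Assuming the development version lives in the asardaes/dtwclust GitHub repository, installing it could look like this (remotes is one common way; devtools::install_github works as well):

```r
# Install the development version of dtwclust from GitHub
# (repository name assumed; adjust if needed)
install.packages("remotes")
remotes::install_github("asardaes/dtwclust")
```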
Hi,
I have been trying to use dtwclust as a basis to test different clustering algorithms against each other. For this, I have relied on the compare_clusterings() function, but when I use it, I see that the object chosen by the picking function (a product of cvi_evaluators(type = "internal")) does not correspond to the model configuration described.
I have found that the CVIs displayed correspond to those of the attached object. However, the registered configuration is completely off.