How to get the n best configs after `compare_clusterings()`?

veroandreo commented 3 years ago

Hi,

Following the examples in the vignette and manual pages, I'm using compare_clusterings_configs() plus compare_clusterings() to obtain the best cluster configuration for my dataset.

I wonder now if there's a way to get the 10 best configurations, for example, or the best one for each distance that I evaluate, i.e., DTW, SBD, etc. How can I do the scoring myself and select the best configs from the huge table at comparison_part$results$partitional?

# configs
cfg <- compare_clusterings_configs(
  types = "partitional",
  k = 2:5,
  controls = list(partitional = partitional_control(iter.max = 100L, 
                                                    nrep = 5L)),
  preprocs = pdc_configs("preproc",
                         none = list(),
                         zscore = list(center = c(FALSE, TRUE))),
  distances = pdc_configs("distance",
                          partitional = list(
                            dtw_basic = list(
                              window.size = seq(from = 1L, to = 5L, by = 1L),
                              norm = "L2"),
                            dtw_lb = list(
                              window.size = seq(from = 1L, to = 5L, by = 1L),
                              norm = "L2"),
                            sbd = list()
                            )
                          ),
  centroids = pdc_configs("centroid",
                          share.config = c("p"),
                          dba = list(
                            window.size = seq(from = 1L, to = 5L, by = 1L),
                            norm = "L2"),
                          shape = list(znorm = TRUE),
                          pam = list()
                          ),
  no.expand = "window.size"
)

# set score and pick functions
vi_evaluators <- cvi_evaluators("valid")
score_fun <- vi_evaluators$score
pick_fun <- vi_evaluators$pick 

# compare
comparison_part <- 
  compare_clusterings(data,
                      types = "partitional",
                      configs = cfg,
                      seed = 3L,
                      trace = TRUE,
                      score.clus = score_fun,
                      pick.clus = pick_fun,
                      shuffle.configs = TRUE,
                      return.objects = TRUE)

# info of the best rep
comparison_part$pick$config

Thanks in advance for any hints!

asardaes commented 3 years ago

The pick function returned by cvi_evaluators does majority voting by default, and I don't know if it's possible to generalize that to more than 1 result. You could try some heuristics perhaps, for example a result I just got with the demo data looks like this:

> apply(comparison_part$results$partitional[17:23], 2, which.max)
   Sil      D    COP     DB DBstar     CH     SF 
    59    373     98     98     91    644    236

So, in my case, COP and DB agreed. I could then figure out how many configurations appear in the top X of both, say with X=20:

intersect(
    sort(comparison_part$results$partitional$COP, decreasing = TRUE, index.return = TRUE)$ix[1:20],
    sort(comparison_part$results$partitional$DB, decreasing = TRUE, index.return = TRUE)$ix[1:20]
)
[1]  98 717 236 720  59  40  39

That's just the first thing that came to mind, I can't really say if it's a particularly good idea :stuck_out_tongue:
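
The top-X idea above could be extended to all CVIs at once by counting, for each configuration, how many CVIs place it in their top X and then sorting by that count. The following is only a minimal sketch on a small mock data frame standing in for comparison_part$results$partitional: the column names (config_id, Sil, COP, DB), the values, and top_x are illustrative assumptions, and it assumes every CVI column is oriented so that larger is better, as the calls to which.max and decreasing = TRUE above suggest.

```r
# Mock stand-in for comparison_part$results$partitional (hypothetical values).
results <- data.frame(
  config_id = paste0("config", 1:6),
  Sil = c(0.10, 0.50, 0.40, 0.30, 0.20, 0.60),
  COP = c(0.90, 0.80, 0.10, 0.70, 0.20, 0.30),
  DB  = c(0.50, 0.90, 0.20, 0.80, 0.10, 0.70),
  stringsAsFactors = FALSE
)

cvi_cols <- c("Sil", "COP", "DB")  # assumed: all scores are "larger is better"
top_x <- 3L                        # how deep to look in each CVI's ranking

# One logical column per CVI: TRUE if the configuration is in that CVI's top X.
in_top <- sapply(results[cvi_cols], function(v) {
  rank(-v, ties.method = "min") <= top_x
})

# Each CVI casts one vote for every configuration in its top X.
results$votes <- rowSums(in_top)

# Sort by votes; ties keep their original row order.
n_best <- results[order(-results$votes), ]
head(n_best, 3)
```

With real results, cvi_cols would be the names of the score columns in the table, and head(n_best, 10) would give the 10 most voted configurations. Ties in votes are common with this scheme, so a secondary criterion may be needed.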

veroandreo commented 3 years ago

Thanks for your answer @asardaes !

I thought maybe the score table within the compare_clusterings() result could hold the votes, so that it would be easy to pick the 10 most voted configs, for example. But the score table only contains the CVI values.

I'm thinking now I could also generate configs for different distances and then compare the best results from them. That way I'd have at least the best clustering config per distance, and I could then do the voting among those. I will investigate further :-)
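
A best-per-distance selection might also be possible from a single results table, without generating separate configs. This is a rough sketch on mock data, assuming the table has a distance column and that all CVI columns are "larger is better". The mean rank across CVIs is a simple aggregate chosen here for illustration, not the majority voting done by the pick function, and all names and values are hypothetical.

```r
# Mock stand-in for the results table (hypothetical values and column names).
results <- data.frame(
  config_id = paste0("config", 1:6),
  distance  = c("dtw_basic", "dtw_basic", "dtw_lb", "dtw_lb", "sbd", "sbd"),
  Sil = c(0.10, 0.50, 0.40, 0.30, 0.20, 0.60),
  CH  = c(10, 40, 50, 20, 30, 60),
  stringsAsFactors = FALSE
)

cvi_cols <- c("Sil", "CH")  # assumed: all scores are "larger is better"

# Rank configurations within each CVI (1 = best), then average the ranks.
ranks <- sapply(results[cvi_cols], function(v) rank(-v))
results$mean_rank <- rowMeans(ranks)

# Within each distance, keep the configuration with the lowest mean rank.
best_per_distance <- do.call(
  rbind,
  lapply(split(results, results$distance), function(d) {
    d[which.min(d$mean_rank), ]
  })
)
best_per_distance
```

The per-distance winners could then be compared with each other, for example by voting among them.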

asardaes / dtwclust

How to get the n best configs after `compare_clusterings()`? #49