Add function parameter for hclust method

JakeNel28 commented 6 years ago

Hi guys,

Came across your package just recently, and really love the idea of it. My analytics team wants to start incorporating it into our workflow. I was hoping for clarification on why results differ so much when using hierarchical clustering from the stats package. I hope I'm not missing something obvious here.

I've simulated a data set and clustered it with both diceR::consensus_cluster and stats::hclust, using 5 clusters and "euclidean" distances in both cases. I run consensus_cluster with only one rep and 100% of the items, so I expect the cluster assignments to be very similar.

library(tidyverse)
library(diceR)
library(stats)

#simulate data set
means <-  c(3.23, 2.78, 3.85, 3.55, 3.95, 3.21, 3.52, 4.42, 4.15,  3.68)
sds <- c(1.48, 1.4, 1.31, 1.43, 1.44, 1.41, 1.25, 1.04, 1.08, 1.26)
N <- 500
set.seed(100)
input_data <- map2(means, sds, ~ rnorm(n = N, mean = .x, sd = .y)) %>%
  set_names(paste("Var", 1:length(.), sep = "_")) %>%
  as_tibble()

#setup problem
k <- 5

#Create two models, one from diceR::consensus_cluster and the other from stats
dicer_clusters <- consensus_cluster(input_data, nk = k, algorithms = "hc", 
                                    reps = 1, p.item = 1) %>%
  as.data.frame.table() %>%
  select(Freq) %>%
  as_vector() %>% 
  unname()

stats_clusters <- dist(input_data, method = "euclidean") %>%
  hclust() %>%
  cutree(k)

#Compare
table(dicer_clusters, stats_clusters)
#>               stats_clusters
#> dicer_clusters   1   2   3   4   5
#>              1  98 124 179  37  54
#>              2   0   0   2   0   0
#>              3   2   0   0   0   0
#>              4   0   0   2   0   0
#>              5   0   0   0   0   2

Created on 2018-05-18 by the reprex package (v0.2.0).

As the table shows, diceR places almost every observation in the first cluster, whereas stats spreads them across all 5 clusters. I would expect each row and column of this cross-tabulation to have a single dominant cell. I've had similar experiences with other algorithms and on more "real" data sets.

Can you help clarify why the results are not comparable?

dchiu911 commented 6 years ago

Hi @JakeNel28,

The difference arises because diceR:::hc uses method = "average", whereas hclust defaults to method = "complete". Currently we do not allow the user to pass a different linkage method. Please let me know if you would like this feature added.
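
To see how much the linkage choice alone can change the partition, here is a minimal base-R sketch that does not call diceR at all. It rebuilds a data set like the one in the report with mapply (an assumption made only to avoid the tidyverse dependency) and cuts the same distance matrix into 5 clusters under both linkages. The object names x, d, cl_complete and cl_average are illustrative.

```r
# Simulate 500 observations of 10 independent normal variables
set.seed(100)
means <- c(3.23, 2.78, 3.85, 3.55, 3.95, 3.21, 3.52, 4.42, 4.15, 3.68)
sds   <- c(1.48, 1.4, 1.31, 1.43, 1.44, 1.41, 1.25, 1.04, 1.08, 1.26)
x <- mapply(function(m, s) rnorm(500, mean = m, sd = s), means, sds)

# One Euclidean distance matrix, shared by both clusterings
d <- dist(x, method = "euclidean")

# Same distances, same k, different linkage method
cl_complete <- cutree(hclust(d, method = "complete"), k = 5)  # hclust default
cl_average  <- cutree(hclust(d, method = "average"),  k = 5)

# Cross-tabulate the two partitions
table(cl_average, cl_complete)
```

If the two linkages agreed, each row of the table would have a single non-zero cell. Any spread across a row comes from the linkage method alone, since the data, distance and k are identical.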

JakeNel28 commented 6 years ago

I believe it would be a worthy inclusion!

dchiu911 commented 6 years ago

Hi @JakeNel28, the two methods now agree when hc.method = "complete" is passed to consensus_cluster:

library(tidyverse)
library(diceR)
library(stats)

#simulate data set
means <-  c(3.23, 2.78, 3.85, 3.55, 3.95, 3.21, 3.52, 4.42, 4.15,  3.68)
sds <- c(1.48, 1.4, 1.31, 1.43, 1.44, 1.41, 1.25, 1.04, 1.08, 1.26)
N <- 500
set.seed(100)
input_data <- map2(means, sds, ~ rnorm(n = N, mean = .x, sd = .y)) %>%
  set_names(paste("Var", 1:length(.), sep = "_")) %>%
  as_tibble()

#setup problem
k <- 5

#Create two models, one from diceR::consensus_cluster and the other from stats
dicer_clusters <- consensus_cluster(
  data = input_data,
  nk = k,
  algorithms = "hc",
  hc.method = "complete",
  reps = 1,
  p.item = 1
) %>% 
  as.data.frame.table() %>%
  select(Freq) %>%
  as_vector() %>% 
  unname()

stats_clusters <- dist(input_data, method = "euclidean") %>%
  hclust() %>%
  cutree(k)

# Compare after relabelling based on reference
stats_clusters %>% 
  relabel_class(dicer_clusters) %>% 
  table(stats_clusters = ., dicer_clusters)
#>               dicer_clusters
#> stats_clusters   1   2   3   4   5
#>              1 183   0   0   0   0
#>              2   0  56   0   0   0
#>              3   0   0 124   0   0
#>              4   0   0   0 100   0
#>              5   0   0   0   0  37

Created on 2018-06-11 by the reprex package (v0.2.0).
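
The relabelling step above matters because cluster labels are arbitrary: two runs can produce the same partition under permuted labels, which leaves the cross-tabulation off-diagonal. The following base-R sketch illustrates the idea with a simple maximum-overlap mapping. It is an assumption-laden toy, not the implementation behind diceR's relabel_class, and the vectors ref and pred are made up for illustration.

```r
# Two labelings of the same partition, with permuted label names
ref  <- c(1, 1, 2, 2, 3, 3, 3)
pred <- c(3, 3, 1, 1, 2, 2, 2)

# Rows are pred labels, columns are ref labels
tab <- table(pred, ref)

# For each pred label, pick the ref label it overlaps with most
# (greedy rule; adequate here, but it can map two labels to one in messier cases)
mapping <- as.integer(colnames(tab))[apply(tab, 1, which.max)]
names(mapping) <- rownames(tab)

# Translate pred into the ref labelling and compare again
relabelled <- unname(mapping[as.character(pred)])
table(relabelled, ref)
```

After the mapping, all counts fall on the diagonal, so agreement between the two partitions can be read off directly.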

AlineTalhouk / diceR

Add function parameter for hclust method #130