Closed by JakeNel28 6 years ago
Hi @JakeNel28,
the difference lies in the fact that in diceR:::hc we use method = "average" instead of the default for hclust, which is method = "complete". Currently we do not allow the user to pass different arguments. Please let me know if you would like this feature added.
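To illustrate how much the linkage method alone can change the result, here is a minimal sketch that uses only stats functions. The data are simulated for illustration (not the data set from this issue), and the object names are made up for the example. It cuts the same Euclidean distance matrix into 5 clusters under complete and average linkage and cross-tabulates the two assignments.

```r
# Illustrative sketch only: simulated data, hypothetical object names.
set.seed(1)
x <- matrix(rnorm(200 * 4), ncol = 4)
d <- dist(x, method = "euclidean")

# stats::hclust uses complete linkage by default
complete_clusters <- cutree(hclust(d, method = "complete"), k = 5)

# Average linkage, the method named above for diceR:::hc
average_clusters <- cutree(hclust(d, method = "average"), k = 5)

# Cross-tabulate the two assignments; cluster labels are arbitrary,
# so compare the pattern of counts rather than the label values
table(complete = complete_clusters, average = average_clusters)
```

If the two linkages agreed, each row and column of the table would contain a single non-zero cell. Counts spread across several cells indicate that the linkage choice alone is changing the partition.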
I believe it would be a worthy inclusion!
Hi @JakeNel28, there is now agreement between the two methods:
library(tidyverse)
library(diceR)
library(stats)
# Simulate data set
means <- c(3.23, 2.78, 3.85, 3.55, 3.95, 3.21, 3.52, 4.42, 4.15, 3.68)
sds <- c(1.48, 1.4, 1.31, 1.43, 1.44, 1.41, 1.25, 1.04, 1.08, 1.26)
N <- 500
set.seed(100)
input_data <- map2(means, sds, ~ rnorm(n = N, mean = .x, sd = .y)) %>%
  set_names(paste("Var", seq_along(.), sep = "_")) %>%
  as_tibble()
# Set up the problem: number of clusters
k <- 5
# Create two clusterings, one from diceR::consensus_cluster and the other from stats::hclust
dicer_clusters <- consensus_cluster(
  data = input_data,
  nk = k,
  algorithms = "hc",
  hc.method = "complete",
  reps = 1,
  p.item = 1
) %>%
  as.data.frame.table() %>%
  select(Freq) %>%
  as_vector() %>%
  unname()
stats_clusters <- dist(input_data, method = "euclidean") %>%
  hclust(method = "complete") %>%  # "complete" is the hclust default, stated explicitly
  cutree(k)
# Compare after relabelling based on reference
stats_clusters %>%
  relabel_class(dicer_clusters) %>%
  table(stats_clusters = ., dicer_clusters)
#>               dicer_clusters
#> stats_clusters   1   2   3   4   5
#>              1 183   0   0   0   0
#>              2   0  56   0   0   0
#>              3   0   0 124   0   0
#>              4   0   0   0 100   0
#>              5   0   0   0   0  37
Created on 2018-06-11 by the reprex package (v0.2.0).
Hi guys,
Came across your package just recently, and really love the idea of it. My analytics team wants to start incorporating it into our workflow. I was hoping for clarification on why results differ so much when using hierarchical clustering from the stats package. I hope I'm not missing something obvious here.
I've simulated a data set and used diceR::consensus_cluster and stats::hclust, with 5 clusters. Both use "euclidean" distances. I use consensus_cluster with only one rep, and use 100% of the items. I expect that the clustering assignments should be very similar.
Created on 2018-05-18 by the reprex package (v0.2.0).
As you can see from the table output, most of the clustering assignments that diceR produces fall in the same cluster (the first one), whereas stats produces an even spread across the 5 clusters. I would expect one central mass in every row and column of this cross-table. I've had similar experiences trying different algorithms and on more "real" data sets.
Can you help clarify why the results are not comparable?