dm13450 / dirichletprocess

Build dirichletprocess objects for data analysis
https://dm13450.github.io/dirichletprocess/

Different number of clusters of two converged dp model objects trained on same dataset? #25

Open wseis opened 1 year ago

wseis commented 1 year ago

Hi @dm13450, I ran the dirichletprocess package twice on the same data set and called the two resulting objects dp and dp2 (see below). I checked convergence following the approach from your blog, and all parameters appear to have converged, with Gelman-Rubin diagnostics well below 1.1. The summaries of the two objects show a mean number of clusters of 5.26 and 5.22, respectively. However, when I print the number of clusters per object, dp has 5 clusters (which I would expect given the average of 5.26), but dp2 has only 2 clusters. Can you explain why that is?

```
print(dp)

Dirichlet process object run for 5000 iterations.

Mixing distribution       mvnormal
Base measure parameters   c(0, 0), c(1, 0, 0, 1), 2, 2
Alpha Prior parameters    2, 4
Conjugacy                 conjugate
Sample size               28

Mean number of clusters   5.26
Median alpha              0.84

dp2

Dirichlet process object run for 5000 iterations.

Mixing distribution       mvnormal
Base measure parameters   c(0, 0), c(1, 0, 0, 1), 2, 2
Alpha Prior parameters    2, 4
Conjugacy                 conjugate
Sample size               28

Mean number of clusters   5.22
Median alpha              0.83

dp$numberClusters
[1] 5
dp2$numberClusters
[1] 2
```

dm13450 commented 1 year ago

Hey, thanks for using the package.

dp$numberClusters is the number of clusters at the last iteration of the fitting process, so it will vary between runs; it is just one sample from the DP. As it's a Bayesian model, each data point has a distribution over likely cluster assignments rather than a single label.
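
If it helps to see that, one option (a minimal sketch, assuming the per-iteration label vectors are stored in dp$labelsChain) is to count the distinct labels at every saved iteration rather than only the last one:

```r
# Count the distinct cluster labels at each saved iteration of the chain,
# assuming dp$labelsChain holds one label vector per iteration.
nClustersChain <- sapply(dp$labelsChain, function(labels) length(unique(labels)))

mean(nClustersChain)   # should be close to the 5.26 reported by print(dp)
table(nClustersChain)  # how often each cluster count occurs across the chain
```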

Does that help?

wseis commented 1 year ago

Hi, thanks for your answer. So if I understood correctly, dp$numberClusters refers only to the last iteration of the MCMC fit. Given that, how can I determine the most likely cluster label for a given data point? In your blog post "Dirichlet Process Cluster Probabilities," you use

```r
numClusters <- dp$numberClusters
clusterLabelProbs <- matrix(nrow = nrow(faithfulTrans), ncol = numClusters + 1)
```

to determine the required number of columns. As you can see, in my case this won't work, since the number of clusters varies strongly between iterations.

Given that I have n iterations of cluster labels stored in dp$labelsChain, does it make sense to extract the i-th element from every iteration (for the i-th data point) and use the most frequently occurring cluster assignment?

Something like:

```r
library(purrr) # map() extracts the i-th element from each list in a list of lists

labels1 <- list()
for (i in 1:length(dp$clusterLabels)) {
  labels1[[as.character(i)]] <-
    as.numeric(rownames(as.matrix(sort(table(unlist(map(dp$labelsChain, i))), decreasing = TRUE)))[1])
}

unlist(labels1)
```

Moreover, I have a similar problem with the ClusterLabelPredict function.

When I apply it twice to the same newData and the same fitted dp object, I get varying results. I assume this is also because each call is a random draw from the posterior. However, when I repeat the prediction several times and take the most frequent cluster label, the prediction stabilizes. Something like:

```r
## Generate predictions for the first two rows of the training set
prediction1 <- replicate(1000, list(ClusterLabelPredict(dp, newData = dp$data[1:2, ])))
prediction2 <- replicate(1000, list(ClusterLabelPredict(dp, newData = dp$data[1:2, ])))

## Extract the most frequent cluster label for newData item 2
t1 <- sort(table(unlist(map(map(prediction1, 1), 2))), decreasing = TRUE)[1]
t2 <- sort(table(unlist(map(map(prediction2, 1), 2))), decreasing = TRUE)[1]
```

Or is there any more convenient way to extract the most likely cluster label?

dm13450 commented 1 year ago

Yeah, everything above is correct. ClusterLabelPredict gives just one sample from the posterior, so you need to do multiple draws, like you have.

Unfortunately there's nothing more convenient, but your code looks fine.
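
If you want to skip the repeated replicate/table bookkeeping, a small hypothetical wrapper along these lines would do the same thing (it assumes, as in your code, that the first element of the ClusterLabelPredict result is the vector of predicted labels):

```r
# Hypothetical helper: draw ClusterLabelPredict() many times and return the
# most frequent label for each row of newData. The first element of the
# returned list is taken to be the vector of predicted labels, as above.
modalClusterLabel <- function(dp, newData, nDraws = 1000) {
  draws <- replicate(nDraws, ClusterLabelPredict(dp, newData = newData)[[1]],
                     simplify = FALSE)
  labelMatrix <- do.call(rbind, draws)  # one row per draw, one column per newData row
  apply(labelMatrix, 2, function(labels) as.numeric(names(which.max(table(labels)))))
}

modalClusterLabel(dp, newData = dp$data[1:2, ])
```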