Open prashanthi-ravichandran opened 1 month ago
For question 1, cluster assignments change with the random seed because the seed controls the initialization of the model parameters and cluster assignments. In the PRECAST model, the log-likelihood is non-concave with respect to the model parameters, so several local maximizers may exist. If the initial value is close to one of these local maximizers, the algorithm may converge to it rather than to the global maximizer. This non-concavity is a result of the complexity of the model. A commonly used method to avoid settling on a poor local maximizer is to try multiple random seeds and select the run with the highest final log-likelihood.
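The multi-seed strategy can be sketched as follows. This is a minimal illustration in plain Python, not PRECAST code: `fit_toy_gmm` is a hypothetical stand-in (a toy 1-D Gaussian mixture fitted by EM) for whatever model is actually being fitted, and the data are synthetic.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_loglik(data, weights, means, variances):
    """Log-likelihood of a 1-D Gaussian mixture (tiny constant guards log(0))."""
    return sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)) + 1e-300)
        for x in data
    )


def fit_toy_gmm(data, k, seed, n_iter=100):
    """Toy EM with a seed-dependent start; returns (final log-likelihood, means)."""
    rng = random.Random(seed)
    means = rng.sample(data, k)  # the seed only affects this initialization
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens) + 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances (with a variance floor)
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj,
                1e-3,
            )
    return mixture_loglik(data, weights, means, variances), means


def best_of_seeds(data, k, seeds):
    """Fit once per seed and return (best_seed, {seed: final log-likelihood})."""
    logliks = {seed: fit_toy_gmm(data, k, seed)[0] for seed in seeds}
    best_seed = max(logliks, key=logliks.get)
    return best_seed, logliks


# Synthetic data: three well-separated groups
gen = random.Random(0)
data = [gen.gauss(center, 0.5) for center in (0.0, 4.0, 8.0) for _ in range(40)]
best_seed, logliks = best_of_seeds(data, k=3, seeds=[1, 2, 3, 4, 5])
print(best_seed, logliks)
```

Only the initialization depends on the seed here, yet the final log-likelihoods can still differ between runs, which mirrors the behavior described above. Keeping the run with the highest final value is the selection rule.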
For question 2, obtaining the log-likelihood at the global maximizer is challenging because it requires an initial value close to the global maximizer, which is often unattainable. In our algorithm, we first perform PCA on the combined batch-uncorrected expression matrix to obtain a PC score matrix, which is then passed to the Gaussian mixture model implemented in the Mclust function of the R package mclust. This approach provides a relatively good initial value. However, the Mclust function also depends on the random seed because its objective function is non-concave as well. When the clustering resolution is not easily determined by the BIC criterion, you can set it manually, here to 12. As demonstrated in the figure, when K exceeds 12, the algorithm fails to converge to the global maximizer, suggesting that 12 is a more reliable choice.
For question 3, the number of iterations needed may vary across datasets. The stopping rule is determined by two settings: (1) maxIter, the maximum number of iterations, and (2) epsLogLik, the tolerance for the relative difference in log-likelihood. You can set epsLogLik to a fixed value, such as 1e-6, and maxIter to 500. If the relative difference in log-likelihood falls below epsLogLik before maxIter is reached, the algorithm stops.
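A stopping rule of this kind can be sketched as below. This is an illustration only: the exact formula PRECAST uses for the relative difference is an assumption here (absolute change divided by the previous absolute value), and `run_until_converged` and `toy_logliks` are hypothetical names.

```python
def run_until_converged(loglik_steps, max_iter=500, eps_loglik=1e-6):
    """Consume per-iteration log-likelihoods; return (iterations, loglik, converged)."""
    prev = None
    loglik = None
    iterations = 0
    for loglik in loglik_steps:
        iterations += 1
        if prev is not None:
            # Assumed definition of the relative difference between iterations
            rel_change = abs(loglik - prev) / max(abs(prev), 1e-300)
            if rel_change < eps_loglik:
                return iterations, loglik, True
        if iterations >= max_iter:
            break  # iteration budget exhausted before the tolerance was met
        prev = loglik
    return iterations, loglik, False


def toy_logliks():
    """Hypothetical log-likelihood trace that rises toward -1000."""
    t = 0
    while True:
        t += 1
        yield -1000.0 - 100.0 * 0.5 ** t


iterations, final_loglik, converged = run_until_converged(toy_logliks())
short_run = run_until_converged(toy_logliks(), max_iter=3)
print(iterations, final_loglik, converged, short_run)
```

If a run ends with `converged` set to false, the tolerance was never reached, so the final log-likelihood of that run may not be comparable to runs that did converge.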
The warning is triggered by the inv_sympd() function in the Armadillo library. Although the covariance matrix should be symmetric in theory, numerical computation can sometimes produce a non-symmetric matrix. However, this is not a significant issue.
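To see why this is harmless, note that a matrix that is symmetric only up to rounding error can be made exactly symmetric by averaging it with its transpose. The sketch below is plain Python for illustration; it is not how PRECAST or Armadillo handles the matrix internally, and the example values are made up.

```python
def symmetrize(matrix):
    """Return (A + A^T) / 2 for a square matrix given as a list of rows."""
    n = len(matrix)
    return [[0.5 * (matrix[i][j] + matrix[j][i]) for j in range(n)] for i in range(n)]


def max_asymmetry(matrix):
    """Largest absolute difference between A[i][j] and A[j][i]."""
    n = len(matrix)
    return max(abs(matrix[i][j] - matrix[j][i]) for i in range(n) for j in range(n))


# Made-up covariance estimate whose off-diagonal entries differ by rounding noise
cov = [[2.0, 0.3000001], [0.3, 1.0]]
sym = symmetrize(cov)
print(max_asymmetry(cov), max_asymmetry(sym))
```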
Hello,
I've been using PRECAST to infer spatial domains in my 10x Visium dataset, and found some confusing results that I'd like to get clarification on. I ran PRECAST with the following model settings,
Here are my results and questions:
Could you please explain how the seed parameter leads to these differences in the final log-likelihood, and why the final log-likelihood varies so much between random seeds even though the initial log-likelihood does not? How should users interpret their clustering results, i.e., which random start should they pick?
The differences in the log-likelihood do not always indicate how different the final cluster assignments are, and changing the seed gives radically different cluster assignments. Here's a plot comparing the Rand index for K = 10 across the 5 random seeds.
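For reference, the Rand index between two label vectors is the fraction of pairs of spots on which the two clusterings agree, counting a pair as an agreement when both clusterings put it together or both put it apart. A minimal pure-Python sketch (the label vectors are made up, and the pairwise loop is quadratic in the number of spots):

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two items")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b  # both together, or both apart
        total += 1
    return agree / total


# Identical partitions with swapped label names still score 1.0
relabeled = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(relabeled, crossed)
```

Because the index depends only on which spots are grouped together, it is unaffected by cluster labels being permuted between seeds.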
In principle, the final log-likelihood should increase with the clustering resolution, since a model with more variance and less bias can fit the data better. However, with the exception of one of the random seeds, we generally don't see a higher log-likelihood at higher clustering resolutions. This in turn affects our BIC calculations and the choice of an appropriate clustering resolution.
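The effect on model selection can be illustrated with the textbook BIC, computed from the best final log-likelihood found for each K. This sketch assumes the standard definition (lower is better); the exact criterion and parameter count PRECAST uses may differ, and every number below is made up.

```python
import math


def bic(loglik, n_params, n_obs):
    """Standard BIC: -2 * log-likelihood + n_params * log(n_obs); lower is better."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def best_loglik_per_k(logliks_by_k):
    """Map each K to the highest final log-likelihood across its restarts."""
    return {k: max(values) for k, values in logliks_by_k.items()}


# Made-up final log-likelihoods from three restarts at each K
logliks_by_k = {10: [-5200.0, -5150.0, -5300.0], 12: [-5180.0, -5400.0, -5250.0]}
# Hypothetical parameter counts, used only to make the example runnable
n_params_by_k = {10: 100, 12: 120}

best = best_loglik_per_k(logliks_by_k)
bic_by_k = {k: bic(best[k], n_params_by_k[k], n_obs=3000) for k in best}
print(best, bic_by_k)
```

In this made-up case the larger K has both a lower best log-likelihood and more parameters, so its BIC is worse. That may reflect restarts stuck in local maximizers rather than a genuinely poorer model.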
In summary, could you please address,
Thanks! Prashanthi