feiyoung / PRECAST

An efficient data integration method for multiple spatial transcriptomics datasets with non-cluster-relevant effects such as complex batch effects.
GNU General Public License v3.0

Discrepancy in PRECAST output based on setting different random seeds. #26

Open prashanthi-ravichandran opened 1 month ago

prashanthi-ravichandran commented 1 month ago

Hello,

I've been using PRECAST to infer spatial domains in my 10x Visium dataset, and found some confusing results that I'd like clarification on. I ran PRECAST with the following model settings:

Here are my results and questions:

  1. Upon termination, I compared the log-likelihood curves across iterations for the multiple random seeds I tested. Interestingly, while the initial log-likelihood did not differ much across seeds, the final log-likelihood was remarkably different.

     [screenshots: log-likelihood vs. iteration curves for two random seeds]

Could you please explain how the seed parameter leads to these differences in the final log-likelihood, and why the final log-likelihood varies so much across seeds even though the initial log-likelihood does not? How should users interpret their clustering results, i.e., which random start should they pick?

  2. The differences in the log-likelihood do not always indicate how different the final cluster assignments are; with a change in the random seed, we obtain radically different cluster assignments. Here's a plot comparing the Rand index for K = 10 across the 5 random seeds.

     [screenshot: Rand index for K = 10 across the 5 random seeds]

  3. The final log-likelihood should technically increase at higher clustering resolutions, since a model with more clusters has more variance and less bias and can fit the data better. However, with the exception of one random seed, we generally do not see a higher log-likelihood at higher clustering resolutions. This in turn affects our BIC calculations and our choice of an appropriate clustering resolution.

     [screenshot: log-likelihood across clustering resolutions for the 5 random seeds]

In summary, could you please address:

  1. The discrepancies in the cluster assignments with a change in the random seed: how should we address this uncertainty in our findings?
  2. When picking a clustering resolution, is it expected for the log-likelihood to be lower at higher clustering resolutions, or is that specific to particular random seeds? How do we get a true estimate of the log-likelihood at a given clustering resolution, or at minimum generate a representative distribution (by varying random seeds?) so we can pick an appropriate resolution?
  3. The tutorials typically use 30 iterations, and we observe greater differences in the log-likelihood estimates as we increase the number of iterations. Could this tell us something about how many iterations we should run the model for, and why?

Thanks! Prashanthi

feiyoung commented 1 month ago

For question 1, the change in cluster assignments with a change in the random seed is due to the initialization of model parameters and cluster assignments. In the PRECAST model, the log-likelihood is non-concave with respect to the model parameters and may therefore have several local maximizers. If the initial value is close to one of these local maximizers, the algorithm may converge to it rather than to the global maximizer. This non-concavity is a result of the complexity of the model. A commonly used strategy to avoid a poor local maximizer is to try multiple random seeds and select the run that attains the highest log-likelihood.
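The multi-start strategy can be sketched as follows. This is only an illustration of the principle, not PRECAST's actual code: scikit-learn's `GaussianMixture` stands in for the PRECAST model, and the toy data replace a real expression matrix.

```python
# Sketch of the multi-start strategy: fit the same model from several random
# seeds and keep the run with the highest final log-likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian blobs in 2D
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])

best_fit, best_ll = None, -np.inf
for seed in range(5):  # try several random seeds
    gmm = GaussianMixture(n_components=2, init_params="random",
                          random_state=seed).fit(X)
    ll = gmm.score(X) * len(X)  # total log-likelihood of this run
    if ll > best_ll:
        best_fit, best_ll = gmm, ll

labels = best_fit.predict(X)  # cluster assignments from the best run
```

The cluster assignments to report are those from the run that maximizes the final log-likelihood.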

For question 2, obtaining a true estimate of the log-likelihood is challenging because it requires a perfect initial value close to the global maximizer, which is often unattainable. In our algorithm, we first perform PCA on the combined batch-uncorrected expression matrix to obtain a PC score matrix, which is then fed into the Gaussian mixture model implemented in the Mclust function of the R package mclust. This provides a relatively good initial value. However, the Mclust function also depends on the random seed, because its objective function is likewise non-concave. When the clustering resolution is not easily determined by the BIC criterion, we can set it manually, e.g., to 12: as demonstrated in your figure, when K exceeds 12 the algorithm fails to converge to the global maximizer, suggesting that 12 is a more reliable choice.
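The PCA-then-mixture initialization and the BIC scan over K can be sketched like this, again with scikit-learn stand-ins for PCA and Mclust (one caveat: scikit-learn's `bic()` is minimized, whereas mclust's BIC is maximized):

```python
# Sketch: initialize clustering by running PCA on the expression matrix and
# fitting a Gaussian mixture to the PC scores, then scan K by BIC.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy "expression matrix": 300 spots x 50 genes with three latent groups
centers = rng.normal(0, 4, (3, 50))
X = np.vstack([rng.normal(centers[k], 1, (100, 50)) for k in range(3)])

scores = PCA(n_components=10, random_state=1).fit_transform(X)  # PC score matrix

# Scan candidate resolutions; lower BIC is better in scikit-learn's convention
bics = {K: GaussianMixture(n_components=K, random_state=1).fit(scores).bic(scores)
        for K in range(2, 7)}
K_best = min(bics, key=bics.get)  # BIC-selected number of clusters
```

On real data the BIC curve can be flat or non-monotone across seeds, which is exactly why a manual choice of K is sometimes needed.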

For question 3, the appropriate maximum number of iterations may vary across datasets. The stopping rule is determined by two factors: (1) maxIter, the maximum number of iterations, and (2) epsLogLik, the tolerance for the relative difference in log-likelihood between successive iterations. You can set epsLogLik to a small fixed value, such as 1e-6, and maxIter to a large value, such as 500. The algorithm stops as soon as the relative difference in log-likelihood falls below epsLogLik.
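A minimal sketch of this stopping rule, with a hypothetical `update_step()` standing in for one iteration of the fitting algorithm:

```python
# Sketch of the stopping rule: iterate until either maxIter is reached or the
# relative change in log-likelihood drops below epsLogLik.
import math

def run_until_converged(update_step, loglik0, maxIter=500, epsLogLik=1e-6):
    ll_prev = loglik0
    for it in range(1, maxIter + 1):
        ll = update_step(it)  # one iteration of the (hypothetical) algorithm
        # relative difference in log-likelihood between successive iterations
        if abs(ll - ll_prev) / (abs(ll_prev) + 1e-12) < epsLogLik:
            return it, ll  # converged before hitting maxIter
        ll_prev = ll
    return maxIter, ll_prev  # hit the iteration cap

# Example: a mock log-likelihood that plateaus as iterations proceed
iters, ll = run_until_converged(lambda it: -1000.0 * math.exp(-it) - 500.0,
                                loglik0=-2000.0)
```

With a generous maxIter, convergence is governed by epsLogLik rather than by an arbitrary iteration count such as 30.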

The warning is triggered by the inv_sympd() function in the Armadillo library. Although the covariance matrix should theoretically be symmetric, numerical computation can sometimes produce a slightly non-symmetric matrix. However, this is not a significant issue.
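A small numerical illustration of the issue and the usual remedy, with NumPy standing in for the Armadillo call:

```python
# Sketch: a covariance matrix computed in floating point can be asymmetric by
# tiny rounding errors, which trips symmetric-solver routines such as
# Armadillo's inv_sympd(). A common remedy is to symmetrize before inversion.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
cov = A @ A.T + 4 * np.eye(4)                    # symmetric PD in exact arithmetic
cov_noisy = cov + rng.normal(0, 1e-14, (4, 4))   # simulate rounding asymmetry

cov_sym = (cov_noisy + cov_noisy.T) / 2  # force exact symmetry
inv = np.linalg.inv(cov_sym)             # inversion now behaves as expected
```

Because the asymmetry is on the order of floating-point noise, symmetrizing changes the result negligibly, which is why the warning is harmless.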