WWXkenmo / ENIGMA

A fast and accurate deconvolution algorithm based on regularized matrix completion algorithm (ENIGMA)
MIT License

Any way to speed up the calculation? #8

Open Yun-Ching-Chen opened 1 year ago

Yun-Ching-Chen commented 1 year ago

Hi,

I've tried to run ENIGMA trace norm on ~500 TCGA samples using my own scRNA data (15 cell types with ~10000 genes) as the reference (as an aggregated 10000x15 matrix rather than a Seurat object). It has been running for over 24 hrs. I wonder if there are any tips to speed up the calculation, or whether a multi-core version would be possible?

Thanks, YC

WWXkenmo commented 1 year ago

Thanks for using our tool!

ENIGMA trace norm is slow because it needs to perform an SVD for each CSE in each round of gradient calculation. Optimisation therefore takes time when ENIGMA trace norm is applied to large datasets (>1000 samples or >5000 spots). Even though the current version takes a long time on large datasets, it should not take this long (over 24 hrs) on ~500 samples. Here are my current thoughts on fixing the issue:

  1. Please use verbose = TRUE to check whether the Kappa Score keeps decreasing. If the Kappa Score is increasing instead, the algorithm is not converging, and I suggest setting a smaller gradient step tao_k.
  2. If the Kappa Score is decreasing, I suggest setting a relatively bigger gradient step tao_k, but keep in mind that too big a step size will stop the algorithm from converging. Alternatively, you could set a bigger max_ks (e.g. 2-5) to relax the stopping condition.
  3. Please do not use too many reference cell types. Too many cell types make the CSE estimation worse, and the same holds for other CSE estimation or cell type deconvolution tools, because some of the cell types may have very similar gene expression patterns (high correlation). I suggest inspecting the dataset to make sure each cluster has a distinct gene expression profile, which can be checked by calculating the pairwise correlation among cell types. A lower number of cell types also helps to speed up the calculation.
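The pairwise correlation check in point 3 could be sketched roughly as follows. This is only an illustration in Python with NumPy, not part of ENIGMA: the function name `highly_correlated_pairs`, the 0.9 threshold, the log transform, and the toy reference matrix are all assumptions made for the example. In R, `cor()` on the genes x cell types reference matrix would give the same kind of correlation matrix.

```python
import numpy as np

def highly_correlated_pairs(ref, cell_types, threshold=0.9):
    """Return (type_a, type_b, r) for cell-type pairs whose profiles correlate above threshold.

    ref is a genes x cell-types matrix of aggregated expression (hypothetical input).
    """
    # Log-transform to damp the influence of a few very highly expressed genes.
    log_ref = np.log1p(np.asarray(ref, dtype=float))
    # np.corrcoef treats rows as variables, so transpose to correlate cell types.
    corr = np.corrcoef(log_ref.T)
    pairs = []
    for i in range(len(cell_types)):
        for j in range(i + 1, len(cell_types)):
            if corr[i, j] > threshold:
                pairs.append((cell_types[i], cell_types[j], float(corr[i, j])))
    # Most similar pairs first.
    return sorted(pairs, key=lambda p: -p[2])

# Toy reference: 1000 genes, 4 cell types, where T_cell_b is a noisy copy of T_cell_a.
rng = np.random.default_rng(0)
base = rng.gamma(shape=2.0, scale=50.0, size=(1000, 3))
near_duplicate = base[:, 0] * rng.uniform(0.9, 1.1, size=1000)
ref = np.column_stack([base[:, 0], near_duplicate, base[:, 1], base[:, 2]])
cell_types = ["T_cell_a", "T_cell_b", "B_cell", "Monocyte"]

flagged = highly_correlated_pairs(ref, cell_types)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```

Pairs flagged this way are candidates for merging into a single reference cell type before running the deconvolution; the right threshold depends on the dataset.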

The parallelized ENIGMA is under development and I will upload it soon. I hope the information above is helpful, and please let me know if you have any new questions.

Best, Ken

WWXkenmo commented 1 year ago

Hey

Have you solved your problem? If you still have trouble, could you share the data with me (you can omit any important information)? Then I could help you address it.

Best, Ken