JEFworks-Lab / STdeconvolve

Reference-free cell-type deconvolution of multi-cellular spatially resolved transcriptomics data
http://jef.works/STdeconvolve/

Error in asMethod(object) : Cholmod error 'problem too large' at file #24

Closed. AteeqMKhaliq closed this issue 2 years ago.

AteeqMKhaliq commented 2 years ago

Hi, sorry to bother you, but I am stuck again at one point. I currently have a combined object with ~130,000 cells, and when I run restrictCorpus() I get the following error.

## filter for features in less than 100% of pixels but more than 5% of pixels
pdac.mat <- restrictCorpus(pdach1.mat, removeAbove = 1.0, removeBelow = 0.05)

and the error is as follows


Error in asMethod(object) :
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 102
Calls: restrictCorpus ... is.data.frame -> as.matrix -> as.matrix.Matrix -> as -> asMethod
Execution halted

Please let me know how best I can handle this error.

Thanks again, and have a great day. Regards, AMK

bmill3r commented 2 years ago

Hi @AteeqMKhaliq,

Thanks for letting me know about this error. Is the pdach1.mat object a sparse matrix? In restrictCorpus(), the input object is converted into a dense matrix, so I wonder whether ~130,000 cells by however many genes (the whole genome, I suspect?) becomes too large for R, which has a theoretical limit of 2147483647 elements per vector.

Ultimately, the input corpus for STdeconvolve will need to be a dense matrix, but restrictCorpus() removes most of the features and keeps only the informative ones, so the result should be a smaller matrix that stays within memory limits. Moving forward, you could try removing genes that you know will not be informative, such as housekeeping genes or genes with poor capture efficiency (indicated by very few counts across all of the cells). This should help reduce the size of the initial matrix. Additionally, genes that are present in every spot may also not be very informative, and you could remove those as well to shrink the input matrix further.
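As a minimal sketch of this kind of pre-filtering, done while the matrix is still sparse (the object name `pdach1.mat`, the threshold values, and the `housekeeping_genes` list are placeholders, not part of the package):

```r
library(Matrix)

## keep the object as a sparse dgCMatrix for as long as possible
## (coercion name assumes a recent version of the Matrix package)
counts <- as(pdach1.mat, "CsparseMatrix")  # genes x spots

## total counts and number of spots detecting each gene
geneCounts <- Matrix::rowSums(counts)
geneDetect <- Matrix::rowSums(counts > 0)

## drop genes with very few counts overall (poor capture) and genes
## detected in every spot (likely uninformative); thresholds illustrative
keep <- geneCounts >= 10 & geneDetect < ncol(counts)
counts <- counts[keep, ]

## optionally drop a manual list of housekeeping genes (hypothetical list)
# counts <- counts[!rownames(counts) %in% housekeeping_genes, ]
```

The dense conversion inside restrictCorpus() then operates on a much smaller matrix.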

Let me know if this doesn't make sense or if you have additional questions, Brendan

AteeqMKhaliq commented 2 years ago

Thanks a lot Brendan, and sorry for bothering you so frequently. I tested STdeconvolve on a small number of samples (n=16) and it worked perfectly fine; now I have n=45, and STdeconvolve breaks down at the restrictCorpus() step. Moving forward, please let me know the best way to retain the most informative genes. Are there any tools I can use for this? Sorry again for raising issues so frequently, but I love STdeconvolve and want to use it in our analysis. Thanks again for your help. Best, AMK

bmill3r commented 2 years ago

Hi @AteeqMKhaliq,

Not a problem, and thanks again for your enthusiasm for STdeconvolve! When you say 16 or 45 samples, do you mean the total number of datasets? That is, your 45 datasets contain ~130,000 total spots to deconvolve, correct?

For now, my first suggestion would be to run cleanCounts() on each dataset if you have not already done so. This removes spots that have few genes and thus represent low-quality regions. Hopefully this reduces the number of spots and ensures you are only working with high-quality ones. Note that you can adjust the cleanCounts() parameters to increase the stringency of filtering.

Next, I suggest running restrictCorpus() on each of the datasets separately. restrictCorpus() first removes poorly captured genes (those that appear in only a few spots of a dataset) and genes that appear in all spots and are therefore not very informative of cell types. It then searches for overdispersed genes, i.e., genes predicted to be cell-type specific, and constrains the dataset matrix to these.

Afterwards, you could merge the restricted matrices for each dataset back together. To do this, for each matrix, take the list of spots and genes that remain after restrictCorpus() and filter the rows and columns of the original matrix. I imagine some overdispersed genes will be shared among all or most of the datasets, and these could be your list of informative cell-type genes. Alternatively, you could take the union of the overdispersed genes across datasets, on the assumption that some datasets may contain cell types that are absent in others and are marked by overdispersed genes specific to that dataset. Hopefully this approach will get you past the restrictCorpus() step.
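The per-dataset workflow above could be sketched roughly like this (`dataset_list`, `combined.mat`, and the parameter values are hypothetical placeholders; matrices are assumed to be genes x spots):

```r
library(STdeconvolve)

## clean and restrict each dataset separately
restricted <- lapply(dataset_list, function(mat) {
  cleaned <- cleanCounts(mat, min.lib.size = 100)  # drop low-quality spots
  restrictCorpus(cleaned, removeAbove = 1.0, removeBelow = 0.05)
})

## union of overdispersed genes across datasets
## (use Reduce(intersect, ...) instead for genes shared by all datasets)
odGenes <- Reduce(union, lapply(restricted, rownames))

## spots that survived cleaning in each dataset
keptSpots <- unlist(lapply(restricted, colnames))

## filter the original merged matrix down to informative genes and kept spots
pdac.merge <- combined.mat[odGenes, keptSpots]
```

This assumes spot names are unique across datasets; if not, they would need to be prefixed with a dataset identifier before merging.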

Let me know if this doesn't make sense or if you have additional questions, Brendan

AteeqMKhaliq commented 2 years ago

Thanks a lot Brendan. I ran restrictCorpus(), merged the overdispersed genes, took the counts for these genes, and am now running

pdac.ldas = fitLDA(t(as.matrix(pdac.merge)), Ks = seq(2,15,by=1), plot=T, verbose=TRUE)

I have a concern here: fitLDA() has now been running for more than 24 hours and I am a bit worried. Is there a way to speed up the execution time? Thanks again Brendan for your help, it is much appreciated. Best, AMK

bmill3r commented 2 years ago

Hi @AteeqMKhaliq,

Depending on the size of the input dataset, STdeconvolve can take a while to fit LDA models. What is the size of your input dataset pdac.merge (number of spots and number of selected genes)?

fitLDA() is also parallelizable: each LDA model with a given K can be run on a different core to help speed things up. How many cores are available on your machine? You can set the parameter ncores equal to this. Also, I suspect there will be more than 2 cell types in your dataset, so you could set Ks to a more realistic range. For example, if you have 8 available cores, you could set Ks to something like seq(8,15,by=1). That way, each model runs on a different core at the same time, and the bottleneck will likely be waiting for the K=15 model to finish. At least you won't need to wait for each model to be fit sequentially, and if your input dataset is large, I imagine each one will take a while to run.

fitLDA() should also have verbose=TRUE by default and should indicate which model is being fit. Has it been hanging on the model with the smallest K this entire time (24+ hours)?
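As a sketch of the parallel setup described above (the values of Ks and ncores are illustrative and should match your machine and expected number of cell types):

```r
library(STdeconvolve)

## corpus must be spots x genes, hence the transpose of the genes x spots matrix
pdac.ldas <- fitLDA(t(as.matrix(pdac.merge)),
                    Ks = seq(8, 15, by = 1),  # one candidate model per K
                    ncores = 8,               # number of available cores
                    plot = TRUE,
                    verbose = TRUE)
```

With ncores equal to the number of Ks, all models fit concurrently and total wall time is roughly that of the largest K.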

Hope this helps, Brendan

AteeqMKhaliq commented 2 years ago

Thanks for replying. Yes, my model is still hanging at K=6! It has now been more than 31 hours :(. I will try setting ncores; I hope this gets me through. My input data is 6754 genes * 100690 cells.

bmill3r commented 2 years ago

Hi @AteeqMKhaliq,

Hopefully fitting all of your models at once will help speed things up. I will say, though, that your dataset is exceptionally large, and I would expect STdeconvolve to take some time to fit. In the paper, Supplementary Figure S15 gives an idea of the runtime. In general, time seems to scale linearly with the number of genes or pixels. For example, with K=12, 1000 spots and 1000 genes took a little under 1000 seconds to run, whereas 2702 spots and 1000 genes took about 2000 seconds. For some perspective, assuming this linear scaling and the size of your dataset, I would not be surprised if a single model (say K=12) took a day or two to complete.

Hope this helps, Brendan

AteeqMKhaliq commented 2 years ago

Hi Brendan, sorry to bother you again. I am now facing a new error that I do not understand; can you please help? This premature termination shows up during fitLDA(). As per your suggestion, I used ncores. My command is

pdac.ldas = fitLDA(t(as.matrix(pdac.mat)), Ks = seq(8, 15,by=1), plot=T, ncores=16, verbose=TRUE)

Error in env[[as.character(i)]] <- value :
  wrong args for environment subassignment
Calls: fitLDA ... bploop -> bploop.iterate -> <Anonymous> -> add_inorder
Execution halted

Thank you, AMK

bmill3r commented 2 years ago

Hi @AteeqMKhaliq,

I don't immediately know what the problem could be. Are you able to provide the full traceback of the error? It would help me understand at what point in fitLDA() the error occurs. Also, does this error occur very quickly after running the command, or does it happen after fitting is complete? Finally, I notice you are using the variable pdac.mat; does this differ from your previous pdac.merge in any way?
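One generic way to capture a full traceback in base R, for a script that halts on error (this is a base-R sketch, not specific to STdeconvolve; the fitLDA() call mirrors the command above):

```r
## dump the call stack automatically when any error halts the script
options(error = function() traceback(2))

pdac.ldas <- fitLDA(t(as.matrix(pdac.mat)), Ks = seq(8, 15, by = 1),
                    plot = TRUE, ncores = 16, verbose = TRUE)

## in an interactive session, traceback() can instead be called
## immediately after the error occurs
```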

Thanks, Brendan