Hi @AteeqMKhaliq,
Thanks for letting me know about this error. Is the pdach1.mat object a sparse matrix? In restrictCorpus(), the input object is converted into a dense matrix, so I wonder whether ~130,000 cells by however many genes (the whole genome, I suspect?) becomes too large for R (R has a memory limit: a single vector can hold at most 2147483647 elements). Ultimately, the input corpus to STdeconvolve will need to be a dense matrix, but restrictCorpus() removes most of the features and keeps only the informative ones, so hopefully the result is a smaller matrix that won't exceed memory limits.
Moving forward, you could try removing genes that you know will not be informative (such as housekeeping genes, or genes with poor capture efficiency, indicated by very few counts across all of the cells). This should help reduce the sparsity and size of the initial matrix. Additionally, genes that are present in every spot may not be very informative either, and you could remove those as well to further reduce the size of the input matrix.
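If it helps, here is a minimal sketch of that kind of pre-filtering, assuming `counts` is your genes x spots (possibly sparse) count matrix; the gene list and threshold below are hypothetical and should be tuned to your data:

```r
library(Matrix)

# drop genes with very few counts across all spots (poor capture efficiency);
# the threshold of 10 total counts is just an illustrative value
keep <- Matrix::rowSums(counts) >= 10

# optionally drop genes you already know are uninformative, e.g. a
# hypothetical list of housekeeping genes
housekeeping <- c("ACTB", "GAPDH", "B2M")
keep <- keep & !(rownames(counts) %in% housekeeping)

counts.filtered <- counts[keep, , drop = FALSE]
dim(counts.filtered)
```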
Let me know if this doesn't make sense or if you have additional questions, Brendan
Thanks a lot Brendan, and sorry for bothering you so frequently. I tested STdeconvolve on a small number of samples (n=16) and it worked perfectly fine; now I have n=45, and STdeconvolve breaks down at the restrictCorpus() step. Moving forward, please let me know the best way to retain the most informative genes; are there any tools I can use for this? Sorry again for bringing up issues so frequently, but I love STdeconvolve and want to use it in our analysis. Thanks again for your help. Best, AMK
Hi @AteeqMKhaliq,
Not a problem, and thanks again for your enthusiasm for STdeconvolve! When you say 16 or 45 samples, do you mean the total number of datasets? As in, your 45 datasets contain ~130,000 total spots to deconvolve, correct?
For now, my first suggestion would be to run cleanCounts() on each dataset, if you have not already done so. This removes spots that have few genes and were therefore likely low-quality regions, which should reduce the number of spots and ensure you are only working with high-quality ones. Note that you can adjust the parameters of cleanCounts() to increase the stringency of the filtering.
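As a rough sketch (assuming `datasets` is a list of genes x spots count matrices, one per sample; the thresholds are only illustrative, so check `?cleanCounts` for the exact arguments and defaults in your installed version):

```r
library(STdeconvolve)

cleaned <- lapply(datasets, function(cd) {
  cleanCounts(cd,
              min.lib.size = 100, # drop low-quality spots with few total counts
              min.reads = 10)     # drop genes with very few reads overall
})
```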
Next, I suggest running restrictCorpus() on each of the datasets separately. restrictCorpus() first removes poorly captured genes (those that appear in only a few spots of a dataset) as well as genes that appear in all spots and are therefore not very informative of cell types. It then searches for overdispersed genes, i.e. genes predicted to be cell-type specific, and constrains the dataset matrix to these. Afterwards, you could merge the restricted matrices for each dataset back together: for each matrix, take the list of spots and genes that remain after restrictCorpus() and use it to filter the rows and columns of the original matrix. I imagine that some overdispersed genes should be shared among all or most of the datasets, and these could be your list of informative cell-type genes. Alternatively, you could take the union of the overdispersed genes across datasets, on the assumption that some datasets may contain cell types that others do not, identified by overdispersed genes specific to that dataset. Hopefully this approach will allow you to get past the restrictCorpus() step.
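A minimal sketch of this merging idea, assuming `cleaned` is the list of per-sample matrices from above (parameter values are illustrative, see `?restrictCorpus` for the defaults, and spot/barcode names are assumed to be unique across samples):

```r
library(STdeconvolve)

# restrict each sample to its informative, overdispersed genes
restricted <- lapply(cleaned, function(cd) {
  restrictCorpus(cd, removeAbove = 1.0, removeBelow = 0.05)
})

# genes retained in each sample
odGenes <- lapply(restricted, rownames)

# option 1: genes shared by all samples
sharedGenes <- Reduce(intersect, odGenes)
# option 2: the union, if some cell types are specific to certain samples
unionGenes <- Reduce(union, odGenes)

# filter each original matrix to the chosen gene set and combine the spots
merged <- do.call(cbind, lapply(cleaned, function(cd) {
  cd[sharedGenes, , drop = FALSE]
}))
dim(merged)
```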
Let me know if this doesn't make sense or if you have additional questions, Brendan
Thanks a lot Brendan. I ran restrictCorpus() on each dataset, merged the overdispersed genes, took the counts for these genes, and am now running:
pdac.ldas = fitLDA(t(as.matrix(pdac.merge)), Ks = seq(2,15,by=1), plot=T, verbose=TRUE)
I have a concern here: fitLDA() is still running (it has now been more than 24 hours) and I am a bit worried. Is there a way to speed up the execution time? Thanks again, Brendan, for your help; it is much appreciated.
Best,
AMK
Hi @AteeqMKhaliq,
Depending on the size of the input dataset, STdeconvolve can take a while to fit the LDA models. What is the size of your input dataset pdac.merge (number of spots and number of selected genes)?
fitLDA() is also parallelizable, in that each LDA model with a given K can be run on a different core to help speed things up. How many cores are available on your machine? You can set the parameter ncores equal to this. Also, I suspect there will be more than 2 cell types in your dataset, so you could set Ks to a more realistic range of cell-type numbers. For example, if you have 8 available cores on your machine, you could set the range of Ks to something like seq(8,15,by=1). That way, each model is run on a different core at the same time, and the bottleneck will likely be waiting for the K=15 model to finish. But at least this way you don't need to wait for each model to be fit sequentially, and if your input dataset is large, I imagine each one will take a while to run.
fitLDA() should also have verbose=TRUE by default and should indicate which model is being fit. Has it been hanging on the model with the smallest K this entire time (+24 hours)?
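For example, a minimal sketch of that setup, assuming 8 available cores and your merged genes x spots matrix `pdac.merge` from before (transposed to spots x genes for fitLDA()):

```r
library(STdeconvolve)

pdac.ldas <- fitLDA(t(as.matrix(pdac.merge)),
                    Ks = seq(8, 15, by = 1), # one candidate model per K
                    ncores = 8,              # fit the 8 models in parallel
                    plot = TRUE,
                    verbose = TRUE)
```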
Hope this helps, Brendan
Thanks for replying. Yes, my model is still hanging at K=6; it has now been more than 31 hours :(. I will try setting ncores, and I hope that gets me through this. My input data is 6754 genes x 100690 cells.
Hi @AteeqMKhaliq,
Hopefully fitting all of your models at once will help speed things up. I will say, though, that your dataset is exceptionally large and I would expect STdeconvolve to take some time to fit. In the paper, Supplementary Figure S15 can give you an idea of the runtime. In general, runtime seems to scale linearly with the number of genes or pixels. For example, with K=12, 1000 spots by 1000 genes took a little less than 1000 seconds to run, whereas 2702 spots by 1000 genes took about 2000 seconds. For some perspective, assuming this linear increase and the size of your dataset, I would not be surprised if a single model (say K=12) took a day or two to complete.
Hope this helps, Brendan
Hi Brendan,
Sorry to bother you again. I am facing a new error that I am not able to understand; can you please help? This premature termination shows up during fitLDA(). As per your suggestion, I used ncores.
My command is:
pdac.ldas = fitLDA(t(as.matrix(pdac.mat)), Ks = seq(8, 15,by=1), plot=T,ncores=16,verbose=TRUE)
Error in env[[as.character(i)]] <- value :
wrong args for environment subassignment
Calls: fitLDA ... bploop -> bploop.iterate -> <Anonymous> -> add_inorder
Execution halted
Thank you, AMK
Hi @AteeqMKhaliq,
I don't know immediately what the problem could be. Are you able to provide the full traceback of the error? That would help me understand at what point in fitLDA() the error is occurring. Also, does this error occur very quickly after running the command, or does it happen after fitting is complete? I also notice you are using the variable pdac.mat; does this differ from your previous pdac.merge in any way?
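In case it is useful, here is a minimal sketch of one way to capture the call stack when running the command non-interactively (e.g. via Rscript); `options(error = ...)` and `traceback()` are base R, though an error raised inside the parallel workers may still not show much detail:

```r
# print the call stack and exit with an error status if anything fails
options(error = function() {
  traceback(2)
  quit(save = "no", status = 1)
})

pdac.ldas <- fitLDA(t(as.matrix(pdac.mat)),
                    Ks = seq(8, 15, by = 1),
                    plot = TRUE, ncores = 16, verbose = TRUE)
```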
Thanks, Brendan
Hi, sorry to bother you, but I am stuck again. I currently have a combined object with ~130,000 cells, and when I run restrictCorpus() I get the following error:
Please let me know how best I can handle this error.
Thanks again, and have a great day. Regards, AMK