Open yollct opened 11 months ago
Update, when I run for only 10 genes, it works. It is only taking longer than I expected. For the full dataset, it was running for four days and stopped by our server due to the time limit.
Hi, Sorry for the delay in response. Unfortunately the estimation of spatial data can be time consuming. Two solutions you can try:
Please let me know if it works for you!
Best, Dongyuan
Hi,
When I try to run the tutorial code with the changes you mentioned I get the following warning?
Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G$w, : step failure in theta estimation
Is this a problem?
Best, Lucas
@lucas02061 Hi Lucas,
This is an issue with bam
: https://stats.stackexchange.com/questions/401278/step-failure-in-theta-estimation-using-bam-in-mgcv
Based on my experience, the function does not converge, but usually, it will not be too bad. I think you are probably OK with the result.
Best, Dongyuan
Hi @SONGDONGYUAN1994,
Are there any plans of adding a progress bar to scDesign3? The Marginal Fitting step is very slow on spatial transcriptomics data even with the settings you mentioned earlier. It is hard to know if it is unfeasible to do a run on a given dataset or it is almost done running without a progress bar.
Best, Lucas
@lucas02061 Thanks for raising this problem. Are you still using Windows right now? On Linux, you can use pbmcapply
, but on Windows, I am not sure (there should be one); @qw130 Qingyang, do you have any idea?
Unfortunately, I am traveling in the next two weeks but I will begin on improving the speed after that. Could you first subsample ~ 100 genes and tell me the running time? Thanks!
Best, Dongyuan
For only a 100 genes using the following settings it takes me approximatly15 minutes to run. Once I start going higher on genes though it starts taking many hours.
example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400 sigma_formula = "1", family_use = "nb", n_cores = 2, usebam = TRUE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "bpmapply" )
Thanks for your help.
Best, Lucas
Hi @SONGDONGYUAN1994,
I tried running it on our server (which is quite heavy) for 24 hours and it is still stuck on marginal fitting. It still seems incredibly slow with the settings provided in this thread.
I used the following dataset: class: SingleCellExperiment dim: 18078 684 metadata(0): assays(1): counts rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1 rowData names(1): gene_id colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1 colData names(3): X Y cell_type reducedDimNames(0): mainExpName: NULL altExpNames(0): class: SingleCellExperiment dim: 18078 684 metadata(0): assays(1): counts rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1 rowData names(1): gene_id colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1 colData names(3): X Y cell_type reducedDimNames(0): mainExpName: NULL altExpNames(0):
And I used the following settings. example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400 sigma_formula = "1", family_use = "nb", n_cores = 2, usebam = TRUE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "pbmcapply", BPPARAM=NULL )
Am I doing something wrong?
Best regards, Lucas
Hi Lucas, @lucas02061 I feel that it is probably caused by some very sparse genes; models on those genes might be hard to converge. That is not your fault - could you check if the running time is related to this? I can fix it later. @qw130 Hi Qingyang, could you help Lucas on this? Sorry that I am traveling next week.
Dear Dongyuan,
I also have the same issue in fitting spatial data. Can I feed in 10 genes each time, and repeat the fitting multiple times in independent runs to get the result?
Best, Feng
Hi Feng, If you ignore the gene-gene correlation structure (or you believe that all variations should be explained by spatial coordinates), this is totally OK. It would be better that you can first divide genes into gene modules/groups then do this fitting rather than randomly split genes into 10-gene groups.
Thank you. That is a great point. I will use gene module groups instead.
Hi developers, I want to add scDesign3 to a benchmarking paper that my research group is working on, but I also have trouble with running the tool for spatial data on a decently beefy linux server, where the server gets stuck on marginal fitting. It took approximately one hour and 24 minutes to run on the dataset from the tutorial for spatial data, is it expected for the tool to take a long time running on the this dataset?
I make use of the following parameters with multiple cores to speed it up.
example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 400)", sigma_formula = "1", family_use = "nb", n_cores = 16, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "pbmcapply" ))
Best Regards Nikolaj.
@TotallyRealHolm Dear Nikolaj, Thank you for your interest in our work. Unfortunately, the running time for spatial data is much longer since the spatial pattern is like a 2-D surface, which is much more complicated. We do have some ideas for boosting the simulation.
k
. Now k = 400
, which is large. The fitting time is almost quadratic to k
. You can try a smaller k, which will give you simpler spatial patterns - but the pattern is still there. Based on this idea, we also add a new parameter in function fit_marginal(edf_flexible = TRUE)
, which will automatically select the k.Please let me know if it works for you. Thanks!
Best, Dongyuan
Hi scDesign3 developers,
Thank you for maintaining the amazing tool! When I was simulating the data from https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0, scDesign3 seems to stuck in 'Marginal Fitting' for a long time. Doesn't matter if I just simulate for 10 genes, it stucks at Marginal Fitting. I will very much appreciate your help!! Thanks :)
Best, Chit Tong
Here is the code I used: