SONGDONGYUAN1994 / scDesign3

scDesign3 generates realistic in silico data for multimodal single-cell and spatial omics
https://songdongyuan1994.github.io/scDesign3/docs/index.html
MIT License
86 stars 24 forks source link

Long runtime for simulating spatial data #22

Open yollct opened 11 months ago

yollct commented 11 months ago

Hi scDesign3 developers,

Thank you for maintaining the amazing tool! When I was simulating the data from https://www.10xgenomics.com/resources/datasets/mouse-brain-serial-section-2-sagittal-anterior-1-standard-1-1-0, scDesign3 seems to stuck in 'Marginal Fitting' for a long time. Doesn't matter if I just simulate for 10 genes, it stucks at Marginal Fitting. I will very much appreciate your help!! Thanks :)

Best, Chit Tong

Here is the code I used:

print("read reference data")
brain <- Load10X_Spatial(data.dir="/anterior_data/",
                        filename="filtered_feature_bc_matrix.h5",
                        assay="Spatial",
                        slice="slice1"
)
brain <- SCTransform(brain, assay = "Spatial", verbose = FALSE)
brain <- RunPCA(brain, assay = "SCT", verbose = FALSE)
brain <- FindNeighbors(brain, reduction = "pca", dims = 1:30)
brain <- FindClusters(brain, verbose = FALSE)
brain <- RunUMAP(brain, reduction = "pca", dims = 1:30)
SpatialFeaturePlot(brain, features = "nCount_Spatial") + theme(legend.position = "right")

brain[['spatial1']] <- GetTissueCoordinates(brain)$imagerow 
brain[['spatial2']] <- GetTissueCoordinates(brain)$imagecol
brain.sce <- as.SingleCellExperiment(brain)

brain.sce <- brain.sce[1:10]

print("simulation")
set.seed(123)

example_simu <- scdesign3(
    sce = brain.sce,
    assay_use = "counts",
    celltype = "ident", ##ident or cell_type
    pseudotime = NULL,
    spatial = c("spatial1", "spatial2"),
    other_covariates = NULL,
    mu_formula = "s(spatial1, spatial2, bs = 'gp', k= 400)",
    sigma_formula = "1",
    family_use = "nb",
    n_cores = 30,
    usebam = FALSE,
    corr_formula = "1",
    copula = "gaussian",
    DT = TRUE,
    pseudo_obs = FALSE,
    return_model = FALSE,
    nonzerovar = FALSE,
    parallelization = "pbmcapply"
  )
yollct commented 11 months ago

Update, when I run for only 10 genes, it works. It is only taking longer than I expected. For the full dataset, it was running for four days and stopped by our server due to the time limit.

SONGDONGYUAN1994 commented 11 months ago

Hi, Sorry for the delay in response. Unfortunately the estimation of spatial data can be time consuming. Two solutions you can try:

  1. Set k = 100 instead of 400. The time increases almost quadratically with k. Smaller k may result in slight under-fit but usually 100 is large enough.
  2. Set usebam = TRUE. It will use a faster estimation.

Please let me know if it works for you!

Best, Dongyuan

lucas02061 commented 9 months ago

Hi,

When I try to run the tutorial code with the changes you mentioned I get the following warning?

Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G$w, : step failure in theta estimation

Is this a problem?

Best, Lucas

SONGDONGYUAN1994 commented 9 months ago

@lucas02061 Hi Lucas, This is an issue with bam: https://stats.stackexchange.com/questions/401278/step-failure-in-theta-estimation-using-bam-in-mgcv Based on my experience, the function does not converge, but usually, it will not be too bad. I think you are probably OK with the result.

Best, Dongyuan

lucas02061 commented 9 months ago

Hi @SONGDONGYUAN1994,

Are there any plans of adding a progress bar to scDesign3? The Marginal Fitting step is very slow on spatial transcriptomics data even with the settings you mentioned earlier. It is hard to know if it is unfeasible to do a run on a given dataset or it is almost done running without a progress bar.

Best, Lucas

SONGDONGYUAN1994 commented 9 months ago

@lucas02061 Thanks for raising this problem. Are you still using Windows right now? On Linux, you can use pbmcapply, but on Windows, I am not sure (there should be one); @qw130 Qingyang, do you have any idea?

Unfortunately, I am traveling in the next two weeks but I will begin on improving the speed after that. Could you first subsample ~ 100 genes and tell me the running time? Thanks!

Best, Dongyuan

lucas02061 commented 9 months ago

For only a 100 genes using the following settings it takes me approximatly15 minutes to run. Once I start going higher on genes though it starts taking many hours.

example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400 sigma_formula = "1", family_use = "nb", n_cores = 2, usebam = TRUE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "bpmapply" )

Thanks for your help.

Best, Lucas

lucas02061 commented 8 months ago

Hi @SONGDONGYUAN1994,

I tried running it on our server (which is quite heavy) for 24 hours and it is still stuck on marginal fitting. It still seems incredibly slow with the settings provided in this thread.

I used the following dataset: class: SingleCellExperiment dim: 18078 684 metadata(0): assays(1): counts rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1 rowData names(1): gene_id colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1 colData names(3): X Y cell_type reducedDimNames(0): mainExpName: NULL altExpNames(0): class: SingleCellExperiment dim: 18078 684 metadata(0): assays(1): counts rownames(18078): Xkr4 Sox17 ... CAAA01118383.1 CAAA01147332.1 rowData names(1): gene_id colnames(684): AAAGACCCAAGTCGCG-1 AAAGGGATGTAGCAAG-1 ... TTGTGGCCCTGACAGT-1 TTGTTCAGTGTGCTAC-1 colData names(3): X Y cell_type reducedDimNames(0): mainExpName: NULL altExpNames(0):

And I used the following settings. example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 100)", # should be 100-400 sigma_formula = "1", family_use = "nb", n_cores = 2, usebam = TRUE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "pbmcapply", BPPARAM=NULL )

Am I doing something wrong?

Best regards, Lucas

SONGDONGYUAN1994 commented 8 months ago

Hi Lucas, @lucas02061 I feel that it is probably caused by some very sparse genes; models on those genes might be hard to converge. That is not your fault - could you check if the running time is related to this? I can fix it later. @qw130 Hi Qingyang, could you help Lucas on this? Sorry that I am traveling next week.

feng-bao-ucsf commented 4 months ago

Dear Dongyuan,

I also have the same issue in fitting spatial data. Can I feed in 10 genes each time, and repeat the fitting multiple times in independent runs to get the result?

Best, Feng

SONGDONGYUAN1994 commented 4 months ago

Hi Feng, If you ignore the gene-gene correlation structure (or you believe that all variations should be explained by spatial coordinates), this is totally OK. It would be better that you can first divide genes into gene modules/groups then do this fitting rather than randomly split genes into 10-gene groups.

feng-bao-ucsf commented 4 months ago

Thank you. That is a great point. I will use gene module groups instead.

TotallyRealHolm commented 2 months ago

Hi developers, I want to add scDesign3 to a benchmarking paper that my research group is working on, but I also have trouble with running the tool for spatial data on a decently beefy linux server, where the server gets stuck on marginal fitting. It took approximately one hour and 24 minutes to run on the dataset from the tutorial for spatial data, is it expected for the tool to take a long time running on the this dataset?

I make use of the following parameters with multiple cores to speed it up. example_simu <- scdesign3( sce = SCD_Input, assay_use = "counts", celltype = "cell_type", pseudotime = NULL, spatial = c("X", "Y"), other_covariates = NULL, mu_formula = "s(X, Y, bs = 'gp', k= 400)", sigma_formula = "1", family_use = "nb", n_cores = 16, usebam = FALSE, corr_formula = "1", copula = "gaussian", DT = TRUE, pseudo_obs = FALSE, return_model = FALSE, nonzerovar = FALSE, parallelization = "pbmcapply" ))

Best Regards Nikolaj.

SONGDONGYUAN1994 commented 2 months ago

@TotallyRealHolm Dear Nikolaj, Thank you for your interest in our work. Unfortunately, the running time for spatial data is much longer since the spatial pattern is like a 2-D surface, which is much more complicated. We do have some ideas for boosting the simulation.

  1. Reduce the gene number. First, very sparse genes are very difficult to fit and you may set a threshold for gene expression. Second, genes with clear patterns are easier to converge. Therefore, if you would like to have DE/not DE setting, you probably only need to fit genes with high spatial variability as the true DE genes.
  2. Reduce the parameter k. Now k = 400, which is large. The fitting time is almost quadratic to k. You can try a smaller k, which will give you simpler spatial patterns - but the pattern is still there. Based on this idea, we also add a new parameter in function fit_marginal(edf_flexible = TRUE), which will automatically select the k.

Please let me know if it works for you. Thanks!

Best, Dongyuan