Danko-Lab / BayesPrism

A Fully Bayesian Inference of Tumor Microenvironment composition and gene expression

Question and suggestion for running with large bulk sample set #92

Open FGOSeal opened 1 month ago

FGOSeal commented 1 month ago

Dear developers:

First of all, thank you for developing this great algorithm; it has really helped my research. Recently, I have been trying to run BayesPrism with a scRNA-seq reference of 115,000 cells × 19,500 genes on a bulk dataset of 17,500 samples × 39,000 genes, using 16,029 protein_coding genes. The program ran on an HPC node with 80 cores and 2 TB of RAM through LSF, and it was killed at roughly the 2/3 time point (about 16 days) judging from the "Estimated time to complete" (about 25 days) printed in the "Run Gibbs sampling" part of stdout. I'm not sure whether the HPC node was unstable or the memory was insufficient.

I tested different bulk sample sizes and watched the output of `top`; I think the memory usage of BayesPrism can be divided into 4 stages:

- Stage 1: Before "Run Gibbs sampling". With 17,500 bulk samples, the max VIRT is about 70 GB and the max RES is about 60 GB.
- Stage 2: "Run Gibbs sampling". With 40/100/17,500 bulk samples, there are up to 80 processes, each with VIRT about 14 GB and RES about 3.5 GB.
- Stage 3: Probably "Run Gibbs sampling using updated reference". With 40/100 bulk samples, there are up to 80 processes, each with VIRT about 14 GB and RES about 1.5 GB.
- Stage 4: After all Stage-3 processes disappear, a new process starts. With 40 bulk samples it uses about RES = 48 GB = 1.2 GB × 40; with 100 bulk samples it uses about RES = 120 GB = 1.2 GB × 100.

Runs with 40/100/200 samples all finished successfully, and I am now running all bulk samples again. So my first question is: is the Stage-4 RES really ≈ 1.2 GB × N_bulk, so that the run with 17,500 samples will definitely fail once it reaches Stage 4? If so, I will stop my current calculation.

I also tried calculating the first 100 samples and the first 200 samples in two separate runs; the results for the same sample are not exactly the same. Repeating the calculation of the first 100 samples gives exactly identical results, so it must be the normalization over the input bulk samples that makes the 100-sample run differ from the 200-sample run. So my second question is: is it possible to split my bulk sample set, run it in several batches, and get the same or similar results as inputting all bulk samples in one run? Right now all I can do is subsampling, but it would be good if BayesPrism could better support large sample sets.

Finally, if Stage-4 RES ≈ 1.2 GB × N_bulk is right, maybe you could add a function to estimate the approximate maximum RAM consumption and warn the user at an early time point.

Best regards,
Yi-hua Jiang

tinyi commented 1 month ago

Hi Yi-hua,

Sorry for the delayed response.

The memory usage depends on the number of cell types and cell states, the number of bulk samples, and the number of genes. Below are a few suggestions for you to try.
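Since the gene dimension enters every stage, one lever is to reduce the number of genes before constructing the prism object. A minimal sketch using the package's filtering helpers (the object names `sc.dat` and `bk.dat` are placeholders for your own reference and bulk count matrices):

```r
library(BayesPrism)

# remove ribosomal / mitochondrial / sex-chromosome genes and genes expressed
# in very few cells, then keep protein-coding genes only (as in your current run)
sc.dat.filtered <- cleanup.genes(input = sc.dat,
                                 input.type = "count.matrix",
                                 species = "hs",
                                 gene.group = c("Rb", "Mrp", "other_Rb",
                                                "chrM", "MALAT1", "chrX", "chrY"),
                                 exp.cells = 5)

sc.dat.filtered.pc <- select.gene.type(sc.dat.filtered,
                                       gene.type = "protein_coding")
```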

Try removing everything in your work environment except for your input prism object, followed by cleaning up the memory using gc(). This will free a substantial amount of memory, since your input scRNA-seq data is also very large.
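A minimal sketch of that cleanup step, assuming your prism object is named `myPrism` (a placeholder):

```r
# keep only the prism object in the workspace, then trigger garbage collection
# before launching the run; everything removed here should be re-creatable
rm(list = setdiff(ls(), "myPrism"))
gc()
```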

Try using fewer cores. This may slightly increase the run-time, but will lower the memory usage.
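The number of parallel workers is controlled by the `n.cores` argument of run.prism(); a sketch (again with `myPrism` as a placeholder):

```r
# fewer parallel workers means fewer concurrent per-process copies of the data,
# so lower peak RAM at the cost of longer wall-clock time
bp.res <- run.prism(prism = myPrism, n.cores = 40)  # e.g. 40 instead of 80
```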

Regarding splitting the bulk samples, I would recommend doing so if there are underlying biological confounders. For example, you can split the bulk samples by experimental condition / batch / sex, etc. This is perfectly legitimate and will better fit the statistical assumptions of BayesPrism. However, the results from splitting the samples will differ from those obtained when you run all samples together, due to the design of the algorithm.
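If you do split, one possible pattern is sketched below. Here `batch` is assumed to be a metadata vector aligned with the rows of your bulk matrix `bk.dat`, the label vectors are your own annotations, and `key` should name your malignant cell type if you have one (NULL otherwise):

```r
# build one prism per biologically defined subset of bulk samples and run
# each subset separately; results are returned as a list keyed by batch
results <- lapply(split(rownames(bk.dat), batch), function(sample.ids) {
  prism.sub <- new.prism(reference        = sc.dat.filtered.pc,  # reference count matrix
                         mixture          = bk.dat[sample.ids, , drop = FALSE],
                         input.type       = "count.matrix",
                         cell.type.labels = cell.type.labels,
                         cell.state.labels = cell.state.labels,
                         key              = NULL)
  run.prism(prism = prism.sub, n.cores = 40)
})
```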

The other memory-consuming part is the imputation of the cell type-specific expression tensor Z. We will update BayesPrism soon to allow skipping this step, to better accommodate large datasets.
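Until then, if you only need the cell-type fractions downstream, one way to keep the session light after the run finishes (this does not avoid the Z imputation itself) is to extract theta, save what you need to disk, and drop the large result object; a sketch:

```r
# pull out the final cell-type fraction matrix (theta), optionally save the
# full result object to disk, then free it from memory
theta <- get.fraction(bp = bp.res,
                      which.theta = "final",
                      state.or.type = "type")
saveRDS(bp.res, "bp_res.rds")  # optional: keep the full result on disk
rm(bp.res); gc()
```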

Best,

Tinyi