Extreme long processing time for runGenePeakcorr function

bellayqian commented 1 year ago

Hi all,

Thanks for the great work!

I was trying to use your getDORCScores function to obtain DORC scores from a modified output of the Seurat LinkPeaks function. I used the Seurat object's ATAC data to build the SummarizedExperiment object that getDORCScores takes as input. I then extracted the peak and gene columns from the LinkPeaks output to make my own dorcTab input, but that failed because I don't know what "rObs" is.

So I tried to follow the entire FigR DORC-calling process instead. Again, I set up the inputs ATAC.se and RNAmat as follows:

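# Construct SummarizedExperiment object for FigR
library(Signac)                 # granges()
library(SummarizedExperiment)   # SummarizedExperiment()
library(FigR)                   # runGenePeakcorr()

counts <- multiomics.subset@assays$peak@counts   # raw peak-by-cell ATAC counts
rowR <- granges(multiomics.subset)               # peak ranges (taken from the object's default assay)
colData <- multiomics.subset@meta.data           # per-cell metadata
ATAC.se <- SummarizedExperiment(assays = list(counts = counts), rowRanges = rowR, colData = colData)
RNAmat <- multiomics.subset@assays$RNA@data      # normalized RNA expression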

corr <- runGenePeakcorr(ATAC.se = ATAC.se,
                        RNAmat = RNAmat,
                        genome = "mm10")

The runGenePeakcorr function is taking an extremely long time. It has been running for more than 24 hours without generating any warning or error message. The current output is:

Running pairs: 65001 to 70000
Running in parallel using 4 cores ..
Computing observed correlations ..
|========================================================================================| 100%, Elapsed 01:03
Finished!

Time Elapsed: 1.0446768005689 mins

Computing background correlations ..
|========================================================================================| 100%, Elapsed 01:03
......
(about 100 iterations of the above message, each with a similar elapsed time)

So it usually takes the computer about 2.5 hours to get through "Computing background correlations .." for each block of 5000 pairs. I don't know whether my input is wrong or whether this runtime is normal for my data. Please let me know if you need any additional information about my data. Thank you so much for helping me with this issue!

Update on Aug 17, 2022: I figured out that the "nCores" option is really important for shortening the processing time, so the only problem I have for now is whether I got the ATAC.se and RNAmat correct since they were from processed SeuratObject. Thank you so much for helping!!
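
For readers who land here with the same problem, a minimal sketch of passing nCores explicitly. The nCores argument is the one named in the update above. Choosing its value from parallel::detectCores() and the pick_cores() helper are assumptions of this sketch, not something FigR prescribes. The call itself is only attempted when FigR is installed and the two inputs from the post exist in the session.

```r
library(parallel)

# pick_cores() is a hypothetical helper: leave `reserve` cores free,
# and fall back to a single core if detection fails or too few exist.
pick_cores <- function(reserve = 1L) {
  detected <- detectCores()
  if (is.na(detected)) return(1L)
  max(1L, as.integer(detected) - as.integer(reserve))
}

n_cores <- pick_cores()

# Only run when FigR is installed and the inputs from the post are defined.
if (requireNamespace("FigR", quietly = TRUE) &&
    exists("ATAC.se") && exists("RNAmat")) {
  corr <- FigR::runGenePeakcorr(ATAC.se = ATAC.se,
                                RNAmat = RNAmat,
                                genome = "mm10",
                                nCores = n_cores)
}
```

More cores also mean more memory in use at once, so on a shared machine it may be worth reserving more than one core.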

vkartha commented 1 year ago

Hi there - thanks for sharing this log. The runtime (which we didn't explicitly benchmark in the paper) depends heavily on the number of cells (which drives the correlation coefficient calculations) and on the total number of peak-gene pairs being evaluated. That is one of the main reasons we recommend running on systems that allow parallel computation. Even so, this can still be the slowest step in the FigR pipeline (we typically ran on 4-8 cores, so as not to try anything completely unrealistic memory- or CPU-wise).
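
As a rough illustration of how the figures reported in the original post add up, here is a back-of-the-envelope sketch. It assumes runtime scales linearly with the number of 5000-pair blocks at the reported 2.5 hours per block. estimate_runtime_hours() is a hypothetical helper, not part of FigR.

```r
# Hypothetical helper: wall-clock estimate assuming every block of
# `chunk_size` peak-gene pairs takes `hours_per_chunk` hours.
estimate_runtime_hours <- function(n_pairs,
                                   chunk_size = 5000,
                                   hours_per_chunk = 2.5) {
  n_chunks <- ceiling(n_pairs / chunk_size)
  n_chunks * hours_per_chunk
}

# 65,000 pairs were already finished when block "65001 to 70000" started.
elapsed_so_far <- estimate_runtime_hours(65000)
elapsed_so_far  # 13 blocks * 2.5 h = 32.5
```

Under these assumptions, 32.5 hours is consistent with the "more than 24 hours" reported, which suggests the run was slow but progressing, not stuck.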

Regarding "whether I got the ATAC.se and RNAmat correct since they were from processed SeuratObject": from the looks of your code, it appears correct, as long as your RNA data has already been normalized (we expect it to be normalized, not raw). The scATAC counts will be normalized internally if the normalize parameter is set to TRUE (the default).
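
A quick way to sanity-check the inputs before a long run is sketched below on toy data. looks_normalized() and same_cells() are hypothetical helpers written for this illustration. The "non-integer values" test is only a heuristic for normalized data, and the requirement that both inputs list the same cells in the same order is an assumption of this sketch.

```r
# Heuristic: raw counts are whole numbers; log-normalized values generally are not.
looks_normalized <- function(mat) {
  vals <- mat[mat != 0]
  length(vals) > 0 && any(vals != round(vals))
}

# Assumption: both inputs should describe the same cells in the same order.
same_cells <- function(atac_mat, rna_mat) {
  identical(colnames(atac_mat), colnames(rna_mat))
}

# Toy data standing in for the real matrices.
cells <- c("cell1", "cell2", "cell3")
raw_rna <- matrix(c(0, 3, 1, 2, 0, 5), nrow = 2,
                  dimnames = list(c("geneA", "geneB"), cells))
# Library-size scaling followed by log1p, similar in spirit to log-normalization.
norm_rna <- log1p(t(t(raw_rna) / colSums(raw_rna)) * 1e4)
atac <- matrix(c(1, 0, 2, 0, 1, 1), nrow = 2,
               dimnames = list(c("peak1", "peak2"), cells))

looks_normalized(raw_rna)    # FALSE
looks_normalized(norm_rna)   # TRUE
same_cells(atac, norm_rna)   # TRUE
```

With the objects from the post, the corresponding checks would be looks_normalized(RNAmat) and same_cells(assay(ATAC.se), RNAmat). On large sparse matrices, inspecting the nonzero values directly will be faster than subsetting.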

We have actively been working on speeding up the peak-gene association testing, and will provide updated code (same function conventions) in a future version of FigR that can support faster runtimes. We appreciate the feedback you provided!

bellayqian commented 1 year ago

Thank you so much for your reply! Looking forward to the future version of FigR!

buenrostrolab / FigR

Extreme long processing time for runGenePeakcorr function #5