VCCRI / Sierra

Discover differential transcript usage from polyA-captured single cell RNA-seq data

MergePeakCoordinates takes a long time! #61

Closed · cghmyway closed this issue 1 year ago

cghmyway commented 1 year ago

Hi, thank you for this package. When I ran it, the peak-merging step took a very long time and never produced any output. Here is the code I ran:

```r
setwd("F:/06.scRNA_PM_GM_DM")
library(Sierra)

reference.file <- "F:/chicken_index_file/01.gallus_star_index_ensembl/Gallus_gallus.GRCg6a.104.gtf"
junction.PM.file <- "F:/06.scRNA_PM_GM_DM/PM/outs/PM_junction.bed"
junction.GM.file <- "F:/06.scRNA_PM_GM_DM/GM/outs/GM_junction.bed"
junction.DM.file <- "F:/06.scRNA_PM_GM_DM/DM/outs/DM_junction.bed"
junction.file <- list(junction.PM.file = junction.PM.file,
                      junction.GM.file = junction.GM.file,
                      junction.DM.file = junction.DM.file)
bamfile <- c("F:/06.scRNA_PM_GM_DM/PM/outs/possorted_genome_bam.bam",
             "F:/06.scRNA_PM_GM_DM/GM/outs/possorted_genome_bam.bam",
             "F:/06.scRNA_PM_GM_DM/DM/outs/possorted_genome_bam.bam")
whitelist.bc.file <- c("F:/06.scRNA_PM_GM_DM/PM/outs/barcodes.tsv/PM.barcodes.tsv",
                       "F:/06.scRNA_PM_GM_DM/GM/outs/barcodes.tsv/GM.barcodes.tsv",
                       "F:/06.scRNA_PM_GM_DM/DM/outs/barcodes.tsv/DM.barcodes.tsv")
peak.output.file <- c("PM_peaks.txt", "GM_peaks.txt", "DM_peaks.txt")

FindPeaks(output.file = peak.output.file[1],
          gtf.file = reference.file,
          bamfile = bamfile[1],
          junctions.file = junction.PM.file,
          ncores = 10)
#> Import genomic features from the file as a GRanges object ... OK
#> Prepare the 'metadata' data frame ... OK
#> Make the TxDb object ... OK
#> 24356 gene entries to process
#> There are 37188 unfiltered sites and 33823 filtered sites
#> There are 33722 sites following duplicate removal
#> Warning message:
#> In .get_cds_IDX(mcols0$type, mcols0$phase) :
#>   The "phase" metadata column contains non-NA values for features of
#>   type stop_codon. This information was ignored.

FindPeaks(output.file = peak.output.file[2],
          gtf.file = reference.file,
          bamfile = bamfile[2],
          junctions.file = junction.GM.file,
          ncores = 20)
#> (same import messages and warning as above)
#> 24356 gene entries to process
#> There are 41141 unfiltered sites and 38711 filtered sites
#> There are 38618 sites following duplicate removal

FindPeaks(output.file = peak.output.file[3],
          gtf.file = reference.file,
          bamfile = bamfile[3],
          junctions.file = junction.DM.file,
          ncores = 20)
#> (same import messages and warning as above)
#> 24356 gene entries to process
#> There are 34357 unfiltered sites and 31251 filtered sites
#> There are 31169 sites following duplicate removal

peak.dataset.table <- data.frame(Peak_file = peak.output.file,
                                 Identifier = c("PM", "GM", "DM"),
                                 stringsAsFactors = FALSE)
View(peak.dataset.table)

peak.merge.output.file <- "merge_peaks.txt"
MergePeakCoordinates(peak.dataset.table = peak.dataset.table,
                     output.file = peak.merge.output.file,
                     ncores = 10)
#> [1] "Performing internal peak merging for PM"
#> [1] "Performing internal peak merging for GM"
#> [1] "Performing internal peak merging for DM"
#> [1] "Comparing peaks from PM to remaining data-sets"
#> [1] "Comparing peaks from GM to remaining data-sets"
#> [1] "Comparing peaks from DM to remaining data-sets"
```

My BAM files for PM, GM and DM were 36.8, 24.8 and 35.3 GB, respectively. Is there any data or guidance on the expected run time?
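For reference, one way to record the run time is to wrap the merge call in base R's `system.time()`; a minimal sketch reusing the objects defined above:

```r
# Minimal timing sketch: wrap the merge call in system.time() so the
# elapsed wall-clock time is captured alongside the run itself.
merge.time <- system.time(
  MergePeakCoordinates(peak.dataset.table = peak.dataset.table,
                       output.file = peak.merge.output.file,
                       ncores = 10)
)
merge.time["elapsed"]  # elapsed time in seconds
```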

rj-patrick commented 1 year ago

Sorry for the slow response. I wouldn't expect three datasets to take a long time to merge. How many peaks are in each of the files?
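If it helps, a quick way to count them, assuming each `FindPeaks` output file is plain text with one peak per line after a single header row (adjust if the layout differs):

```r
# Count peaks per FindPeaks output file. Assumes one peak per line
# after a single header row; adjust if the file layout differs.
peak.files <- c("PM_peaks.txt", "GM_peaks.txt", "DM_peaks.txt")
sapply(peak.files, function(f) length(readLines(f)) - 1L)
```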

alanlamsiu commented 1 year ago

Hi @rj-patrick. We are facing a similar issue. With a total of 37 samples, the merge step has been running for three weeks and has not finished yet. There are > 2.8 million peaks across all samples, and `ncores` is set to 50. Do you have any insights regarding our situation? Thanks.

rj-patrick commented 1 year ago

Hi @alanlamsiu, thanks for that. Do you have any output from the `MergePeakCoordinates` function? It would be helpful to know how far it has gotten. Three weeks is too long, but this is a much larger dataset than what we've tested it on. There are a couple of potential issues, and the output would be the best way to diagnose which one it is. Cheers, Ralph

alanlamsiu commented 1 year ago

Thanks @rj-patrick for getting back. I can see the size of the log file keeps increasing; it now has 27,482,128 lines, yet the output specified by `output.file` has not been generated. Based on the log file, I believe the "internal peak merging" step is done for all 37 samples, and the "Comparing peaks from [dataset] to remaining data-sets" messages have already been printed. For computing resources, the average memory usage is 4.5 GB and peak CPU usage is 39.05.
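For reference, this is roughly how we checked progress from the captured log; `sierra_merge.log` below is a placeholder for your own log path:

```r
# Count how many datasets have reached each stage of the merge,
# based on the progress messages in a captured log file.
# "sierra_merge.log" is a placeholder for your own log path.
log.lines <- readLines("sierra_merge.log")
sum(grepl("Performing internal peak merging", log.lines))
sum(grepl("Comparing peaks from", log.lines))
```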

I recall that when I ran a test with five samples, totalling > 490 thousand peaks, it took less than a week to complete.

Please let me know if you need more details.

Thanks.

rj-patrick commented 1 year ago

Thanks. To clarify: for how many of the 37 datasets has the message "Comparing peaks from [DATASET X] to remaining data-sets" been printed?

alanlamsiu commented 1 year ago

The message has been printed for all 37 datasets.

rj-patrick commented 1 year ago

Thanks. I think I know where the problem is. There is a final step where peaks are iteratively checked for merging, and my guess is that with your dataset it is getting stuck in a loop. Perhaps the best thing at this point is to simply merge whatever remains, so for now I've set a limit on the number of iterations to go through. Pull the latest update and see if that fixes the issue.
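To illustrate the idea, a minimal sketch only, not Sierra's actual implementation (the overlap collapsing here is a stand-in using `GenomicRanges::reduce()`):

```r
library(GenomicRanges)

# Illustrative sketch only, not Sierra's implementation: repeatedly
# collapse overlapping peaks until the set stops shrinking, but cap
# the number of passes so a pathological input cannot loop forever.
merge_until_stable <- function(peaks.gr, max.iterations = 100) {
  for (i in seq_len(max.iterations)) {
    merged <- GenomicRanges::reduce(peaks.gr)  # collapse overlaps
    if (length(merged) == length(peaks.gr)) {
      return(merged)  # converged: no further merging possible
    }
    peaks.gr <- merged
  }
  warning("Iteration limit reached; returning peaks merged so far")
  merged
}
```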

alanlamsiu commented 1 year ago

Thanks @rj-patrick for the fix. I tried the updated version and the run finished within a few days. I think I am good to go with it.

cghmyway commented 1 year ago

Thank you for solving the problem!