deeptools / deepTools

Tools to process and analyze deep sequencing data.

Issues with bamCoverage: very slow with relatively big BAM files #1046

Open · Ying-Zhou-428 opened this issue 3 years ago

Ying-Zhou-428 commented 3 years ago

Hi everyone,

I use bamCoverage (bamCoverage version: 2.5.3, python version: 2.7.5) to convert sorted BAM files to bigWig files for visualization in the UCSC Genome Browser. It usually takes less than 1 h to convert a 2 GB BAM file to a bigWig file. I recently used the same command to process an 8.2 GB BAM file and noticed that bamCoverage was terribly slow: it ran for several days and no bigWig file was generated. I noticed that the bam.bai files were generated within several minutes.

I have changed the bin size to 100 and given it the maximum number of threads (up to 20 CPUs), but nothing really sped up the process.

The command I use:

bamCoverage -bs 100 -b input.sort.bam -p max -o output.norm.bw --normalizeUsingRPKM

The log:

normalization: RPKM
minFragmentLength: 0
verbose: False
out_file_for_raw_data: None
numberOfSamples: None
bedFile: None
bamFilesList: ['/processed_NGS/input.sort.bam']
ignoreDuplicates: False
numberOfProcessors: 20
samFlag_exclude: None
save_data: False
stepSize: 100
smoothLength: None
center_read: False
defaultFragmentLength: read length
chrsToSkip: []
region: None
maxPairedFragmentLength: 1000
samFlag_include: None
binLength: 100
blackListFileName: None
maxFragmentLength: 0
minMappingQuality: None
zerosToNans: False

I do not know what the bug is; I wonder if anyone could help. Many thanks in advance.
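One way to narrow this down, as a sketch (it assumes deepTools' --region option and a chromosome named chr1 in the BAM header; adjust both to your data), is to time a run restricted to a single chromosome:

```bash
# Time a region-restricted run: if even one chromosome is slow,
# the bottleneck is throughput rather than overall file size.
time bamCoverage -bs 100 -p 4 \
  -b input.sort.bam \
  --region chr1 \
  -o test.chr1.bw \
  --normalizeUsingRPKM
```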

dpryan79 commented 3 years ago

Perhaps the IO on your system is extremely slow or you have a very large number of small contigs? Those are the most common causes of poor performance.
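To rule out the first cause, a rough way to measure read throughput on the filesystem holding the BAM (a sketch assuming GNU dd; it streams the first ~2 GB of the file and reports MB/s):

```bash
# Read the first 2 GB of the BAM to /dev/null and report throughput.
dd if=input.sort.bam of=/dev/null bs=1M count=2048
```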

Ying-Zhou-428 commented 3 years ago

Thank you very much for your reply. bamCoverage was run on a high-performance computing platform, so can I assume the IO is adequate for this amount of computation? This might sound stupid, but I do not know how to check or handle contigs. Could you please explain? Thanks a lot.

dpryan79 commented 3 years ago

Run samtools view -H on the BAM file and count the number of lines starting with @SQ. That's the number of contigs in your genome. If it's a high number (e.g., 7,000 or 50,000), then that's the reason for the poor performance.
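As a one-liner (assuming samtools is on the PATH and the file is named input.sort.bam):

```bash
# Count the reference-sequence (@SQ) lines in the BAM header.
samtools view -H input.sort.bam | grep -c '^@SQ'
```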

Ying-Zhou-428 commented 3 years ago

Thank you very much. The contig number of the BAM file is around 500, so I think it is not a very large number. What do you think about randomly sampling 30% of the total sequencing reads and running bamCoverage on that?
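For reference, one way to do that subsampling with samtools (a sketch; in view's -s option the integer part is a random seed and the fractional part is the proportion of reads to keep):

```bash
# Keep roughly 30% of the reads (seed 42), then index the result.
samtools view -b -s 42.30 input.sort.bam > input.sub30.bam
samtools index input.sub30.bam
```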

Ying-Zhou-428 commented 3 years ago

Also, I found that bamCoverage can handle files of several dozen GB, so my files are not too large to be processed. The temporary bam.bai files were produced within several minutes; it seemed like the process stalled right after the bam.bai files appeared.
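One thing worth ruling out, as a hedged suggestion: build the index yourself beforehand so bamCoverage can pick up an existing .bai instead of creating one, and see whether the run gets past that point (the -@ thread flag assumes a reasonably recent samtools; drop it otherwise):

```bash
# Pre-build the index with several threads, then run bamCoverage as before.
samtools index -@ 8 input.sort.bam
bamCoverage -bs 100 -p 8 -b input.sort.bam -o output.norm.bw --normalizeUsingRPKM
```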

dpryan79 commented 3 years ago

I haven't a clue why things are so slow on your system.

Ying-Zhou-428 commented 3 years ago

Yeah, I understand. Thank you very much for your help.