kaichop closed this issue 9 years ago
Below is a breakdown of time consumption for a whole-genome data set (HiSeq-J from AllSeq.com). You are right, Picard's removal of PCR duplicates is slow. I'll try SAMtools, but previously the SAMtools result was not accepted by GATK.
Alignment stats calculation is also slow. I want to write a multi-threaded version but haven't gotten around to it.
QC assessment on BAM files: 151.9 min
Remove duplicates: 842.8 min
Filter BAM file: 133.7 min
Index BAM file: 17.8 min
GATK realignment: 2.3 min
Apply realignment: 20.2 min
Index BAM files: 0.5 min
GATK HaplotypeCaller variant calling: 7.8 min
GATK variant filtering: 5.7 min
SAMtools variant calling: 40.1 min
Varscan variant calling: 42.8 min
Merge split VCF: 0.4 min
Rename VCF: 0.0 min
Merge split VCF: 0.3 min
Rename VCF: 0.0 min
Extract variants in custom regions: 0.1 min
Extract variants in custom regions: 0.1 min
Extract variants in custom regions: 0.1 min
Extract consensus calls: 0.4 min
Generate QC stat: 0.0 min
Generate Venn diagram: 0.2 min
Generate alignment and coverage stat: 482.1 min
Generate variant stat: 0.1 min
Remove intermediate files: 0.0 min
Remove temporary files: 0.0 min
Generating html report: 0.2 min
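To see at a glance where the time goes, the breakdown above can be parsed and ranked. This is only a minimal sketch: the function names (`parse_timings`, `rank_steps`) are made up for illustration, and it assumes every log line has the form `<step>: <minutes> min`. Steps that appear more than once (for example "Merge split VCF") are summed.

```python
import re

# The timing lines from the breakdown above, pasted as-is.
LOG = """\
QC assessment on BAM files: 151.9 min
Remove duplicates: 842.8 min
Filter BAM file: 133.7 min
Index BAM file: 17.8 min
GATK realignment: 2.3 min
Apply realignment: 20.2 min
Index BAM files: 0.5 min
GATK HaplotypeCaller variant calling: 7.8 min
GATK variant filtering: 5.7 min
SAMtools variant calling: 40.1 min
Varscan variant calling: 42.8 min
Merge split VCF: 0.4 min
Rename VCF: 0.0 min
Merge split VCF: 0.3 min
Rename VCF: 0.0 min
Extract variants in custom regions: 0.1 min
Extract variants in custom regions: 0.1 min
Extract variants in custom regions: 0.1 min
Extract consensus calls: 0.4 min
Generate QC stat: 0.0 min
Generate Venn diagram: 0.2 min
Generate alignment and coverage stat: 482.1 min
Generate variant stat: 0.1 min
Remove intermediate files: 0.0 min
Remove temporary files: 0.0 min
Generating html report: 0.2 min
"""

# Assumed line format: "<step>: <minutes> min"
LINE_RE = re.compile(r"^(?P<step>.+?):\s*(?P<minutes>\d+(?:\.\d+)?)\s*min\s*$")


def parse_timings(text):
    """Return {step: total minutes}, summing steps that appear more than once."""
    totals = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a timing line
        step = match.group("step")
        totals[step] = totals.get(step, 0.0) + float(match.group("minutes"))
    return totals


def rank_steps(totals):
    """Return (step, minutes, fraction of total) tuples, slowest step first."""
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(step, minutes, minutes / grand_total) for step, minutes in ranked]


if __name__ == "__main__":
    for step, minutes, share in rank_steps(parse_timings(LOG))[:5]:
        print(f"{minutes:8.1f} min  {share:6.1%}  {step}")
```

Running it prints the five slowest steps with their share of the total run time, which makes it easier to decide which step to optimize first.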
SAMtools takes 293 min to remove PCR duplicates, and its result is now accepted by GATK. SAMtools 1.1 cannot do rmdup at the moment. Alignment stats calculation takes 514 min with SAMtools 1.1, so apparently the new version did not improve the 'depth' subprogram.
I will replace Picard rmdup with SAMtools rmdup.
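One possible way to speed up the alignment stats step is to run `samtools depth` on several regions at once. The sketch below is not the pipeline's actual code. It assumes an indexed BAM file, uses the existing `-r` region option of `samtools depth`, and the helper names (`depth_command`, `summarize_depth`, `parallel_depth`) are hypothetical. The `run` parameter exists only so the logic can be tried without SAMtools installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def depth_command(bam, region):
    """Build the samtools command that reports per-base depth for one region."""
    return ["samtools", "depth", "-r", region, bam]


def run_samtools(cmd):
    """Stream a command's stdout line by line; raise if it exits non-zero."""
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        for line in proc.stdout:
            yield line
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd)


def summarize_depth(lines):
    """Sum tab-separated 'contig, position, depth' lines.

    Returns (number of reported positions, summed depth). By default
    samtools depth omits zero-coverage positions, so they are not counted.
    """
    positions = 0
    total_depth = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # ignore malformed or empty lines
        positions += 1
        total_depth += int(fields[2])
    return positions, total_depth


def parallel_depth(bam, regions, workers=4, run=run_samtools):
    """Run one samtools depth job per region, up to `workers` at a time."""
    def one_region(region):
        return region, summarize_depth(run(depth_command(bam, region)))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one_region, regions))


if __name__ == "__main__":
    # Hypothetical file and contig names, for illustration only.
    stats = parallel_depth("sample.bam", ["chr1", "chr2", "chr3"], workers=3)
    for region, (positions, total_depth) in stats.items():
        print(region, positions, total_depth)
```

The SAMtools processes run concurrently, but the line parsing here still happens in a single Python interpreter, so for whole-genome input it would probably need to be moved into separate processes or a compiled tool before the speed-up is worthwhile.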
Do some research on how much time each procedure takes on whole-genome data.
If Picard is too slow, we should just switch to SAMtools to improve speed; Picard seems to take a lot of time.