broadinstitute / warp

WDL Analysis Research Pipelines
https://broadinstitute.github.io/warp
BSD 3-Clause "New" or "Revised" License

Necessity to Sort Bam #1415

Open eprdz opened 1 week ago

eprdz commented 1 week ago

Hello, I am using the Whole Genome Germline Single Sample workflow for large WGS experiments. I noticed that the task that sorts the BAM after MarkDuplicates consumes 60% to 80% of the workflow's execution time. I was wondering whether this step is strictly necessary and whether there is any way to speed it up, for example by using samtools sort instead of Picard SortSam.

Thank you in advance.

jessicaway commented 3 days ago

Hi @eprdz,

I believe the sorting is needed for downstream steps, but @kachulis may be able to comment.

As for sorting tools, yes, samtools sort is in fact faster than SortSam, especially when run with multiple threads. We hope to get to optimizing our WGS pipeline soon; however, that work is likely on the order of months rather than weeks for our team. Feel free to fork the repo and make the changes you need in the meantime.
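
For anyone trying the swap in a fork, here is a minimal sketch of what a multithreaded samtools sort invocation could look like, wrapped in Python so the command can be checked before it is run. The helper name `samtools_sort_cmd`, the file names, and the thread and memory values are all hypothetical and not taken from the WARP repo. Only the `samtools sort` options `-@`, `-m`, `-T`, and `-o` are used. If downstream tasks rely on anything else the existing SortSam task produces (an index or checksum file, for instance), that would need to be handled separately.

```python
import shlex
import subprocess


def samtools_sort_cmd(in_bam, out_bam, threads=4, mem_per_thread="2G", tmp_prefix=None):
    """Build a coordinate-sort command line for `samtools sort`.

    All defaults here are illustrative assumptions, not tuned values.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    cmd = [
        "samtools", "sort",
        "-@", str(threads),      # additional sorting/compression threads
        "-m", mem_per_thread,    # maximum memory per thread
        "-o", out_bam,           # write the sorted output to this file
    ]
    if tmp_prefix is not None:
        cmd += ["-T", tmp_prefix]  # prefix for temporary files
    cmd.append(in_bam)             # input BAM goes last
    return cmd


def run_sort(in_bam, out_bam, **kwargs):
    """Run the sort; raises CalledProcessError if samtools fails."""
    subprocess.run(samtools_sort_cmd(in_bam, out_bam, **kwargs), check=True)


# Print the command for inspection (hypothetical file names).
print(shlex.join(samtools_sort_cmd("sample.dupmarked.bam", "sample.sorted.bam", threads=8)))
```

Since `samtools sort` sorts by coordinate by default, no extra flag is needed for a coordinate-sorted output. Memory use scales roughly with `threads` times `mem_per_thread`, so the task's memory request would have to be sized to match.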