Open SHuang-Broad opened 2 years ago
Hi, @SHuang-Broad There was an old ticket for ONT reads, which should have touched parts of you issue:https://github.com/Griffan/VerifyBamID/issues/32
It's known to be slow(potentially with large memory consumption) using default settings with htslib when reading long read bams. (Due to calibration of qualities) If possible, you can provide a subset of the pileup of the BAM file with "—Pileup" argument as the input(which bypassed BAM input processing step)
Let me know if this workaround resolved your issue.
Thanks for pointing me in that direction!
One follow-up question: how do I generate the pileup file, with samtools mpileup
?
There's a flag (--OutputPileup
) in VerifyBamID
; it cannot be running VerifyBamID
with this option first, right?
In the meantime, I also run the following experiment:
resources
folder first, respectively, thenVerifyBamID
using these BAMs against their matching BED file, respectivelyIt took approximately 4 hours using 10k sites (1kg and hdp). The two runs with 100k sites are still running (9 hours now). All are single-threaded.
Thanks! Steve
One more question out of curiosity: when you mention the slowness possibly comes from htslib parsing the long reads due to calibration of qualities, I wonder
Thank you! Steve
Thanks for pointing me in that direction!
One follow-up question: how do I generate the pileup file, with
samtools mpileup
? There's a flag (--OutputPileup
) inVerifyBamID
; it cannot be runningVerifyBamID
with this option first, right?In the meantime, I also run the following experiment:
- subset the 30X CCS BAM using the 4 BED files in the
resources
folder first, respectively, then- run
VerifyBamID
using these BAMs against their matching BED file, respectivelyIt took approximately 4 hours using 10k sites (1kg and hdp). The two runs with 100k sites are still running (9 hours now). All are single-threaded.
Thanks! Steve
"—OutputPileup" is used for debugging purposes in VB2 when the input file is a bam file. I don't think there will be a difference when the linked htslib in samtools and VB2 are the same.
I think you procedure is worth exploring, but you need to disable the quality calibration argument as disscussed in the referenced issue ticket #32 or refer to http://www.htslib.org/doc/samtools-mpileup.html BAQ related options.
Hi Fan,
Sorry, I forgot to provide an update.
I ran an experiment with different options, and here's what I found
VerifyBAMID on a 50X CCS BAM (GIAB HG002 GRCh38)
setup wall-clock time Alpha
no BAQ, 10K sites 35 minutes ( 15 minutes prep, 20 minutes VB) 0.00217684
BAQ 100 minutes ( 80 minutes prep, 20 minutes VB) 0.00155812
100K sites 110 minutes ( 45 minutes prep, 55 minutes VB) 0.00220078
BAQ + 100K 520 minutes (460 minutes prep, 60 minutes VB) 0.00181208
So based on this little experiment,
The WDL pipeline I used is coded here.
Briefly, it does the BAM to mpileup conversion per chromosome in parallel and merges the text files later.
VerifyBAMID2
then picks up the merged pileup file and does its work, with 4 threads.
Thanks, Steve
@SHuang-Broad Thank you very much! This profiling is useful for future reference.
Hi,
We're experimenting with
VerifyBamID2
on long reads contamination estimations. The test sample is a 30X PacBio CCS BAM aligned to HG38.We found that the process is taking a bit longer than expected:
I know this is an off-label use, so would love any suggestions on how to carry out this experiment (including submitting PRs).
Thank you! Steve