Griffan / VerifyBamID

VerifyBamID2: A robust tool for DNA contamination estimation from sequence reads using an ancestry-agnostic method.
http://griffan.github.io/VerifyBamID/

Long read bam processing #43

Open SHuang-Broad opened 2 years ago

SHuang-Broad commented 2 years ago

Hi,

We're experimenting with VerifyBamID2 for contamination estimation on long reads. The test sample is a 30X PacBio CCS BAM aligned to HG38.

We found that the process is taking a bit longer than expected.

I know this is an off-label use, so would love any suggestions on how to carry out this experiment (including submitting PRs).

Thank you! Steve

Griffan commented 2 years ago

Hi @SHuang-Broad, there was an old ticket for ONT reads that should touch on parts of your issue: https://github.com/Griffan/VerifyBamID/issues/32

It's known to be slow (potentially with large memory consumption) under the default settings when htslib reads long-read BAMs, due to the calibration of qualities. If possible, you can provide a subset of the pileup of the BAM file as the input via the "--Pileup" argument, which bypasses the BAM input processing step.
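
A minimal sketch of that workaround might look like the following (the pileup-input option is spelled --PileupFile in the current README, and the resource paths are just examples, so double-check both against your install):

```bash
# Sketch only: pile up just the SVD marker positions, then let VerifyBamID2
# read the pileup instead of streaming the whole long-read BAM through htslib.
SVD=resource/1000g.phase3.100k.b38.vcf.gz.dat   # example SVD resource prefix
REF=GRCh38.fa

# restrict the pileup to the marker sites shipped with the resource set
samtools mpileup -f "$REF" -l "${SVD}.bed" -o sample.pileup sample.ccs.bam

# run VerifyBamID2 on the pre-computed pileup
VerifyBamID --SVDPrefix "$SVD" --Reference "$REF" --PileupFile sample.pileup
```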

Let me know if this workaround resolved your issue.

SHuang-Broad commented 2 years ago

Thanks for pointing me in that direction!


One follow-up question: how do I generate the pileup file, with samtools mpileup? There is a flag (--OutputPileup) in VerifyBamID, but the answer can't be to run VerifyBamID with that option first, right?


In the meantime, I also ran the following experiment:

  1. subset the 30X CCS BAM using each of the 4 BED files in the resources folder, then
  2. run VerifyBamID on each subset BAM against its matching BED file (roughly as sketched below)
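
Concretely, each per-panel run looked roughly like this sketch (paths and exact option names are illustrative and should be checked against the VerifyBamID2 README):

```bash
# Sketch of one per-panel run (repeated for each of the four resource sets).
SVD=resource/1000g.phase3.10k.b38.vcf.gz.dat   # example: the 10K b38 panel
REF=GRCh38.fa

# 1. keep only reads overlapping this panel's marker BED
samtools view -b -L "${SVD}.bed" -o subset.10k.bam full.30x.ccs.bam
samtools index subset.10k.bam

# 2. run VerifyBamID2 on the subset against the matching SVD resource
VerifyBamID --SVDPrefix "$SVD" --Reference "$REF" --BamFile subset.10k.bam
```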

It took approximately 4 hours using the 10k sites (1KG and HGDP). The two runs with the 100k sites are still running (9 hours so far). All runs are single-threaded.

Thanks! Steve

SHuang-Broad commented 2 years ago

One more question out of curiosity: when you mention that the slowness possibly comes from htslib parsing the long reads due to the calibration of qualities, I wonder whether that calibration step can simply be disabled for this use case.

Thank you! Steve

Griffan commented 2 years ago

"—OutputPileup" is used for debugging purposes in VB2 when the input file is a bam file. I don't think there will be a difference when the linked htslib in samtools and VB2 are the same.

I think your procedure is worth exploring, but you need to disable the quality calibration argument as discussed in the referenced issue ticket #32, or refer to the BAQ-related options at http://www.htslib.org/doc/samtools-mpileup.html.
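
If you go the samtools mpileup route, -B is the switch that disables BAQ; a rough sketch (resource paths and the VB2 option names are illustrative, so check them against your install):

```bash
# Sketch: disable BAQ recalculation (-B), which is the expensive step for
# long reads, when building the pileup handed to VerifyBamID2.
SVD=resource/1000g.phase3.100k.b38.vcf.gz.dat
REF=GRCh38.fa

samtools mpileup -B -f "$REF" -l "${SVD}.bed" -o sample.noBAQ.pileup sample.ccs.bam
VerifyBamID --SVDPrefix "$SVD" --Reference "$REF" --PileupFile sample.noBAQ.pileup
```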

SHuang-Broad commented 2 years ago

Hi Fan,

Sorry, I forgot to provide an update.

I ran an experiment with different options, and here's what I found:

VerifyBAMID on a 50X CCS BAM (GIAB HG002 GRCh38)

| Setup | Wall-clock time | Alpha |
| --- | --- | --- |
| no BAQ, 10K sites | 35 min (15 min prep, 20 min VB) | 0.00217684 |
| BAQ | 100 min (80 min prep, 20 min VB) | 0.00155812 |
| 100K sites | 110 min (45 min prep, 55 min VB) | 0.00220078 |
| BAQ + 100K | 520 min (460 min prep, 60 min VB) | 0.00181208 |

So based on this little experiment, turning BAQ off saves a large amount of wall-clock time, while the Alpha estimates all stay in the same ballpark (around 0.002).

The WDL pipeline I used is coded here. Briefly, it does the BAM-to-mpileup conversion per chromosome in parallel and merges the text files afterwards; VerifyBAMID2 then picks up the merged pileup file and does its work with 4 threads (roughly as sketched below).
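
In shell terms, the pipeline does roughly the following (a simplified sketch; the real version is WDL with a scatter over chromosomes, and the VerifyBamID2 option names should be checked against the README):

```bash
# Simplified shell equivalent: per-chromosome pileups in parallel, merge,
# then a single multi-threaded VerifyBamID2 run on the merged pileup.
SVD=resource/1000g.phase3.100k.b38.vcf.gz.dat
REF=GRCh38.fa
CHRS=$(echo chr{1..22} chrX chrY)

# launch one mpileup per chromosome (this starts them all at once; throttle as needed)
for CHR in $CHRS; do
    samtools mpileup -B -f "$REF" -l "${SVD}.bed" -r "$CHR" \
        -o "pileup.${CHR}.txt" sample.ccs.bam &
done
wait

# merge the per-chromosome text pileups in chromosome order
cat $(for CHR in $CHRS; do echo "pileup.${CHR}.txt"; done) > merged.pileup

VerifyBamID --SVDPrefix "$SVD" --Reference "$REF" \
    --PileupFile merged.pileup --NumThread 4
```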

Thanks, Steve

Griffan commented 2 years ago

@SHuang-Broad Thank you very much! This profiling is useful for future reference.