Long read bam processing

SHuang-Broad commented 2 years ago

Hi,

We're experimenting with VerifyBamID2 on long reads contamination estimations. The test sample is a 30X PacBio CCS BAM aligned to HG38.

We found that the process is taking a bit longer than expected:

it seems to be processing the BAM using a single thread and
it took approximately 3 hours for processing chr1 reads

I know this is an off-label use, so would love any suggestions on how to carry out this experiment (including submitting PRs).

Thank you! Steve

Griffan commented 2 years ago

Hi, @SHuang-Broad There was an old ticket for ONT reads, which should have touched parts of you issue:https://github.com/Griffan/VerifyBamID/issues/32

It's known to be slow(potentially with large memory consumption) using default settings with htslib when reading long read bams. (Due to calibration of qualities) If possible, you can provide a subset of the pileup of the BAM file with "—Pileup" argument as the input(which bypassed BAM input processing step)

Let me know if this workaround resolved your issue.

SHuang-Broad commented 2 years ago

Thanks for pointing me in that direction!

One follow-up question: how do I generate the pileup file, with samtools mpileup? There's a flag (--OutputPileup) in VerifyBamID; it cannot be running VerifyBamID with this option first, right?

In the meantime, I also run the following experiment:

subset the 30X CCS BAM using the 4 BED files in the resources folder first, respectively, then
run VerifyBamID using these BAMs against their matching BED file, respectively

It took approximately 4 hours using 10k sites (1kg and hdp). The two runs with 100k sites are still running (9 hours now). All are single-threaded.

Thanks! Steve

SHuang-Broad commented 2 years ago

One more question out of curiosity: when you mention the slowness possibly comes from htslib parsing the long reads due to calibration of qualities, I wonder

would updating to the latest version of htslib be helpful (I'm using docker v2.0.1)?
can you point me to where I can read more about this calibration of qualities?

Thank you! Steve

Griffan commented 2 years ago

Thanks for pointing me in that direction!

One follow-up question: how do I generate the pileup file, with samtools mpileup? There's a flag (--OutputPileup) in VerifyBamID; it cannot be running VerifyBamID with this option first, right?

In the meantime, I also run the following experiment:

subset the 30X CCS BAM using the 4 BED files in the resources folder first, respectively, then

run VerifyBamID using these BAMs against their matching BED file, respectively

It took approximately 4 hours using 10k sites (1kg and hdp). The two runs with 100k sites are still running (9 hours now). All are single-threaded.

Thanks! Steve

"—OutputPileup" is used for debugging purposes in VB2 when the input file is a bam file. I don't think there will be a difference when the linked htslib in samtools and VB2 are the same.

I think you procedure is worth exploring, but you need to disable the quality calibration argument as disscussed in the referenced issue ticket #32 or refer to http://www.htslib.org/doc/samtools-mpileup.html BAQ related options.

SHuang-Broad commented 2 years ago

Hi Fan,

Sorry, I forgot to provide an update.

I ran an experiment with different options, and here's what I found

VerifyBAMID on a 50X CCS BAM (GIAB HG002 GRCh38)

setup                                              wall-clock time         Alpha
no BAQ, 10K sites     35 minutes ( 15 minutes prep, 20 minutes VB)    0.00217684
BAQ                  100 minutes ( 80 minutes prep, 20 minutes VB)    0.00155812
100K sites           110 minutes ( 45 minutes prep, 55 minutes VB)    0.00220078
BAQ + 100K           520 minutes (460 minutes prep, 60 minutes VB)    0.00181208

So based on this little experiment,

contamination estimation seems to be affected by BAQ (no material difference though, it appears),
both BAQ option and # of sites affect CPU time (threads availability the same throughout)

The WDL pipeline I used is coded here. Briefly, it does the BAM to mpileup conversion per chromosome in parallel and merges the text files later. VerifyBAMID2 then picks up the merged pileup file and does its work, with 4 threads.

Thanks, Steve

Griffan commented 2 years ago

@SHuang-Broad Thank you very much! This profiling is useful for future reference.

Griffan / VerifyBamID

Long read bam processing #43