Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

scanbam, with many reginos in which, is very slow #44

Closed gevro closed 2 years ago

gevro commented 2 years ago

Hello, I have a target of 10,000 regions (10,000 different contigs, each with a different range), and scanBam with which is very slow. It basically never finishes in a reasonable time.

Is there inefficiency in scanBam that could be improved? Or is it at the limit of efficiency? Is there another way to speed it up?

Thanks.

mtmorgan commented 2 years ago

This will not be fixed in the near term.

This type of question is better asked on the support site https://support.bioconductor.org/.

This is a known issue, and not easily resolved. This makes 10,000 calls to the underlying C code, each requiring that the index be loaded & parsed and the correct region matched. If you're looking for paired end reads, then each look-up is particularly expensive as paired reads might be outside the contig.

It is better to iterate through the bam file (I would use GenomicAlignments::readGAlignments() to get output that is easier to work with) and filtering each chunk for the regions of interest. For instance, this might be a starting point

library(GenomicAlignments)

fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
## > seqinfo(BamFile(fl))
## Seqinfo object with 2 sequences from an unspecified genome:
##   seqnames seqlengths isCircular genome
##   seq1           1575         NA   <NA>
##   seq2           1584         NA   <NA>

roi <- GRanges(c("seq1:200-299", "seq1:600-699", "seq2:350-449"))

bfl <- BamFile(fl, yieldSize = 100) # use a much larger yieldSize, e.g., 1e6
open(bfl)
found <- NULL
repeat {
    aln <- readGAlignments(bfl)
    if (length(aln) == 0L)
        break
    message(".")
    found <- append(found, subsetByOverlaps(aln, roi))
}
close(bfl)

length(found)
gevro commented 2 years ago

Thanks very much.