Closed: gevro closed this issue 2 years ago
This will not be fixed in the near term.
This type of question is better asked on the support site https://support.bioconductor.org/.
This is a known issue, and not easily resolved. Querying 10,000 regions with which makes 10,000 calls to the underlying C code, each requiring that the index be loaded and parsed and the correct region matched. If you're looking for paired-end reads, each look-up is particularly expensive, as the mate may lie outside the queried region.
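For reference, this is a minimal sketch of the slow pattern being described, using the small example BAM file shipped with Rsamtools (the region coordinates here are illustrative, not from the original report):

```r
library(Rsamtools)

## example BAM file distributed with Rsamtools
fl <- system.file("extdata", "ex1.bam", package = "Rsamtools")

## each range in 'which' triggers a separate index lookup in the C code;
## with 10,000 ranges that is 10,000 such calls
roi <- GRanges(c("seq1:200-299", "seq2:350-449"))
param <- ScanBamParam(which = roi)

res <- scanBam(fl, param = param)  # one list element per queried range
```

With two ranges this is instantaneous, but the cost scales with the number of ranges, not the amount of data, which is why iterating through the file once (below) wins for large region sets.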
It is better to iterate through the BAM file (I would use GenomicAlignments::readGAlignments() to get output that is easier to work with), filtering each chunk for the regions of interest. For instance, this might be a starting point:
library(GenomicAlignments)
fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
## > seqinfo(BamFile(fl))
## Seqinfo object with 2 sequences from an unspecified genome:
## seqnames seqlengths isCircular genome
## seq1 1575 NA <NA>
## seq2 1584 NA <NA>
roi <- GRanges(c("seq1:200-299", "seq1:600-699", "seq2:350-449"))
bfl <- BamFile(fl, yieldSize = 100) # use a much larger yieldSize, e.g., 1e6
open(bfl)
found <- NULL
repeat {
    aln <- readGAlignments(bfl)
    if (length(aln) == 0L)
        break
    message(".")
    found <- append(found, subsetByOverlaps(aln, roi))
}
close(bfl)
length(found)
Thanks very much.
Hello, I have a target set of 10,000 regions (10,000 different contigs, each with a different range), and scanBam() with a which of those regions is very slow. It basically never finishes in a reasonable time.
Is there inefficiency in scanBam that could be improved? Or is it at the limit of efficiency? Is there another way to speed it up?
Thanks.