Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools

Loading bam file in chunks #42

gevro closed this issue 2 years ago

gevro commented 2 years ago

Is there a method for loading a specific, reproducible chunk of a BAM/CRAM file? This would be useful for very large BAM/CRAM files, both to avoid loading the whole file into memory and to let separate processes each load a specific chunk, numbered 1 ... X, as the user defines.

mtmorgan commented 2 years ago

This type of question is better asked on the support site https://support.bioconductor.org.

Use the which= argument to ScanBamParam() to restrict input to specific regions. Using the example file from ?scanBam, we might do

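library(Rsamtools)   # also attaches GenomicRanges, which provides GRanges()
fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)
## count all records in the file
countBam(fl)
##   space start end width    file records nucleotides
## 1    NA    NA  NA    NA ex1.bam    3307      116551
## count only records overlapping seq1:1-1000 (requires the BAM index)
countBam(fl, param = ScanBamParam(which = GRanges("seq1:1-1000")))
##   space start  end width    file records nucleotides
## 1  seq1     1 1000  1000 ex1.bam     924       32529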
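
To make region-based chunks reproducible across separate processes, one option is to derive a fixed set of tiles from the sequence lengths in the BAM header and let chunk i be the i-th tile. The following is only a sketch: the tile width of 500 and the names tiles, i, and chunk_count are illustrative choices, not anything prescribed by Rsamtools. Reads that span a tile boundary are reported in both neighbouring chunks, and unmapped reads without a position are not returned by region queries.

```r
library(Rsamtools)
library(GenomicRanges)

fl <- system.file("extdata", "ex1.bam", package = "Rsamtools", mustWork = TRUE)
bf <- BamFile(fl)

## Split the reference sequences into tiles of at most 500 bp.
## The width is arbitrary and chosen only because ex1.bam is small.
tiles <- tileGenome(seqlengths(bf), tilewidth = 500,
                    cut.last.tile.in.chrom = TRUE)

## Chunk i is the i-th tile, so the same i selects the same region
## in every process that builds 'tiles' the same way.
i <- 2
chunk_count <- countBam(bf, param = ScanBamParam(which = tiles[i]))
chunk_count
```

The same param could be passed to scanBam() to load the records of that chunk rather than only counting them.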

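To iterate through a BAM file in fixed-size chunks, see ?BamFile and this part of its example

     ## Use 'yieldSize' to iterate through a file in chunks.
     bf <- open(BamFile(fl, yieldSize=1000))
     while (nrec <- length(scanBam(bf)[[1]][[1]]))
         cat("records:", nrec, "\n")
     close(bf)

Each call to scanBam() on the open file returns the next (up to) 1000 records; the loop stops when no records remain.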

?GenomicAlignments::readGAlignments is probably more 'user friendly'; called on an open BamFile with a yieldSize, it iterates through the file in chunks in the same way.
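
As a minimal sketch of that pattern, assuming the same example file and an arbitrary yieldSize of 1000 (the names chunk and total are illustrative only):

```r
library(Rsamtools)
library(GenomicAlignments)

fl <- system.file("extdata", "ex1.bam", package = "Rsamtools", mustWork = TRUE)

## Open the file once so that successive reads continue where the last one stopped.
bf <- open(BamFile(fl, yieldSize = 1000))
total <- 0L
repeat {
    chunk <- readGAlignments(bf)      # next chunk, as a GAlignments object
    if (length(chunk) == 0L) break    # nothing left to read
    total <- total + length(chunk)
    cat("alignments in chunk:", length(chunk), "\n")
}
close(bf)
total
```

Only one chunk is held in memory at a time, so any per-chunk summary can be accumulated inside the loop.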

?GenomicFiles::reduceByYield, reduceByRange, and reduceRanges might also be relevant.
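
For instance, reduceByYield() can wrap the chunk-wise loop into a yield / map / reduce pattern. This is a hedged sketch rather than a recommended recipe: yield_chunk and count_chunk are hypothetical helper names, and counting alignments is just a stand-in for whatever per-chunk summary is needed.

```r
library(Rsamtools)
library(GenomicAlignments)
library(GenomicFiles)

fl <- system.file("extdata", "ex1.bam", package = "Rsamtools", mustWork = TRUE)
bf <- BamFile(fl, yieldSize = 1000)

## YIELD: read the next chunk from the file.
yield_chunk <- function(x, ...) readGAlignments(x)
## MAP: summarise one chunk (here, just its number of alignments).
count_chunk <- function(value, ...) length(value)

## REDUCE: combine the per-chunk summaries into a single result.
total <- reduceByYield(bf, YIELD = yield_chunk, MAP = count_chunk, REDUCE = `+`)
total
```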

gevro commented 2 years ago

Thank you!