Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

Using both "yileldSize" and "which" parameters #38

Closed gacatag closed 2 years ago

gacatag commented 2 years ago

Dear developers, Hello.

I was wondering if the "yileldSize" in BamFile() is ignored when importing reads using readGAlignmentsList(), if also "which" parameter is defined in scanBamFlag(). In other words if we define "which" in scanBamFlag() function, readGAlignmentsList() will just import all the reads that have been mapped to the Genomic coordinates defined in which. Is there anyway to run readGAlignmentsList() so that it would import reads mapping to the coordinates defined using "which" parameter of scanBamFlag(), and load at maximum only 100,000 of these reads (i.e. considering the yieldSize=100000 param setting in BamFile()). So would this sequence of codes take the yieldSize=100000 into account and load only 100000 paired reads ?

bf<- Rsamtools::BamFile(bamFile, yieldSize=100000, 
  asMates=TRUE)

scParam=Rsamtools::ScanBamParam(
  what=Rsamtools::scanBamWhat()[c(1,
    3,5,8,13,9, 10, 6, 4, 14, 15)], 
  flag=Rsamtools::scanBamFlag(isPaired=TRUE,
    isDuplicate=NA))

yld<- GenomicAlignments::readGAlignmentPairs(bf, 
  param=scParam)

Cheers,

Ali

mtmorgan commented 2 years ago

The help page ?BamFile says

  yieldSize: Number of records to yield each time the file is read
      from using 'scanBam' or, when 'length(bamWhich()) != 0', a
      threshold which yields records in complete ranges whose sum
      first exceeds 'yieldSize'.

so it will read in 100000 reads + the number to 'complete' the range. Remember to open() the bam file. Here is an example (though without paired end reads)

> fl <- system.file("extdata", "ex1.bam", package="Rsamtools")
> bf = BamFile(fl, yieldSize = 3)
> param = ScanBamParam(which = GRanges(c("seq1:5", "seq2:1524")))
> open(bf)
> readGAlignments(bf, param = param) |> length()
[1] 3
> readGAlignments(bf, param = param) |> length()
[1] 43
> readGAlignments(bf, param = param) |> length()
[1] 0
gacatag commented 2 years ago

Thank you very much for the details. Is there any way to set the parameters so that it would read, at max, 3 reads only from the ranges defined by "which", every time the readGAlignments() is called ?

mtmorgan commented 2 years ago

No.

It is better to ask questions like this on the support site https://suppport.bioconductor.org