Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

countBam with empty GRanges object should return zero counts #28

Closed smped closed 1 year ago

smped commented 3 years ago

Hi,

I have a scenario where I'm loading bed files and counting the alignments within the ranges provided by the bed files. These bed files may sometimes contain no ranges. However, when I pass the empty GRanges object to countBam() via scanBamParam() the entire bam file is counted. To my understanding, no explicit range should return zero. Passing a NULL object would justify counting the entire file, but I'm not convinced passing a GRanges object with no ranges should behave this way.

An example is below:

fl <- system.file("extdata", "ex1.bam", package="Rsamtools", mustWork=TRUE)                                           
gr <- GRanges()                                                            
countBam(fl, param = ScanBamParam(which = gr))

This returns the following which appears incorrect to me.

space start end width    file records nucleotides
1    NA    NA  NA    NA ex1.bam    3307      116551

I can work around this easily enough using an if statement, but it just appeared to be a logical inconsistency. Could the test from line 5 in https://github.com/Bioconductor/Rsamtools/blob/master/R/countBam.R instead test for a NULL instead of a zero length?

Thanks in advance,

Steve

vjcitn commented 3 years ago

@mtmorgan ^^

mtmorgan commented 3 years ago

I believe the behavior has been around 'forever' and could be quite disruptive if changed. I'll investigate over the next couple of weeks.

smped commented 3 years ago

Thanks Martin. Much appreciated. I suspect it may be pretty deeply embedded into multiple dependent packages. Thought it worth asking though.

Shians commented 1 year ago

I'd like to bump this issue and also extend the requested behaviour to other functions like scanTabix(). The use case is that I have to pre-filter chromosomes not in the tabix index, and sometimes this results in no regions remaining which triggers a full read of the data (often larger than memory). It certainly wasn't the expected behaviour and isn't mentioned in documentations, perhaps some extra documentation or a new flag to disable the full parsing behaviour is required.

mtmorgan commented 1 year ago

I'm not likely to implement this. Happy to take a pull request updating documentation or, more ambitiously, code.

Shians commented 1 year ago

Pull #54 updates the documentation. I'm still keen to try and modify behaviour when I have time. I think something like a new parameter onEmpty = c("all", "none") for countBam() and scanTabix() would be nice. From a very brief deciphering of the code, I believe both accepted param types GRanges and IntegerRangesList have a length() value that can be checked for 0 to return either some 0 length object or 0.