Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

coverage() with small weight values returns non-zero values over empty positions #39

Closed mdeber closed 4 years ago

mdeber commented 4 years ago

I've noticed that when using coverage() over IRanges/GRanges objects using the weight argument, very small but non-zero values can be returned for empty positions.

> gr <- GRanges("chr1", IRanges(1:3, 2:4))
> seqinfo(gr) <- Seqinfo(seqnames = "chr1", seqlengths = 1e3)
> gr
GRanges object with 3 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1       1-2      *
  [2]     chr1       2-3      *
  [3]     chr1       3-4      *
  -------
  seqinfo: 1 sequence from an unspecified genome
> coverage(gr)$chr1
integer-Rle of length 1000 with 4 runs
  Lengths:   1   2   1 996
  Values :   1   2   1   0
> coverage(gr, weight = c(0.01, 0.02, 0.03))$chr1
numeric-Rle of length 1000 with 5 runs
  Lengths:                     1                     1                     1                     1                   996
  Values :                  0.01                  0.03                  0.05                  0.03 -3.46944695195361e-18

Thanks

> sessionInfo()
R version 4.0.0 (2020-04-24)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GenomicRanges_1.41.1 GenomeInfoDb_1.25.0  IRanges_2.23.4       S4Vectors_0.27.5    
[5] BiocGenerics_0.35.2 

loaded via a namespace (and not attached):
 [1] rstudioapi_0.11        XVector_0.29.0         zlibbioc_1.35.0        R6_2.4.1              
 [5] fansi_0.4.1            tools_4.0.0            pkgbuild_1.0.8         cli_2.0.2             
 [9] withr_2.2.0            remotes_2.1.1          assertthat_0.2.1       rprojroot_1.3-2       
[13] crayon_1.3.4           processx_3.4.2         GenomeInfoDbData_1.2.3 callr_3.4.3           
[17] bitops_1.0-6           ps_1.3.3               RCurl_1.98-1.2         curl_4.3              
[21] glue_1.4.1             compiler_4.0.0         backports_1.1.7        prettyunits_1.1.1   

[Edit: not unique to GRanges objects]

hpages commented 4 years ago

Hi,

Unfortunately this is an artefact of floating point arithmetic and the algorithm used to compute the weighted coverage. See this old thread on our support site for a more detailed explanation and how to work around it.

Best, H.

mdeber commented 4 years ago

Ah, I did not find this, my apologies.

So I had already implemented a workaround like the one you described, but I have concerns about the potential for critical errors in this particular case (my own was really bad and easy to miss). I had assumed that the categorical lack of input coverage would necessitate zero weighted coverage, and unfortunately this assumption held in my unit tests.

With that in mind, I think a precision argument could be a really valuable feature addition, not least of which for the performance improvements you alluded to. Otherwise, I think it's worth considering if a note should be added to the documentation to alert users of precision artifacts within no-coverage regions.

Anyway, thanks for you help. All the best, Mike

hpages commented 4 years ago

Thanks Mike. These are good suggestions.

Hard to find anything on our support site so no worries. The search feature is basically broken. This is a known issue and it's been a problem for years. The good news is that the software running the support site will receive a major update soon (this summer) and I believe this will also include an improved search feature. Fingers crossed.

hpages commented 4 years ago

I added some clarifications about this in IRanges 2.23.7 (commit https://github.com/Bioconductor/IRanges/commit/c4653d6151215759aace4be26fa845a1bb41d129).

Cheers, H.

hpages commented 4 years ago

@mdeber After giving this more thoughts I decided to add a naive coverage() implementation (via method="naive") that is immune to floating point artefacts in the no-coverage regions. This is in IRanges 2.23.8 and GenomicRanges 1.41.4. The trade-off is between accuracy and speed. See ?coverage in IRanges for more information.