Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

Cannot unset strict.strand in intersect #61

Closed kewiechecki closed 2 years ago

kewiechecki commented 2 years ago

When I attempt to find the intersect of a set of stranded ranges and a set of unstranded ranges, ranges with strand "*" do not match ranges with strand "+" or "-".

library(GenomicRanges)

a <- GRanges(
    paste0('chr',as.character(1:10)),
    IRanges(
        rep(1,10),
        rep(100,10)
    ),
    rep('+',10)
)

b <- GRanges(
    paste0('chr',as.character(c(1:10,1:10))),
    IRanges(
        c(rep(1,10),rep(90,10)),
        c(rep(10,10),rep(110,10))
    ),
    rep('*',20)
)

intersect(a,b)

gives the unexpected output

GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: 10 sequences from an unspecified genome; no seqlengths

When ignore.strand=T, it gives the expected output.

> intersect(a,b,ignore.strand=T)
GRanges object with 20 ranges and 0 metadata columns:
       seqnames    ranges strand
          <Rle> <IRanges>  <Rle>
   [1]     chr1      1-10      *
   [2]     chr1    90-100      *
   [3]     chr2      1-10      *
   [4]     chr2    90-100      *
   [5]     chr3      1-10      *
   ...      ...       ...    ...
  [16]     chr8    90-100      *
  [17]     chr9      1-10      *
  [18]     chr9    90-100      *
  [19]    chr10      1-10      *
  [20]    chr10    90-100      *
  -------
  seqinfo: 10 sequences from an unspecified genome; no seqlengths

This appears to be an issue with strict.strand defaulting to TRUE rather than FALSE (as indicated in the documentation for GenomicRanges::intersect), but when I attempt to unset it, the argument isn't recognized.

> intersect(a,b,strict.strand=F)
Error in .local(x, y, ...) : unused argument (strict.strand = FALSE)
> sessionInfo()
R version 4.0.4 (2021-02-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Pop!_OS 21.04

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.13.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2  IRanges_2.22.2
[4] S4Vectors_0.26.1     BiocGenerics_0.34.0  nvimcom_0.9-82

loaded via a namespace (and not attached):
[1] zlibbioc_1.34.0        compiler_4.0.4         XVector_0.28.0
[4] tools_4.0.4            GenomeInfoDbData_1.2.3 RCurl_1.98-1.5
[7] bitops_1.0-7
hpages commented 2 years ago

Hi,

According to ?GenomicRanges::intersect the strict.strand argument is supported by the pintersect() method for GRanges objects, not by the intersect() method.

Note that vector-wise set operations union(), intersect(), and setdiff() treat the strand strictly in the sense that +, -, * are considered to be 3 separate spaces. This means that ranges on + or - are not considered to intersect with ranges on *. This is consistent with reduce() which they use internally to turn the supplied GRanges objects x and y into "sets" of genomic positions (i.e. x and y are conceptually replaced with reduce(x) and reduce(y) before computing their union, intersection, or setdiff).

It would probably make sense to control this by adding the strict.strand argument to these methods. Feel free to submit a PR if you'd like to give this a shot.

In the meantime, a somewhat hacky solution is to "split" the ranges in y that are on *, that is, to replace each of them with 2 ranges, one on + and one on -. This would only be guaranteed to do the right thing if x doesn't also have ranges on *, I think.

Cheers, H.