Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
41 stars 17 forks source link

subtract documentation #83

Open gevro opened 6 months ago

gevro commented 6 months ago

Hi, subtract says that metadata columns of x are propagated, but that doesn't seem to be happening: "The names and metadata columns on x are propagated to the result."

> gr2 <- GRanges("a:1-10",score=10)
> gr1 <- GRanges("a:2-9")
> subtract(gr2,gr1)
GRangesList object of length 1:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a         1      *
  [2]        a        10      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

I think either the documentation is wrong, or subtract code is wrong.

Thanks

hpages commented 5 months ago

You don't see them but the metadata columns are here:

library(GenomicRanges)

gr1 <- GRanges(c("a:1-10", "a:8-20"), score=10:11)
gr2 <- GRanges("a:2-9")
result <- subtract(gr1, gr2)

mcols(result)
# DataFrame with 2 rows and 1 column
#      score
#   <integer>
# 1        10
# 2        11

As explained in the man page (?subtract) the result is a GRangesList object with the metadata columns of the first argument (gr1) on it. The gotcha here is that we do NOT display the top-level metadata columns of a GRangesList object (there's no easy way to do that), only its inner metadata columns:

grl <- GRangesList(
    GENE1=GRanges("chr1", IRanges(11:15, 20), tx_id=1:5),
    GENE2=GRanges("chr2", IRanges(4:1, 10, tx_id=6:9))
)
mcols(grl)$gene_alias <- c("foo", "bar")

grl
# GRangesList object of length 2:
# $GENE1
# GRanges object with 5 ranges and 1 metadata column:
#       seqnames    ranges strand |     tx_id
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chr1     11-20      * |         1
#   [2]     chr1     12-20      * |         2
#   [3]     chr1     13-20      * |         3
#   [4]     chr1     14-20      * |         4
#   [5]     chr1     15-20      * |         5
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# 
# $GENE2
# GRanges object with 4 ranges and 1 metadata column:
#       seqnames    ranges strand |     tx_id
#          <Rle> <IRanges>  <Rle> | <integer>
#   [1]     chr2      4-10      * |         6
#   [2]     chr2      3-10      * |         7
#   [3]     chr2      2-10      * |         8
#   [4]     chr2      1-10      * |         9
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

mcols(grl)
# DataFrame with 2 rows and 1 column
#        gene_alias
#       <character>
# GENE1         foo
# GENE2         bar

Does this help?

gevro commented 5 months ago

Makes sense. But then why is the output of subtract of two Geanges a GRangesList instead of GRanges?

hpages commented 5 months ago

But then why is the output of subtract of two Geanges a GRangesList instead of GRanges?

Because each range in the 1st GRanges object can produce several fragments after the subtraction so we use a GRangesList representation to group the fragments that are coming from the same original range together. Like here where the subtraction breaks the first range in x in 2 fragments:

x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500"))
y <- GRanges(c("chr1:21-25", "chr1:38-150"))
z <- subtract(x, y)
z
# GRangesList object of length 3:
# $A
# GRanges object with 2 ranges and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1      1-20      *
#   [2]     chr1     26-37      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# 
# $B
# GRanges object with 0 ranges and 0 metadata columns:
#    seqnames    ranges strand
#       <Rle> <IRanges>  <Rle>
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths
# 
# $C
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chrX     1-500      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

By doing this, the operation preserves positionality i.e. the i-th element in the output corresponds to the i-th element in the input (i.e. in the first GRanges object). Another way to describe this is to say that the output is parallel to the input, which is a nice property.

If you don't like or don't need the grouping then you can unlist() the result as shown in the examples from the man page:

unlist(z)
# GRanges object with 3 ranges and 0 metadata columns:
#     seqnames    ranges strand
#        <Rle> <IRanges>  <Rle>
#   A     chr1      1-20      *
#   A     chr1     26-37      *
#   C     chrX     1-500      *
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

But then you loose positionality and the top-level metadata (if the GRangesList object had any).

gevro commented 5 months ago

Understood. Thanks very much for explaining. The only issue is what if someone wanted a GRanges output while also preserving the metadata columns.

In a prior github issue you had proposed the below code, which does the above, but I was wondering whether the subtract function has a way to do this? Based on what you are saying, it seems not?

GRanges_subtract <- function(gr1, gr2, ignore.strand=FALSE){
    gr2 <- reduce(gr2, ignore.strand=ignore.strand)
    hits <- findOverlaps(gr1, gr2, ignore.strand=ignore.strand)
    ans <- psetdiff(gr1, extractList(gr2, as(hits, "IntegerList")))
    unlisted_ans <- unlist(ans, use.names=FALSE)
    mcols(unlisted_ans) <- extractROWS(mcols(gr1),Rle(seq_along(ans), lengths(ans)))
    unlist(setNames(relist(unlisted_ans, ans), names(gr1)))
}
hpages commented 5 months ago

In that case you need to manually propagate the top-level metadata columns of the GRangesList object to its unlisted form. This can be done with:

x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500"), score=11:13)
y <- GRanges(c("chr1:21-25", "chr1:38-150"))
z <- subtract(x, y)

z2 <- unlist(z)  
mcols(z2) <- mcols(z)[rep.int(seq_along(z), lengths(z)), , drop=FALSE]
z2
# GRanges object with 3 ranges and 1 metadata column:
#     seqnames    ranges strand |     score
#        <Rle> <IRanges>  <Rle> | <integer>
#   A     chr1      1-20      * |        11
#   A     chr1     26-37      * |        11
#   C     chrX     1-500      * |        13
#   -------
#   seqinfo: 2 sequences from an unspecified genome; no seqlengths

This specialized version of unlist() could be factored in a dedicated helper function e.g.:

unlist_with_mcols <- function(x)
{
    stopifnot(is(x, "List"))
    ans <- unlist(x)
    mcols(ans) <- mcols(x)[rep.int(seq_along(x), lengths(x)), , drop=FALSE]
    ans
}

Using this, you'll get what you are after by doing just unlist_with_mcols(subtract(...)).

I should probably bake this functionality into subtract() itself by adding an unlist argument to it (would be FALSE by default to preserve the current behavior).

gevro commented 5 months ago

Yes, I think it would be helpful to have a parameter for subtract that allows to either unlist or not.