Open gevro opened 6 months ago
You don't see them but the metadata columns are here:
library(GenomicRanges)
gr1 <- GRanges(c("a:1-10", "a:8-20"), score=10:11)
gr2 <- GRanges("a:2-9")
result <- subtract(gr1, gr2)
mcols(result)
# DataFrame with 2 rows and 1 column
# score
# <integer>
# 1 10
# 2 11
As explained in the man page (?subtract
) the result is a GRangesList object with the metadata columns of the first argument (gr1
) on it. The gotcha here is that we do NOT display the top-level metadata columns of a GRangesList object (there's no easy way to do that), only its inner metadata columns:
grl <- GRangesList(
GENE1=GRanges("chr1", IRanges(11:15, 20), tx_id=1:5),
GENE2=GRanges("chr2", IRanges(4:1, 10, tx_id=6:9))
)
mcols(grl)$gene_alias <- c("foo", "bar")
grl
# GRangesList object of length 2:
# $GENE1
# GRanges object with 5 ranges and 1 metadata column:
# seqnames ranges strand | tx_id
# <Rle> <IRanges> <Rle> | <integer>
# [1] chr1 11-20 * | 1
# [2] chr1 12-20 * | 2
# [3] chr1 13-20 * | 3
# [4] chr1 14-20 * | 4
# [5] chr1 15-20 * | 5
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
#
# $GENE2
# GRanges object with 4 ranges and 1 metadata column:
# seqnames ranges strand | tx_id
# <Rle> <IRanges> <Rle> | <integer>
# [1] chr2 4-10 * | 6
# [2] chr2 3-10 * | 7
# [3] chr2 2-10 * | 8
# [4] chr2 1-10 * | 9
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
mcols(grl)
# DataFrame with 2 rows and 1 column
# gene_alias
# <character>
# GENE1 foo
# GENE2 bar
Does this help?
Makes sense. But then why is the output of subtract of two Geanges a GRangesList instead of GRanges?
But then why is the output of subtract of two Geanges a GRangesList instead of GRanges?
Because each range in the 1st GRanges object can produce several fragments after the subtraction so we use a GRangesList representation to group the fragments that are coming from the same original range together. Like here where the subtraction breaks the first range in x
in 2 fragments:
x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500"))
y <- GRanges(c("chr1:21-25", "chr1:38-150"))
z <- subtract(x, y)
z
# GRangesList object of length 3:
# $A
# GRanges object with 2 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-20 *
# [2] chr1 26-37 *
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
#
# $B
# GRanges object with 0 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
#
# $C
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chrX 1-500 *
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
By doing this, the operation preserves positionality i.e. the i-th element in the output corresponds to the i-th element in the input (i.e. in the first GRanges object). Another way to describe this is to say that the output is parallel to the input, which is a nice property.
If you don't like or don't need the grouping then you can unlist()
the result as shown in the examples from the man page:
unlist(z)
# GRanges object with 3 ranges and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# A chr1 1-20 *
# A chr1 26-37 *
# C chrX 1-500 *
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
But then you loose positionality and the top-level metadata (if the GRangesList object had any).
Understood. Thanks very much for explaining. The only issue is what if someone wanted a GRanges output while also preserving the metadata columns.
In a prior github issue you had proposed the below code, which does the above, but I was wondering whether the subtract function has a way to do this? Based on what you are saying, it seems not?
GRanges_subtract <- function(gr1, gr2, ignore.strand=FALSE){
gr2 <- reduce(gr2, ignore.strand=ignore.strand)
hits <- findOverlaps(gr1, gr2, ignore.strand=ignore.strand)
ans <- psetdiff(gr1, extractList(gr2, as(hits, "IntegerList")))
unlisted_ans <- unlist(ans, use.names=FALSE)
mcols(unlisted_ans) <- extractROWS(mcols(gr1),Rle(seq_along(ans), lengths(ans)))
unlist(setNames(relist(unlisted_ans, ans), names(gr1)))
}
In that case you need to manually propagate the top-level metadata columns of the GRangesList object to its unlisted form. This can be done with:
x <- GRanges(c(A="chr1:1-50", B="chr1:40-110", C="chrX:1-500"), score=11:13)
y <- GRanges(c("chr1:21-25", "chr1:38-150"))
z <- subtract(x, y)
z2 <- unlist(z)
mcols(z2) <- mcols(z)[rep.int(seq_along(z), lengths(z)), , drop=FALSE]
z2
# GRanges object with 3 ranges and 1 metadata column:
# seqnames ranges strand | score
# <Rle> <IRanges> <Rle> | <integer>
# A chr1 1-20 * | 11
# A chr1 26-37 * | 11
# C chrX 1-500 * | 13
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
This specialized version of unlist()
could be factored in a dedicated helper function e.g.:
unlist_with_mcols <- function(x)
{
stopifnot(is(x, "List"))
ans <- unlist(x)
mcols(ans) <- mcols(x)[rep.int(seq_along(x), lengths(x)), , drop=FALSE]
ans
}
Using this, you'll get what you are after by doing just unlist_with_mcols(subtract(...))
.
I should probably bake this functionality into subtract()
itself by adding an unlist
argument to it (would be FALSE
by default to preserve the current behavior).
Yes, I think it would be helpful to have a parameter for subtract that allows to either unlist or not.
Hi, subtract says that metadata columns of x are propagated, but that doesn't seem to be happening: "The names and metadata columns on x are propagated to the result."
I think either the documentation is wrong, or subtract code is wrong.
Thanks