Bioconductor / RaggedExperiment

Matrix-like representations of mutation and CN data
https://bioconductor.org/packages/RaggedExperiment
4 stars 3 forks source link

qreduceAssay: inconsistency in simplifyReduce #20

Closed lgeistlinger closed 5 years ago

lgeistlinger commented 6 years ago

I think there is a small bug in qreduceAssay concerning the simplifyReduce argument.

Following the corresponding Arguments section in ?qreduceAssay, the simplifyReduce argument is a function / functional accepting arguments score, range, and qrange.

Where qrange

is a GRanges instance with the same length as score, providing the query range window to which the corresponding scores apply

However, when following through the code of qreduceAssay:

> qreduceAssay
function (x, query, simplifyReduce, i = 1, withDimnames = TRUE,
    background = NA_integer_)
{ 
  1     if (missing(i) && ncol(.mcols(x)) == 0)
  2         return(matrix(NA, 0, 0))
  3     stopifnot_simplify_ok(simplifyReduce, 3L)
  4     i <- .assay_i(x, i)
  5     mcol <- .mcols(x)[[i]][.rowidx(x)]
  6     dim <- .dim(x)
  7     subject <- unname(rowRanges(x))
  8     query <- granges(query)
  9     olap <- findOverlaps(query, subject)
 10     sidx <- subjectHits(olap)
 11     row <- queryHits(olap)
 12     col <- rep(seq_len(dim[[2]]), lengths(.assays(x)))[.rowidx(x)][sidx]
 13     score <- mcol[sidx]
 14     subject <- subject[sidx]
 15     qranges <- query[row]
 16     ranges <- restrict(subject, start = pmax(start(qranges),
 17         start(subject)), end = pmin(end(qranges), end(subject)))
 18     group <- (row - 1L) * max(col, 0) + col
 19     group <- match(group, unique(group))
 20     result <- simplifyReduce(unname(splitAsList(score, group)),
 21         unname(splitAsList(ranges, group)), unname(qranges))
 22     group <- !duplicated(group)
 23     na <- as(background, class(result))
 24     dimnames <- list(NULL, NULL)
 25     if (withDimnames)
 26         dimnames <- list(as.character(query), .dimnames(x)[[2]])
 27     m <- matrix(na, nrow = length(query), ncol = dim[[2]], dimnames = dimnames)
 28     idx <- cbind(row = row[group], col = col[group])
 29     m[idx] <- result
 30     m[, .colidx(x), drop = FALSE]
}

using the example RaggedExperiment from ?qreduceAssay

x <- RaggedExperiment(GRangesList(
         GRanges(c(A = "chr1:1-10:-", B = "chr1:8-14:-", C = "chr2:15-18:+"),
             score = 3:5),
         GRanges(c(D = "chr1:1-10:-", E = "chr2:11-18:+"), score = 1:2)
     ), colData = DataFrame(id = 1:2))

query <- GRanges(c("chr1:1-14:-", "chr2:11-18:+"))

and plug that into lines 1-19 of the above code of qreduceAssay, you end up (in line 20) with the three arguments of simplifyReduce be set to:

> score <- unname(splitAsList(score, group))
> score
IntegerList of length 4
[[1]] 3 4
[[2]] 1
[[3]] 5
[[4]] 2

> range <- unname(splitAsList(ranges, group))
> range
GRangesList object of length 4:
[[1]]
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-10      -
  [2]     chr1      8-14      -

[[2]]
GRanges object with 1 range and 0 metadata columns:
      seqnames ranges strand
  [1]     chr1   1-10      -

[[3]]
GRanges object with 1 range and 0 metadata columns:
      seqnames ranges strand
  [1]     chr2  15-18      +

...
<1 more element>
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths

> qrange <- unname(qranges)
> qrange
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-14      -
  [2]     chr1      1-14      -
  [3]     chr1      1-14      -
  [4]     chr2     11-18      +
  [5]     chr2     11-18      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

and thus with

> length(score) == length(qrange)
[1] FALSE

which is inconsistent with the description above

qrange is a GRanges instance with the same length as score

which is the actual desired behavior.

Closer inspecting this case, you find that by collecting all queryHits in line 11 and 15 above, qranges contains also duplicated query hits within a group of subject ranges belonging to the same query range.

These groups are defined in group, line 19:

> group
[1] 1 1 2 3 4

A fix could thus look like this:

> qreduceAssay
function (x, query, simplifyReduce, i = 1, withDimnames = TRUE,
    background = NA_integer_)
{
  1     if (missing(i) && ncol(.mcols(x)) == 0)
  2         return(matrix(NA, 0, 0))
  3     stopifnot_simplify_ok(simplifyReduce, 3L)
  4     i <- .assay_i(x, i)
  5     mcol <- .mcols(x)[[i]][.rowidx(x)]
  6     dim <- .dim(x)
  7     subject <- unname(rowRanges(x))
  8     query <- granges(query)
  9     olap <- findOverlaps(query, subject)
 10     sidx <- subjectHits(olap)
 11     row <- queryHits(olap)
 12     col <- rep(seq_len(dim[[2]]), lengths(.assays(x)))[.rowidx(x)][sidx]
 13     score <- mcol[sidx]
 14     subject <- subject[sidx]
 15     qranges <- query[row]
 16     ranges <- restrict(subject, start = pmax(start(qranges),
 17         start(subject)), end = pmin(end(qranges), end(subject)))
 18     group <- (row - 1L) * max(col, 0) + col
 19     group <- match(group, unique(group))

## fix ##
        ugroup <- !duplicated(group)
        result <- simplifyReduce(unname(splitAsList(score, group)),
          unname(splitAsList(ranges, group)), unname(qranges)[ugroup])
        group <- ugroup
##

 23     na <- as(background, class(result))
 24     dimnames <- list(NULL, NULL)
 25     if (withDimnames)
 26         dimnames <- list(as.character(query), .dimnames(x)[[2]])
 27     m <- matrix(na, nrow = length(query), ncol = dim[[2]], dimnames = dimnames)
 28     idx <- cbind(row = row[group], col = col[group])
 29     m[idx] <- result
 30     m[, .colidx(x), drop = FALSE]
}

I can open a PR if needed.

LiNk-NY commented 5 years ago

Hi Ludwig, @lgeistlinger

If you do debugonce(qreduceAssay) and then qreduceAssay(x, query, weightedmean), you'll see that they are all the same length:

Browse[2]> group
[1] 1 1 2 3 4
Browse[2]> ranges
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-10      -
  [2]     chr1      8-14      -
  [3]     chr1      1-10      -
  [4]     chr2     15-18      +
  [5]     chr2     11-18      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
Browse[2]> score
[1] 3 4 1 5 2
Browse[2]> qranges
GRanges object with 5 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-14      -
  [2]     chr1      1-14      -
  [3]     chr1      1-14      -
  [4]     chr2     11-18      +
  [5]     chr2     11-18      +
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths

Perhaps it was fixed after the issue was filed and we didn't close the issue?

Thanks!

-MR

lgeistlinger commented 5 years ago

I think the problem still exists. It is true that group, ranges, score, and qranges have all the same length in the body of qreduceAssay.

However, I'm talking here about the arguments of simplifyReduce in line 20/21 of the above definition of qreduceAssay, where

score [1st arg of simplifyReduce] takes unname(splitAsList(score, group)) range [2nd arg of simplifyReduce] takes unname(splitAsList(ranges, group)) qrange [3rd arg of simplifyReduce] takes unname(qranges))

If you accordingly evalutate the right hand sides during debugonce(qreduceAssay), you note the described length inconsistency of the args of simplifyReduce - in particular between args score (length 4) and qrange (length 5).

LiNk-NY commented 5 years ago

I suppose it should say:

with the same length as unlist(score)

because we use some tricks to group score values and calculate a mean (for example)..