Bioconductor / GenomicRanges

Representation and manipulation of genomic intervals
https://bioconductor.org/packages/GenomicRanges
43 stars 18 forks source link

Equisplit #3

Closed alexandremkuhn closed 6 years ago

alexandremkuhn commented 6 years ago

Following a previous pull request (https://github.com/lawremi/VariantTools/pull/1), Michael Lawrence suggested to add an equisplit function (named epGRanges in the previous PR) to GenomicRanges (instead of VariantTools).

Proposed changes include an equisplit generic in S4Vectors, an equisplit function (for GenomicRanges objects) in GenomicRanges and minor changes in VariantTools to accommodate param@bamTallyParam@which as GRangesList (in addition to GenomicRanges).

hpages commented 6 years ago

Hi @alexandremkuhn,

Thanks for the pull request.

I would just go for something like this:

equisplit2 <- function(x, nchunk, chunksize)
{
    q <- breakInChunks(sum(width(x)), chunksize, nchunk)
    s <- PartitioningByWidth(width(x))
    hits <- findOverlaps(q, s)
    unlisted_ans <- x[subjectHits(hits)]
    start(unlisted_ans) <- start(unlisted_ans) +
        pmax(start(q)[queryHits(hits)] - start(s)[subjectHits(hits)], 0L)
    end(unlisted_ans) <- end(unlisted_ans) -
        pmax(end(s)[subjectHits(hits)] - end(q)[queryHits(hits)], 0L)
    relist(unlisted_ans, as(hits, "PartitioningByEnd"))
}

It's simpler, faster, and no need to make it a generic (works on any range-based object that supports width(), [, start<-(), end<-(), and relist(), e.g. IRanges and GRanges objects). It also does a better job when nchunk is > sum(width(x)) (I think equisplit() should always return a GRangesList of the length specified via nchunk even when nchunk > sum(width(x)), and with names. Note that it doesn't fit well in the group of "inter range transformations" because it's not an endomorphism so I would just put it in the same file and same man page as breakInChunks() in IRanges (i.e. in R/IRanges-utils.R and man/IRanges-utils.Rd). Finally, adding unit tests would be great (in inst/unitTests/test_IRanges-utils.R, doesn't exist yet).

Thanks, H.

lawremi commented 6 years ago

I like that equisplit2(). Could we express it more clearly using pintersect() and shift()?

hpages commented 6 years ago

Using narrow() instead of start<- and end<- to trim unlisted_ans on both sides with a single call:

equisplit3 <- function(x, nchunk, chunksize)
{
    q <- breakInChunks(sum(width(x)), chunksize, nchunk)
    s <- PartitioningByWidth(width(x))
    hits <- findOverlaps(q, s)
    unlisted_ans <- x[subjectHits(hits)]
    Ltrim <- pmax(start(q)[queryHits(hits)] - start(s)[subjectHits(hits)], 0L)
    Rtrim <- pmax(end(s)[subjectHits(hits)] - end(q)[queryHits(hits)], 0L)
    unlisted_ans <- narrow(unlisted_ans, start=1L+Ltrim, end=-1L-Rtrim)
    relist(unlisted_ans, as(hits, "PartitioningByEnd"))
}

With this change the requirement now is that x only needs to support width(), [, narrow(), and relist(). So equisplit3() also works on GAlignments and DNAStringSet objects:

library(Biostrings)
equisplit3(DNAStringSet(c("AAATTT", "CCGGGGGGGG")), 4)
# DNAStringSetList of length 4
# [[1]] AAAT
# [[2]] TT CC
# [[3]] GGGG
# [[4]] GGGG
hpages commented 6 years ago

An alternate way to compute Ltrim and Rtrim:

ovranges <- overlapsRanges(q, s, hits)
Ltrim <- start(ovranges) - start(s)[subjectHits(hits)]
Rtrim <- end(s)[subjectHits(hits)] - end(ovranges)

Does this express things more clearly @lawremi ? Would it help if overlapsRanges() was named rangesOfOverlaps() instead?

lawremi commented 6 years ago

rangesOfOverlaps() sounds OK to me. Similar to its name when it was ranges,Hits().

alexandremkuhn commented 6 years ago

Hi Hervé @hpages and Michael @lawremi,

Thanks for looking at equisplit. Your generalized equisplit3 seems great. What is the next step, would you add it to IRanges?

Alexandre

hpages commented 6 years ago

Yep I can take care of it. Will let you know when it's done.

hpages commented 6 years ago

Hi @alexandremkuhn, @lawremi, This is now in IRanges 2.13.16: https://github.com/Bioconductor/IRanges Cheers, H.

alexandremkuhn commented 6 years ago

Great, thanks! I'll close the PR then.