Closed alexandremkuhn closed 6 years ago
Hi @alexandremkuhn,
Thanks for the pull request.
I would just go for something like this:
equisplit2 <- function(x, nchunk, chunksize)
{
q <- breakInChunks(sum(width(x)), chunksize, nchunk)
s <- PartitioningByWidth(width(x))
hits <- findOverlaps(q, s)
unlisted_ans <- x[subjectHits(hits)]
start(unlisted_ans) <- start(unlisted_ans) +
pmax(start(q)[queryHits(hits)] - start(s)[subjectHits(hits)], 0L)
end(unlisted_ans) <- end(unlisted_ans) -
pmax(end(s)[subjectHits(hits)] - end(q)[queryHits(hits)], 0L)
relist(unlisted_ans, as(hits, "PartitioningByEnd"))
}
It's simpler, faster, and no need to make it a generic (works on any range-based object that supports width()
, [
, start<-()
, end<-()
, and relist()
, e.g. IRanges and GRanges objects). It also does a better job when nchunk
is > sum(width(x)
) (I think equisplit()
should always return a GRangesList of the length specified via nchunk
even when nchunk > sum(width(x)
), and with names. Note that it doesn't fit well in the group of "inter range transformations" because it's not an endomorphism so I would just put it in the same file and same man page as breakInChunks()
in IRanges (i.e. in R/IRanges-utils.R
and man/IRanges-utils.Rd
). Finally, adding unit tests would be great (in inst/unitTests/test_IRanges-utils.R
, doesn't exist yet).
Thanks, H.
I like that equisplit2()
. Could we express it more clearly using pintersect()
and shift()
?
Using narrow()
instead of start<-
and end<-
to trim unlisted_ans
on both sides with a single call:
equisplit3 <- function(x, nchunk, chunksize)
{
q <- breakInChunks(sum(width(x)), chunksize, nchunk)
s <- PartitioningByWidth(width(x))
hits <- findOverlaps(q, s)
unlisted_ans <- x[subjectHits(hits)]
Ltrim <- pmax(start(q)[queryHits(hits)] - start(s)[subjectHits(hits)], 0L)
Rtrim <- pmax(end(s)[subjectHits(hits)] - end(q)[queryHits(hits)], 0L)
unlisted_ans <- narrow(unlisted_ans, start=1L+Ltrim, end=-1L-Rtrim)
relist(unlisted_ans, as(hits, "PartitioningByEnd"))
}
With this change the requirement now is that x
only needs to support width()
, [
, narrow()
, and relist()
. So equisplit3()
also works on GAlignments and DNAStringSet objects:
library(Biostrings)
equisplit3(DNAStringSet(c("AAATTT", "CCGGGGGGGG")), 4)
# DNAStringSetList of length 4
# [[1]] AAAT
# [[2]] TT CC
# [[3]] GGGG
# [[4]] GGGG
An alternate way to compute Ltrim
and Rtrim
:
ovranges <- overlapsRanges(q, s, hits)
Ltrim <- start(ovranges) - start(s)[subjectHits(hits)]
Rtrim <- end(s)[subjectHits(hits)] - end(ovranges)
Does this express things more clearly @lawremi ? Would it help if overlapsRanges()
was named rangesOfOverlaps()
instead?
rangesOfOverlaps()
sounds OK to me. Similar to its name when it was ranges,Hits()
.
Hi Hervé @hpages and Michael @lawremi,
Thanks for looking at equisplit. Your generalized equisplit3 seems great. What is the next step, would you add it to IRanges?
Alexandre
Yep I can take care of it. Will let you know when it's done.
Hi @alexandremkuhn, @lawremi, This is now in IRanges 2.13.16: https://github.com/Bioconductor/IRanges Cheers, H.
Great, thanks! I'll close the PR then.
Following a previous pull request (https://github.com/lawremi/VariantTools/pull/1), Michael Lawrence suggested to add an equisplit function (named epGRanges in the previous PR) to GenomicRanges (instead of VariantTools).
Proposed changes include an equisplit generic in S4Vectors, an equisplit function (for GenomicRanges objects) in GenomicRanges and minor changes in VariantTools to accommodate param@bamTallyParam@which as GRangesList (in addition to GenomicRanges).