Closed sa-lee closed 6 years ago
Hi @sa-lee
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: plyranges
Type: Package
Title: A fluent interface for manipulating GenomicRanges
Version: 0.1.0
Authors@R: c(
person("Stuart", "Lee", email = "lee.s@wehi.edu.au", role = c("aut", "cre")),
person("Michael", "Lawrence", role = c("aut", "ctb")),
person("Dianne", "Cook", role = c("aut", "ctb"))
)
Maintainer: Stuart Lee <stuart.lee1@monash.edu>
Description: A dplyr-like interface for interacting with the common Bioconductor
classes Ranges and GenomicRanges. By providing a grammatical
and consistent way of manipulating these classes their accessiblity for new
Bioconductor users is hopefully increased.
Depends:
R (>= 3.4.2),
methods,
BiocGenerics,
IRanges (>= 2.12.0),
GenomicRanges (>= 1.28.4)
Imports:
dplyr,
rlang (>= 0.2.0),
magrittr,
tidyr,
tidyselect,
rtracklayer,
GenomicAlignments,
GenomeInfoDb,
Rsamtools,
S4Vectors,
utils
biocViews: Infrastructure, DataRepresentation, WorkflowStep, Coverage
License: Artistic-2.0
Encoding: UTF-8
ByteCompile: true
Suggests:
knitr,
BiocStyle,
rmarkdown,
testthat,
ggplot2,
HelloRanges,
HelloRangesData,
BSgenome.Hsapiens.UCSC.hg19,
pasillaBamSubset,
covr
VignetteBuilder: knitr
RoxygenNote: 6.0.1
A reviewer has been assigned to your package Learn What to Expect during the review process.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
ae0000c version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
6fc1d7e version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
fa9078e doc fixes (clean up links); change R version; pkg ...
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
I'm only getting an error now about not being subscribed to dev mailing list (I've signed up but not received a notification yet) but after that I think the packages should be able to complete the check on all three platforms. Looking forward to the review process 😄
Received a valid push; starting a build. Commits are:
d00faf9 always forget that version bump :)
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
I think the warning is more to do with windows build machine than any mistake on my end and I am not sure how to fix it - any ideas?
You can ignore that warning for now - I will take a look at the package soon and have a review by early next week at the latest.
Thanks @lshep! Looking forward to it. Pinging @lawremi so he can be kept in the loop too.
This package looks good and really interesting. Minor note: Please use git rm
and remove the plyranges.Rproj from being tracked in the git tree. You will have to commit this changes as well. You can have this file locally but it should be in the .gitignore and not version controlled. @hpages any comments from the GRanges/IRanges points of view?
Received a valid push; starting a build. Commits are:
c98bc52 version bump; no longer tracking .Rproj
Thanks! I've removed the .Rproj and added it to .gitignore
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Hi @sa-lee ,
Lori (@lshep ) will provide a formal review.
I took a quick glance at plyranges and noticed 3 things:
Some important changes were made in BioC 3.7 to the IRanges and GRanges class hierarchies. In particular the Ranges class in BioC 3.6 got renamed IntegerRanges in BioC 3.7. The Ranges class in BioC 3.7 is actually a new virtual class that is the common parent of the IntegerRanges, GenomicRanges, and GAlignments classes. We didn't have a common parent for these 3 classes before.
You probably want to revisit your use of Ranges in the light of that. For example, here you only need one method, the method for Ranges objects:
> plyranges:::set_width.Ranges
function (x, width = 0L)
{
width(x) <- width
x
}
<bytecode: 0x29e36918>
<environment: namespace:plyranges>
> plyranges:::set_width.GenomicRanges
function (x, width = 0L)
{
width(x) <- width
x
}
<bytecode: 0x1b693a60>
<environment: namespace:plyranges>
You have a lot of methods defined for Ranges and GenomicRanges objects. Everywhere the foo.Ranges
and foo.GenomicRanges
methods are identical (like with set_width.Ranges
and set_width.GenomicRanges
above), you only need the 1st one. It will work on any Ranges derivative for which width<-
works. If the methods are not identical, like in the case of reduce_ranges.Ranges
and reduce_ranges.GenomicRanges
, it's probably safer to replace reduce_ranges.Ranges
with reduce_ranges.IntegerRanges
since the reduce_ranges.Ranges
method is only guaranteed to do the right thing on an IntegerRanges derivative but not on a range-based object that inherits from Ranges in general.
The anchoring mechanism only works properly on IRanges and GRanges instances. When used on an IRanges or GRanges derivative, it returns an IRanges or GRanges instance which is not what one would expect. For example, with a VRanges object (derives from GRanges):
> library(VariantAnnotation)
> example(VRanges)
> set_width(anchor_start(vr), 20)
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | tumorSpecific
<Rle> <IRanges> <Rle> | <logical>
[1] chr1 1-20 * | FALSE
[2] chr2 10-29 * | TRUE
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Why would you use ignore.strand=TRUE
in your methods when the corresponding "native" method defined in GenomicRanges uses ignore.strand=FALSE
by default? That means that by default disjoin_ranges()
, reduce_ranges()
, find_overlaps()
, join_nearest()
etc... will produce results inconsistent with disjoin()
, reduce()
, findOverlaps()
, nearest()
etc... which will probably cause a lot of confusion. I realize that you also provide disjoin_ranges_directed()
, reduce_ranges_directed()
, find_overlaps_directed()
, etc... for those users who actually want to perform the strand-aware operation. However it sounds like having things the other way around (i.e. the functions with the short names behave like the "native" methods and those with the long/specialized names diverge from what the "native" methods do) would avoid a lot of confusion.
H.
Thanks for this feedback. (3) was definitely intentional. The inconsistency is unfortunate, but this is research into designing an API from a blank slate, only without sacrificing compatibility with existing Bioconductor data structures.
One more thing:
> query <- GRanges(c(a="chr1:1-17", b="chr1:1-15"))
> subject <- GRanges(c("chr1:15-20", "chr1:50-60", "chr1:16-20"))
> find_overlaps(query, subject)
GRanges object with 3 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
a chr1 1-17 *
a chr1 1-17 *
b chr1 1-15 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Is this duplication expected? The man page for find_overlaps
suggests it's not:
'find_overlaps' will search for any overlaps between ranges x and y and return a ranges
object of the same length as x but with additional metadata colums corresponding to the
metadata columns in y.
Assuming it is (and that the man page is inaccurate): In the absence of information about what ranges in the subject are actually hit by "a", the user cannot really make any meaningful interpretation of the repetition of range "a" in the result (except that "range a got 2 hits"). If query contains unnamed duplicated ranges, the result becomes even harder to interpret:
> query <- GRanges(c("chr1:1-17", "chr1:1-16", "chr1:1-16"))
> subject <- GRanges(c("chr1:15-20", "chr1:50-60", "chr1:16-20"))
> find_overlaps(query, subject)
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-17 *
[2] chr1 1-17 *
[3] chr1 1-16 *
[4] chr1 1-16 *
[5] chr1 1-16 *
[6] chr1 1-16 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
Of course it will help if subject
has a metadata column with some range identifier in it:
> mcols(subject)$label <- LETTERS[24:26]
> find_overlaps(query, subject)
GRanges object with 6 ranges and 1 metadata column:
seqnames ranges strand | label
<Rle> <IRanges> <Rle> | <character>
[1] chr1 1-17 * | X
[2] chr1 1-17 * | Z
[3] chr1 1-16 * | X
[4] chr1 1-16 * | Z
[5] chr1 1-16 * | X
[6] chr1 1-16 * | Z
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
However: (1) that's not always going to be the case, and (2) experience has shown that users are prompt to assume that a metadata column can be used as a unique range identifier when it's actually unsafe to do so.
H.
@lawremi What about having things the other way around for (3)?
Many people find the current strand-specific behavior to be surprising, so we are trying it the other way. Obviously, this is not the only inconsistency, and the inconsistencies will surprise some, but again, we're ignoring that for now, because it would just be too constraining otherwise.
Your comments will be very helpful. I'm going to chat with Stuart about them later today.
Thanks for your feedback @hpages - I'll definitely fix up your points (1) and (2). Regarding the overlaps the behavior is expected but agree that the documentation is unclear and in the case you are talking about a user would probably want to use group_by_overlaps()
. I'll try and fix that up.
When genomic features are stranded (e.g. exons), they are represented with a stranded GRanges (i.e. a GRanges where the strand is +
or -
). If they are not (e.g. genomic bins), they are represented with an unstranded GRanges (i.e. the strand is *
). By default findOverlaps()
will not consider that an exon on the plus strand overlaps with a transcript on the minus strand, which is the right thing to do. But it will report that my exon overlaps with some of the bins in the subject, whatever the strand of my exon is, which is also the right thing to do. Maybe that surprises some (I don't know about "many"), but there will always be people to be surprised by something. What is guaranteed is that using the opposite default for ignore.strand
will introduce even more surprise and confusion.
Said otherwise: If I'm feeding findOverlaps()
with a stranded query and subject, there is no reason why findOverlaps()
would assume that I made a mistake and that I meant to give it unstranded objects. This is like the data.frame()
constructor in base R assuming that I meant to give it a factor when I gave it a character vector. Not the greatest feature.
The North said: "we want ignore.strand=FALSE
". Now the South is saying "we want ignore.strand=TRUE
". Bioconductor wants to remain a united nation.
The simpler overlap constraint is the more intuitive default. The explicit function naming will make the behavior more obvious and minimize confusion.
My ranges are directed. Why do I need to call find_overlaps_directed
to operate on them? Why can't find_overlaps
see that my ranges are directed and treat them as such? Sorry but I don't find this intuitive.
Also what's intuitive about this:
> intersect_ranges(GRanges("chr2:11-30:+"), GRanges("chr2:21-35:-"))
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr2 21-30 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
The 1st feature is on the plus strand only. The 2nd feature is on the minus strand only. So they don't intersect.
I think we can agree that we have different opinions of what is intuitive. The real issue is whether the inconsistency is OK. Again, this package is exploring how we can improve on the current API, so conflicts are inevitable.
The output of intersect_ranges
is wrong. It's not an opinion.
Inconsistencies are not OK when they can easily be avoided. Isn't that the case here? What about exploring how you can improve on the current API without introducing this conflict? That will instantly increase a lot the value of your offer at a very small cost.
Received a valid push; starting a build. Commits are:
558b886 first pass anchor restructuring 38f6fea to IntegerRanges for Bioc3.7 d5e0db0 clarify overlaps documentation 4f4d63a delegating anchors class 65135dd code tidy up 4bb648e typo fix 738625b make sure class can be created on the fly 786c23a fixes bam tests 67d74f0 version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
@lshep I've incorporated the suggestions above - is there anything else I should modify or change?
Except of course for the ignore.strand thing ;)
I will take a look at everything in the next day or two. Thank you.
Received a valid push; starting a build. Commits are:
12dfe97 refactor of AnchoredRanges 9ffab15 refactoring of GroupedRanges c2cb2c6 modifications to dplyr verbs based on group_by ref... d53ebf4 updates for delegating classes 909dfa5 changes for group_by refactoring 4df03a9 namespace changes for grouping and delegating 44787e6 namespace fixes; version bump 0e88638 split groups function no longer required
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
The code in general is in very good shape and well documented. My only note is that there is still a lot of repeated code that could (and should) be avoided.
An example in dplyr-select.R select.DelegatingIntegerRanges and select.DelegatingGenomicRanges is the exact same code. We would urge to do something like keep select.DelegatingGenomicRanges as is and then redefine select.DelegatingIntegerRanges as
select.DelegatingIntegerRanges <- select.DelegatingGenomicRanges
Another example with the assist from Martin - the dplyr-mutate.R mutate.GroupedIntegerRanges and mutate.GroupedGenomicRanges is the same code except in the new() call - so a rewrite similar to something of the like where the majority of code is only in one function
.mutate <- function(.data, ..., NEW) {
dots <- quos(...)
delegate <- .data@delegate
inx <- .data@inx
rng <- lapply(inx, function(i) {
x <- delegate[i]
mutate_rng(x, dots)
})
rng <- bind_ranges(rng)[order(unlist(inx))]
NEW(.data, rng, inx)
}
.new.GroupedGenomicRanges <- function(.data, rng, inx)
new("GroupedGenomicRanges",
elementMetadata = .data@elementMetadata,
delegate = rng,
groups = groups(.data),
inx = inx)
.new.GroupedIntegerRanges <- function(.data, rng, inx)
new("GroupedIntegerRanges",
delegate = rng,
groups = groups(.data),
inx = inx)
#' @rdname mutate-ranges
#' @method mutate GroupedGenomicRanges
#' @export
mutate.GroupedGenomicRanges <- function(.data, ...)
.mutate(.data, ..., NEW = .new.GroupedGenomicRanges)
#' @rdname mutate-ranges
#' @method mutate GroupedIntegerRanges
#' @export
mutate.GroupedIntegerRanges <- function(.data, ...) {
.mutate(.data, ..., NEW = .new.GroupedIntegerRanges)
Repeated code should be avoided - it makes code difficult to read and also leaves opportunity to have to fix or modify code in multiple locations. This should be applied throughout the R code however you deem best.
Cheers, Lori
Maybe it could even simpler. It's not clear why GroupedGenomicRanges needs to populate its @elementMetadata
. Can't that just be left empty, with the metadata being stored by the delegate? In other words, there may be no need for NEW=
; instead, just use initialize()
.
Thanks for the suggestions @lshep! I'll get onto it today. @lawremi - I needed to populate the elementMetadata
slot because otherwise I got a validObject
error .
The problem is that GenomicRanges recently came to assert DataFrame (disallowing NULL) for the type of @elementMetadata
, so we get an illegal, empty DataFrame from the prototype. Since every Vector will have @elementMetadata
, there is basically no clean way to make a delegating Vector derivative. In the devel S4Vectors, I changed the validity function to call mcols(x)
instead of using direct slot access. I do like keeping validity checks at the slot level, but since there is no "pure virtual" Vector analog, this is a reasonable compromise.
Ok that makes sense - I'll update with an initialize method then.
Received a valid push; starting a build. Commits are:
bd3adc6 reducing code duplication; requires S4Vectors >= 0... 7a554d7 reduce code repetition 840faeb version bump 483869e update namespace b4d8595 significant refactoring to reduce code duplication 60dbd49 fixes and refactoring for reduce methods
I've refactored quite a lot of the code now to reduce duplication. Thanks for your suggestions @lshep!
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.
Thank you for contributing to Bioconductor!
The master branch of your GitHub repository has been added to Bioconductor's git repository.
To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/sa-lee.keys is not empty), then no further steps are required. Otherwise, do the following:
See further instructions at
https://bioconductor.org/developers/how-to/git/
for working with this repository. See especially
https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/
to keep your GitHub and Bioconductor repositories in sync.
Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at
https://bioconductor.org/checkResults/
(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"plyranges\")
. The package 'landing page' will be created at
https://bioconductor.org/packages/plyranges
If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.