Bioconductor / Contributions

Contribute Packages to Bioconductor
134 stars 33 forks source link

plyranges #670

Closed sa-lee closed 6 years ago

sa-lee commented 6 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 6 years ago

Hi @sa-lee

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: plyranges
Type: Package
Title: A fluent interface for manipulating GenomicRanges
Version: 0.1.0
Authors@R: c(
      person("Stuart", "Lee", email = "lee.s@wehi.edu.au", role = c("aut", "cre")),
      person("Michael", "Lawrence", role = c("aut", "ctb")),
      person("Dianne", "Cook", role = c("aut", "ctb"))
    )
Maintainer: Stuart Lee <stuart.lee1@monash.edu>
Description: A dplyr-like interface for interacting with the common Bioconductor 
    classes Ranges and GenomicRanges. By providing a grammatical
    and consistent way of manipulating these classes their accessiblity for new
    Bioconductor users is hopefully increased.
Depends:
  R (>= 3.4.2),
  methods,
  BiocGenerics,
  IRanges (>= 2.12.0),
  GenomicRanges (>= 1.28.4)
Imports:
  dplyr, 
  rlang (>= 0.2.0), 
  magrittr,
  tidyr,
  tidyselect,
  rtracklayer,
  GenomicAlignments,
  GenomeInfoDb,
  Rsamtools,
  S4Vectors, 
  utils
biocViews: Infrastructure, DataRepresentation, WorkflowStep, Coverage
License: Artistic-2.0
Encoding: UTF-8
ByteCompile: true
Suggests: 
    knitr,
    BiocStyle,
    rmarkdown,
    testthat,
    ggplot2,
    HelloRanges,
    HelloRangesData,
    BSgenome.Hsapiens.UCSC.hg19,
    pasillaBamSubset,
    covr
VignetteBuilder: knitr
RoxygenNote: 6.0.1
bioc-issue-bot commented 6 years ago

A reviewer has been assigned to your package Learn What to Expect during the review process.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

ae0000c version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

6fc1d7e version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

fa9078e doc fixes (clean up links); change R version; pkg ...

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

sa-lee commented 6 years ago

I'm only getting an error now about not being subscribed to dev mailing list (I've signed up but not received a notification yet) but after that I think the packages should be able to complete the check on all three platforms. Looking forward to the review process 😄

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

d00faf9 always forget that version bump :)

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

sa-lee commented 6 years ago

I think the warning is more to do with windows build machine than any mistake on my end and I am not sure how to fix it - any ideas?

lshep commented 6 years ago

You can ignore that warning for now - I will take a look at the package soon and have a review by early next week at the latest.

sa-lee commented 6 years ago

Thanks @lshep! Looking forward to it. Pinging @lawremi so he can be kept in the loop too.

lshep commented 6 years ago

This package looks good and really interesting. Minor note: Please use git rm and remove the plyranges.Rproj from being tracked in the git tree. You will have to commit this changes as well. You can have this file locally but it should be in the .gitignore and not version controlled. @hpages any comments from the GRanges/IRanges points of view?

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

c98bc52 version bump; no longer tracking .Rproj

sa-lee commented 6 years ago

Thanks! I've removed the .Rproj and added it to .gitignore

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

hpages commented 6 years ago

Hi @sa-lee ,

Lori (@lshep ) will provide a formal review.

I took a quick glance at plyranges and noticed 3 things:

  1. Some important changes were made in BioC 3.7 to the IRanges and GRanges class hierarchies. In particular the Ranges class in BioC 3.6 got renamed IntegerRanges in BioC 3.7. The Ranges class in BioC 3.7 is actually a new virtual class that is the common parent of the IntegerRanges, GenomicRanges, and GAlignments classes. We didn't have a common parent for these 3 classes before.

    You probably want to revisit your use of Ranges in the light of that. For example, here you only need one method, the method for Ranges objects:

    > plyranges:::set_width.Ranges
    function (x, width = 0L)
    {
        width(x) <- width
        x
    }
    <bytecode: 0x29e36918>
    <environment: namespace:plyranges>
    > plyranges:::set_width.GenomicRanges
    function (x, width = 0L)
    {   
        width(x) <- width
        x
    }
    <bytecode: 0x1b693a60>
    <environment: namespace:plyranges>

    You have a lot of methods defined for Ranges and GenomicRanges objects. Everywhere the foo.Ranges and foo.GenomicRanges methods are identical (like with set_width.Ranges and set_width.GenomicRanges above), you only need the 1st one. It will work on any Ranges derivative for which width<- works. If the methods are not identical, like in the case of reduce_ranges.Ranges and reduce_ranges.GenomicRanges, it's probably safer to replace reduce_ranges.Ranges with reduce_ranges.IntegerRanges since the reduce_ranges.Ranges method is only guaranteed to do the right thing on an IntegerRanges derivative but not on a range-based object that inherits from Ranges in general.

  2. The anchoring mechanism only works properly on IRanges and GRanges instances. When used on an IRanges or GRanges derivative, it returns an IRanges or GRanges instance which is not what one would expect. For example, with a VRanges object (derives from GRanges):

    > library(VariantAnnotation)
    > example(VRanges)
    > set_width(anchor_start(vr), 20)
    GRanges object with 2 ranges and 1 metadata column:
          seqnames    ranges strand | tumorSpecific
             <Rle> <IRanges>  <Rle> |     <logical>
      [1]     chr1      1-20      * |         FALSE
      [2]     chr2     10-29      * |          TRUE
      -------
      seqinfo: 2 sequences from an unspecified genome; no seqlengths
  3. Why would you use ignore.strand=TRUE in your methods when the corresponding "native" method defined in GenomicRanges uses ignore.strand=FALSE by default? That means that by default disjoin_ranges(), reduce_ranges(), find_overlaps(), join_nearest() etc... will produce results inconsistent with disjoin(), reduce(), findOverlaps(), nearest() etc... which will probably cause a lot of confusion. I realize that you also provide disjoin_ranges_directed(), reduce_ranges_directed(), find_overlaps_directed(), etc... for those users who actually want to perform the strand-aware operation. However it sounds like having things the other way around (i.e. the functions with the short names behave like the "native" methods and those with the long/specialized names diverge from what the "native" methods do) would avoid a lot of confusion.

H.

lawremi commented 6 years ago

Thanks for this feedback. (3) was definitely intentional. The inconsistency is unfortunate, but this is research into designing an API from a blank slate, only without sacrificing compatibility with existing Bioconductor data structures.

hpages commented 6 years ago

One more thing:

> query <- GRanges(c(a="chr1:1-17", b="chr1:1-15"))
> subject <- GRanges(c("chr1:15-20", "chr1:50-60", "chr1:16-20"))
> find_overlaps(query, subject)
GRanges object with 3 ranges and 0 metadata columns:
    seqnames    ranges strand
       <Rle> <IRanges>  <Rle>
  a     chr1      1-17      *
  a     chr1      1-17      *
  b     chr1      1-15      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Is this duplication expected? The man page for find_overlaps suggests it's not:

'find_overlaps' will search for any overlaps between ranges x and y and return a ranges
object of the same length as x but with additional metadata colums corresponding to the
metadata columns in y.

Assuming it is (and that the man page is inaccurate): In the absence of information about what ranges in the subject are actually hit by "a", the user cannot really make any meaningful interpretation of the repetition of range "a" in the result (except that "range a got 2 hits"). If query contains unnamed duplicated ranges, the result becomes even harder to interpret:

> query <- GRanges(c("chr1:1-17", "chr1:1-16", "chr1:1-16"))
> subject <- GRanges(c("chr1:15-20", "chr1:50-60", "chr1:16-20"))
> find_overlaps(query, subject)
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr1      1-17      *
  [2]     chr1      1-17      *
  [3]     chr1      1-16      *
  [4]     chr1      1-16      *
  [5]     chr1      1-16      *
  [6]     chr1      1-16      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Of course it will help if subject has a metadata column with some range identifier in it:

> mcols(subject)$label <- LETTERS[24:26]
> find_overlaps(query, subject)
GRanges object with 6 ranges and 1 metadata column:
      seqnames    ranges strand |       label
         <Rle> <IRanges>  <Rle> | <character>
  [1]     chr1      1-17      * |           X
  [2]     chr1      1-17      * |           Z
  [3]     chr1      1-16      * |           X
  [4]     chr1      1-16      * |           Z
  [5]     chr1      1-16      * |           X
  [6]     chr1      1-16      * |           Z
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

However: (1) that's not always going to be the case, and (2) experience has shown that users are prompt to assume that a metadata column can be used as a unique range identifier when it's actually unsafe to do so.

H.

hpages commented 6 years ago

@lawremi What about having things the other way around for (3)?

lawremi commented 6 years ago

Many people find the current strand-specific behavior to be surprising, so we are trying it the other way. Obviously, this is not the only inconsistency, and the inconsistencies will surprise some, but again, we're ignoring that for now, because it would just be too constraining otherwise.

Your comments will be very helpful. I'm going to chat with Stuart about them later today.

sa-lee commented 6 years ago

Thanks for your feedback @hpages - I'll definitely fix up your points (1) and (2). Regarding the overlaps the behavior is expected but agree that the documentation is unclear and in the case you are talking about a user would probably want to use group_by_overlaps(). I'll try and fix that up.

hpages commented 6 years ago

When genomic features are stranded (e.g. exons), they are represented with a stranded GRanges (i.e. a GRanges where the strand is + or -). If they are not (e.g. genomic bins), they are represented with an unstranded GRanges (i.e. the strand is *). By default findOverlaps() will not consider that an exon on the plus strand overlaps with a transcript on the minus strand, which is the right thing to do. But it will report that my exon overlaps with some of the bins in the subject, whatever the strand of my exon is, which is also the right thing to do. Maybe that surprises some (I don't know about "many"), but there will always be people to be surprised by something. What is guaranteed is that using the opposite default for ignore.strand will introduce even more surprise and confusion.

Said otherwise: If I'm feeding findOverlaps() with a stranded query and subject, there is no reason why findOverlaps() would assume that I made a mistake and that I meant to give it unstranded objects. This is like the data.frame() constructor in base R assuming that I meant to give it a factor when I gave it a character vector. Not the greatest feature.

The North said: "we want ignore.strand=FALSE". Now the South is saying "we want ignore.strand=TRUE". Bioconductor wants to remain a united nation.

lawremi commented 6 years ago

The simpler overlap constraint is the more intuitive default. The explicit function naming will make the behavior more obvious and minimize confusion.

hpages commented 6 years ago

My ranges are directed. Why do I need to call find_overlaps_directed to operate on them? Why can't find_overlaps see that my ranges are directed and treat them as such? Sorry but I don't find this intuitive.

Also what's intuitive about this:

> intersect_ranges(GRanges("chr2:11-30:+"), GRanges("chr2:21-35:-"))
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr2     21-30      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The 1st feature is on the plus strand only. The 2nd feature is on the minus strand only. So they don't intersect.

lawremi commented 6 years ago

I think we can agree that we have different opinions of what is intuitive. The real issue is whether the inconsistency is OK. Again, this package is exploring how we can improve on the current API, so conflicts are inevitable.

hpages commented 6 years ago

The output of intersect_ranges is wrong. It's not an opinion.

Inconsistencies are not OK when they can easily be avoided. Isn't that the case here? What about exploring how you can improve on the current API without introducing this conflict? That will instantly increase a lot the value of your offer at a very small cost.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

558b886 first pass anchor restructuring 38f6fea to IntegerRanges for Bioc3.7 d5e0db0 clarify overlaps documentation 4f4d63a delegating anchors class 65135dd code tidy up 4bb648e typo fix 738625b make sure class can be created on the fly 786c23a fixes bam tests 67d74f0 version bump

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

sa-lee commented 6 years ago

@lshep I've incorporated the suggestions above - is there anything else I should modify or change?

lawremi commented 6 years ago

Except of course for the ignore.strand thing ;)

lshep commented 6 years ago

I will take a look at everything in the next day or two. Thank you.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

12dfe97 refactor of AnchoredRanges 9ffab15 refactoring of GroupedRanges c2cb2c6 modifications to dplyr verbs based on group_by ref... d53ebf4 updates for delegating classes 909dfa5 changes for group_by refactoring 4df03a9 namespace changes for grouping and delegating 44787e6 namespace fixes; version bump 0e88638 split groups function no longer required

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

6a1cafd import methods::as 1c8c6c5 and namespace

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

lshep commented 6 years ago

The code in general is in very good shape and well documented. My only note is that there is still a lot of repeated code that could (and should) be avoided.

An example in dplyr-select.R select.DelegatingIntegerRanges and select.DelegatingGenomicRanges is the exact same code. We would urge to do something like keep select.DelegatingGenomicRanges as is and then redefine select.DelegatingIntegerRanges as select.DelegatingIntegerRanges <- select.DelegatingGenomicRanges

Another example with the assist from Martin - the dplyr-mutate.R mutate.GroupedIntegerRanges and mutate.GroupedGenomicRanges is the same code except in the new() call - so a rewrite similar to something of the like where the majority of code is only in one function

.mutate <- function(.data, ..., NEW) {
  dots <- quos(...)
  delegate <- .data@delegate
  inx <- .data@inx
  rng <- lapply(inx, function(i) {
    x <- delegate[i]
    mutate_rng(x, dots)

  })

  rng <- bind_ranges(rng)[order(unlist(inx))]

  NEW(.data, rng, inx)
}

.new.GroupedGenomicRanges <- function(.data, rng, inx)
  new("GroupedGenomicRanges",
      elementMetadata =  .data@elementMetadata, 
      delegate = rng, 
      groups =  groups(.data), 
      inx = inx)

.new.GroupedIntegerRanges <- function(.data, rng, inx)
  new("GroupedIntegerRanges",
      delegate = rng, 
      groups =  groups(.data), 
      inx = inx)

#' @rdname mutate-ranges
#' @method mutate GroupedGenomicRanges
#' @export
mutate.GroupedGenomicRanges <- function(.data, ...)
  .mutate(.data, ..., NEW = .new.GroupedGenomicRanges)

#' @rdname mutate-ranges
#' @method mutate GroupedIntegerRanges
#' @export
mutate.GroupedIntegerRanges <- function(.data, ...) {
  .mutate(.data, ..., NEW = .new.GroupedIntegerRanges)

Repeated code should be avoided - it makes code difficult to read and also leaves opportunity to have to fix or modify code in multiple locations. This should be applied throughout the R code however you deem best.

Cheers, Lori

lawremi commented 6 years ago

Maybe it could even simpler. It's not clear why GroupedGenomicRanges needs to populate its @elementMetadata. Can't that just be left empty, with the metadata being stored by the delegate? In other words, there may be no need for NEW=; instead, just use initialize().

sa-lee commented 6 years ago

Thanks for the suggestions @lshep! I'll get onto it today. @lawremi - I needed to populate the elementMetadata slot because otherwise I got a validObject error .

lawremi commented 6 years ago

The problem is that GenomicRanges recently came to assert DataFrame (disallowing NULL) for the type of @elementMetadata, so we get an illegal, empty DataFrame from the prototype. Since every Vector will have @elementMetadata, there is basically no clean way to make a delegating Vector derivative. In the devel S4Vectors, I changed the validity function to call mcols(x) instead of using direct slot access. I do like keeping validity checks at the slot level, but since there is no "pure virtual" Vector analog, this is a reasonable compromise.

sa-lee commented 6 years ago

Ok that makes sense - I'll update with an initialize method then.

bioc-issue-bot commented 6 years ago

Received a valid push; starting a build. Commits are:

bd3adc6 reducing code duplication; requires S4Vectors >= 0... 7a554d7 reduce code repetition 840faeb version bump 483869e update namespace b4d8595 significant refactoring to reduce code duplication 60dbd49 fixes and refactoring for reduce methods

sa-lee commented 6 years ago

I've refactored quite a lot of the code now to reduce duplication. Thanks for your suggestions @lshep!

bioc-issue-bot commented 6 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 6 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.

Thank you for contributing to Bioconductor!

mtmorgan commented 6 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/sa-lee.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using biocLite(\"plyranges\"). The package 'landing page' will be created at

https://bioconductor.org/packages/plyranges

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.