StructuralVariantAnnotation #1079

Closed d-cameron closed 5 years ago

d-cameron commented 5 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 5 years ago

Hi @d-cameron

Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.

The DESCRIPTION file for this package is:

Package: StructuralVariantAnnotation
Type: Package
Title: Variant annotations for structural variants
Version: 0.99.0
Date: 2019-04-05
Authors@R: c(
    person("Daniel", "Cameron", email="daniel.l.cameron@gmail.com", role=c("aut", "cre"), comment=c(ORCID = "0000-0002-0951-7116")),
    person("Ruining", "Dong", email="dong.rn@wehi.edu.au", role=c("aut"), comment=c(ORCID = "0000-0003-1433-0484")))
Description: StructuralVariantAnnotation contains useful helper
    functions for dealing with structural variants in VCF format.
    The packages contains functions for parsing VCFs from a number
    of popular callers as well as functions for dealing with 
    breakpoints involving two separate genomic loci encoded as
    GRanges objects.
License: GPL-3
Depends:
    GenomicRanges,
    rtracklayer,
  VariantAnnotation,
  BiocGenerics
Imports:
    assertthat,
    Biostrings,
    stringr,
    dplyr
Suggests:
    BSgenome.Hsapiens.UCSC.hg19,
    devtools,
    testthat,
    roxygen2,
    covr,
    knitr,
    plyranges,
    dplyr,
    ggbio,
    biovizBase,
    circlize
RoxygenNote: 6.1.1
VignetteBuilder: knitr
biocViews: DataImport, Sequencing, Annotation, Genetics, VariantAnnotation

Add SSH keys to your GitHub account. SSH keys are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.

bioc-issue-bot commented 5 years ago

A reviewer has been assigned to your package. Learn what to expect during the review process.

IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

d-cameron commented 5 years ago

Hmm. What's the best way to proceed when my package builds without errors or warnings against release dependencies, but fails in devel due to dependencies? In this case, a test/example VCF parses in release, but in devel I get the following error when I call VariantAnnotation::readVcf() in the vignette: invalid class "VCFHeader" object: 'info(VCFHeader)' must be a 3 column DataFrame with names Number, Type, Description.

hpages commented 5 years ago

Hi @d-cameron ,

Thanks for this submission.

Please focus on BioC devel. There have been major changes to Rsamtools and VariantAnnotation between BioC 3.8 (release) and 3.9 (devel) that could explain these differences. Also let's ignore Windows for now: there is a known Windows-specific issue with the latest Rsamtools and VariantAnnotation in devel that is still under investigation.

Best, H.

lawremi commented 5 years ago

The overlap functionality could greatly benefit from using features of the Hits objects, instead of reducing to data.frames and resorting to dplyr. For example, in findBreakpointOverlaps(), comments mention that duplicated(hits) takes too much memory. Is that just because hits is a data.frame? The duplicated,Hits() method is very efficient. I think intersect(x_hits, partner_hits) would be a better approach.
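
A minimal sketch of the intersect() idea suggested above, using toy Hits objects rather than the package's real data. The object names and index values are hypothetical, and the sketch assumes the partner hits have already been re-indexed so that both objects refer to the same (query, subject) breakend indices:

```r
library(S4Vectors)

# Hypothetical toy data: overlaps between query and subject breakends ...
x_hits <- Hits(from = c(1L, 1L, 2L), to = c(1L, 2L, 3L),
               nLnode = 2L, nRnode = 3L, sort.by.query = TRUE)

# ... and overlaps between their partner breakends, expressed in the
# same index space (an assumption of this sketch).
partner_hits <- Hits(from = c(1L, 2L, 2L), to = c(2L, 3L, 1L),
                     nLnode = 2L, nRnode = 3L, sort.by.query = TRUE)

# A breakpoint pair matches only if both of its ends overlap, i.e. the
# (query, subject) pair is present in both Hits objects.
both_ends <- intersect(x_hits, partner_hits)
length(both_ends)
```

Both objects must have the same query and subject lengths for the set operation to be valid, which is why nLnode and nRnode are identical here.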

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

b67d361 version bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

269c5ce remove test file bugs
d813bba version bump Merge branch 'master' of https://git...

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

9494b6b version bump 0.99.3

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

7420bba update examples
d5e4e0c update examples and version bump
79b93c6 update package documentation Merge branch 'master...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

4b34ffc findBreakpointOverlaps() now returns a Hits object
17b4c8a Version bump

d-cameron commented 5 years ago

@lawremi that also had the advantage of the return type matching findOverlaps(), which is a nicer design, so I've refactored that code.

Attempting to do the obvious and concatenate the two hits objects with S4Vectors::c() fails with the error:

Error in validObject(ans) : 
  invalid class “SortedByQueryHits” object: 'queryHits(x)' must be sorted

Is this something I should raise an issue for? I would have expected either a zipper merger returning a SortedByQueryHits, or the return class to be demoted to Hits since the concatenated sequences are not in order.

Judging by performance in profvis, neither intersect() nor duplicated() has any optimisations for the sorted nature of the SortedByQueryHits returned by findOverlaps() (which is somewhat strange given that duplicated() requires the Hits to be queryHits sorted), so I just went with intersect() as it's actually the operation that I'm attempting to perform.
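
One possible way to combine two sorted Hits objects without tripping the validity check is to rebuild the result from the underlying indices and let the constructor re-sort it. This is only a sketch with hypothetical toy objects `a` and `b`; it is not the package's code, and it says nothing about how c() itself behaves in any given S4Vectors version:

```r
library(S4Vectors)

# Hypothetical toy inputs: two sorted Hits objects over the same index space.
a <- Hits(from = c(1L, 3L), to = c(2L, 1L),
          nLnode = 3L, nRnode = 2L, sort.by.query = TRUE)
b <- Hits(from = c(2L, 3L), to = c(1L, 2L),
          nLnode = 3L, nRnode = 2L, sort.by.query = TRUE)

# Pool the raw indices and construct a new object; sort.by.query = TRUE
# makes the constructor sort the pooled hits by query index.
merged <- Hits(from = c(from(a), from(b)),
               to = c(to(a), to(b)),
               nLnode = nLnode(a), nRnode = nRnode(a),
               sort.by.query = TRUE)
merged
```

This re-sorts the pooled hits rather than performing a true zipper merge, so it does not exploit the fact that both inputs are already sorted.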

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

3c217b1 Removed dplyr and stringr from NAMESPACE imports d...
a99ccfc Removing building warning

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS, skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

12bb692 version bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

8b6571c add documentation of S4 methods

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

09ac575 version bump

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

0cb9627 remove warnings and notes

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

a60259f remove NOTE on undefined global variables

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR, TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

e8def42 version bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

3e01a82 edit gitignore file
76d630d Merge branch 'master' of https://github.com/Papenf...
6cc7c28 recompile documents with value sections
6f8daf4 ver bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

10b9899 version bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

c6a1e8c ver bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

jackieduckie commented 5 years ago

Hi @lawremi @hpages , we have made changes on examples, vignettes and test files to address the last timeout/warning issue. It now passes on windows but remains checktime > 10min on linux and osx. The check time on my local mac machine is ~4min. We will try and improve the issue further, but meanwhile, could we proceed with the review with the timeout warning present?

Thanks, Ruining

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

966b324 version bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

a94652c checktime test

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

hpages commented 5 years ago

Hi @d-cameron ,

Glad to see that StructuralVariantAnnotation finally builds and passes check with no timeout or error on all platforms. I took a first look at the package and have some feedback. See below.

Best, H.

General

  1. I'm surprised by your choice to use a breakpoint GRanges (i.e. a GRanges object with a partner metadata column) to represent a set of structural variants with 2 breakends. The natural choice for representing compound genome features in Bioconductor (a.k.a. disjoint genome features in bedtools terminology) is to use a GRangesList object. For example, loading a BEDPE file in a GRangesList object can simply be done with:

    library(StructuralVariantAnnotation)
    bedpe.file <- system.file("extdata", "gridss.bedpe", package = "StructuralVariantAnnotation")
    pairs <- rtracklayer::import(bedpe.file)
    names(pairs) <- mcols(pairs)$name
    mcols(pairs)$name <- NULL
    mcols(first(pairs)) <- mcols(second(pairs)) <- mcols(pairs)
    grl <- zipup(pairs)
    grl
    # GRangesList object of length 2:
    # $gridss2o 
    # GRanges object with 2 ranges and 1 metadata column:
    #       seqnames    ranges strand |     score
    #          <Rle> <IRanges>  <Rle> | <numeric>
    #   [1]     chr1  18992158      + |        55
    #   [2]    chr12  84963533      - |        55
    # 
    # $gridss39o 
    # GRanges object with 2 ranges and 1 metadata column:
    #       seqnames  ranges strand |  score
    #   [1]    chr12   84350      - | 627.96
    #   [2]    chr12 4886681      + | 627.96
    #
    # -------
    # seqinfo: 2 sequences from an unspecified genome; no seqlengths

    It's easy to extract the 1st and 2nd genomic loci from this kind of object:

    unlist(heads(grl, n=1))
    # GRanges object with 2 ranges and 1 metadata column:
    #             seqnames    ranges strand |     score
    #                <Rle> <IRanges>  <Rle> | <numeric>
    #    gridss2o     chr1  18992158      + |        55
    #   gridss39o    chr12     84350      - |    627.96
    #   -------
    #  seqinfo: 2 sequences from an unspecified genome; no seqlengths
    
    unlist(tails(grl, n=1))
    # GRanges object with 2 ranges and 1 metadata column:
    #             seqnames    ranges strand |     score
    #                <Rle> <IRanges>  <Rle> | <numeric>
    #    gridss2o    chr12  84963533      - |        55
    #   gridss39o    chr12   4886681      + |    627.96
    #   -------
    #   seqinfo: 2 sequences from an unspecified genome; no seqlengths

    Generally speaking, it's easier to work with this kind of object than with a breakpoint GRanges. I would strongly suggest that you consider using this instead of breakpoint GRanges objects for representing structural variants with 2 breakends.

  2. I would have expected that calling breakpointgr2bedpe() on the breakpoint GRanges returned by bedpe2breakpointgr() would work:

    breakpointgr2bedpe(bedpe2breakpointgr(bedpe.file))
    # Error in data.frame(chrom1 = GenomeInfoDb::seqnames(gr), start1 = start(gr) -  : 
    #   arguments imply differing number of rows: 4, 0

  3. A common practice is to document opposite functions like bedpe2breakpointgr() and breakpointgr2bedpe() in the same man page. If for whatever reason you prefer to keep them in separate man pages, at least the 2 man pages should be cross-linked via a \seealso section in each page.

Vignette

  1. The code chunk in the Installation section shows how to install the package, not how to load it. However the text above the code chunk says:

    The StructuralVariationAnnotation package can be loaded from Bioconductor as follows:

    So it would need to be modified to say something like:

    The StructuralVariantAnnotation package can be installed from Bioconductor as follows:

  2. Instead of:

    Details of `VCF` objects can be found by `browseVignettes("VariantAnnotation")`.

    please consider saying something like:

    More information about VCF objects can be found by consulting the vignettes in
    the VariantAnnotation package (with `browseVignettes("VariantAnnotation")`).

  3. Maybe it would be worth pointing the reader to some fields of interest (e.g. MATEID, SVTYPE, etc.) after displaying the vcf object. Also maybe comment on some of these fields.

  4. Your findBreakpointOverlaps() and countBreakpointOverlaps() examples (in the "Exploring breakpoints" section) produce no overlaps so are not illustrative. Also please note that findBreakpointOverlaps() returns a Hits object, not a matrix.

  5. Please remove the parentheses in "Alternatively, plotting package such as ggbio() provides..." (ggbio is a package, not a function).

  6. This code:

    region.gr <- gr[2:3,] %>% dplyr::mutate(end=end+5) %>% dplyr::mutate(start=start+5)

    is unnecessarily complicated and hard to read. The same can be achieved with just:

    region.gr <- shift(gr[2:3,], 5)

    In addition to being more readable, the shift() solution is also more efficient (36x faster on my laptop):

    library(microbenchmark)
    gr0 <- gr[2:3]
    microbenchmark(gr0 %>% dplyr::mutate(end=end+5) %>% dplyr::mutate(start=start+5), shift(gr0, 5))
    # Unit: milliseconds
    #                                                                            expr
    #  gr0 %>% dplyr::mutate(end = end + 5) %>% dplyr::mutate(start = start +      5)
    #                                                                   shift(gr0, 5)
    #        min        lq      mean    median        uq       max neval
    #  60.207715 62.678722 64.550741 64.024463 64.970522 92.749330   100
    #   1.435635  1.739248  1.761021  1.750163  1.779301  1.963865   100

  7. Similar problem with this code:

    gr.circos <- gr %>% mutate(to.gr=GRanges(seqnames(partner(gr)),ranges(partner(gr))))

    This is better done with:

    gr.circos <- gr
    mcols(gr.circos)$to.gr <- granges(partner(gr))

    Using granges(partner(gr)) instead of GRanges(seqnames(partner(gr)),ranges(partner(gr))) also has the benefit of preserving the strand.

  8. Spelling:

    • from a number of popular caller (callers)
    • We can then plot the breakpoints agains referece genomes (against, reference)

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

cb584be Ver bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

d-cameron commented 5 years ago

> 1. I'm surprised by your choice to use a breakpoint GRanges (i.e. a GRanges object with a partner metadata column) to represent a set of structural variants with 2 breakends. The natural choice for representing compound genome features in Bioconductor (a.k.a. disjoint genome features in bedtools terminology) is to use a GRangesList object.

This is actually a core reason why I'm submitting this package: all four existing pair-of-genomic-coordinate data structures in Bioconductor have two parallel first-in-pair and second-in-pair structures of the same type (GRanges or something convertible to this). When doing SV analysis, I found these formats quite painful to deal with, as they result in a whole lot of code duplication: the vast majority of SV annotation that one actually uses is breakend-level annotation.

Fundamentally, there is no intrinsic 'first' and 'second' breakend for a breakpoint - both have equal weighting and the choice of which breakend gets allocated to the first is arbitrary. I believe that arbitrarily grouping breakends into two different subsets is the incorrect abstraction. Two variants that disrupt TP53 with identical breakend position and orientation should not have to be treated differently just because the partner of one of them occurs 'before' TP53 and the other occurs 'after'.

The GRanges approach is the closest data structure match to an SV record in VCF BND notation. I've advocated strongly for the BND format in the GA4GH file formats working group (of which I am a member), as I've found that the 'traditional' VCF symbolic allele SVTYPEs (DEL, INS, DUP, ...) are only useful for extremely simple analysis, and as soon as you start delving into more complicated analysis, you end up converting everything to breakend-focused BND-like notation.

The second reason I have chosen this format is that it more naturally supports single breakend variants (as defined in Section 5.4.9 of the VCF specification). The data structure chosen for this package allows both breakpoint variants and single breakend variants to be combined with minimal hacks - an NA partner is way easier to deal with than missing elements in a second-in-pair data structure.
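
To make the data structure discussed above concrete, here is a minimal sketch of a breakpoint GRanges built by hand. The breakend names, the single breakend record, and the stand-in gene interval are hypothetical; only the general layout (one row per breakend, a partner column naming the mate, NA for single breakends) follows the description in this thread:

```r
library(GenomicRanges)

# One row per breakend; 'partner' holds the name of the mate breakend.
# The third record is a single breakend, so its partner is NA.
gr <- GRanges(
  seqnames = c("chr1", "chr12", "chr12"),
  ranges = IRanges(start = c(18992158, 84963533, 84350), width = 1),
  strand = c("+", "-", "-"),
  partner = c("bp1_h", "bp1_o", NA)
)
names(gr) <- c("bp1_o", "bp1_h", "sbe1")

# Partner lookup is a plain subset by name, skipping single breakends.
has_partner <- !is.na(gr$partner)
mate <- gr[gr$partner[has_partner]]

# Breakend-level annotation is a single symmetric call: no separate
# first-in-pair and second-in-pair code paths are needed.
gene <- GRanges("chr12", IRanges(84000, 85000))  # hypothetical interval
hits_per_breakend <- countOverlaps(gr, gene)
```

Because every breakend is an ordinary GRanges element, standard range operations apply to all of them uniformly, including the single breakend.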

Examples of use cases that are more naturally represented in a GRanges object that I have actually implemented in my variant projects include:

> It's easy to extract the 1st and 2nd genomic loci from this kind of object:

In the three years since I first developed this package, the only time I've ever wanted to extract a set of first or second genomic loci has been when having to flatten for export to a format. All the actual work has been symmetrical in that every breakend is treated equally with no intrinsic separation.

> I would have expected that calling breakpointgr2bedpe() on the breakpoint GRanges returned by bedpe2breakpointgr() would work:

Compatibility with rtracklayer's notation sounds like it would be a good idea. I'll look into replacing these functions with data conversions to/from the GRangeList of pairs used by rtracklayer, which means this package won't have to worry about BEDPE import/export. Interoperability with existing Bioconductor notations is definitely a desirable feature.

> Your findBreakpointOverlaps() and countBreakpointOverlaps() examples (in the "Exploring breakpoints" section) produce no overlaps so are not illustrative.

I've replaced this with a simplified example of the comparison logic I used for my SV caller benchmarking paper.

> Also please note that findBreakpointOverlaps() returns a Hits object, not a matrix.

We made this change due to feedback from @lawremi but didn't update all our documentation. Sorry about that.

Thanks for the review. I'm hoping to complete the changes addressing all points within the next couple of days.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

5f9ba7d Ver bump

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

27854d0 Refactored BEDPE import/export logic; expanded vig...
f3347c7 Merge branch 'master' of https://github.com/Papenf...
43cc014 Version increment

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.