Closed d-cameron closed 5 years ago
Hi @d-cameron
Thanks for submitting your package. We are taking a quick look at it and you will hear back from us soon.
The DESCRIPTION file for this package is:
Package: StructuralVariantAnnotation
Type: Package
Title: Variant annotations for structural variants
Version: 0.99.0
Date: 2019-04-05
Authors@R: c(
person("Daniel", "Cameron", email="daniel.l.cameron@gmail.com", role=c("aut", "cre"), comment=c(ORCID = "0000-0002-0951-7116")),
person("Ruining", "Dong", email="dong.rn@wehi.edu.au", role=c("aut"), comment=c(ORCID = "0000-0003-1433-0484")))
Description: StructuralVariantAnnotation contains useful helper
functions for dealing with structural variants in VCF format.
The packages contains functions for parsing VCFs from a number
of popular callers as well as functions for dealing with
breakpoints involving two separate genomic loci encoded as
GRanges objects.
License: GPL-3
Depends:
GenomicRanges,
rtracklayer,
VariantAnnotation,
BiocGenerics
Imports:
assertthat,
Biostrings,
stringr,
dplyr
Suggests:
BSgenome.Hsapiens.UCSC.hg19,
devtools,
testthat,
roxygen2,
covr,
knitr,
plyranges,
dplyr,
ggbio,
biovizBase,
circlize
RoxygenNote: 6.1.1
VignetteBuilder: knitr
biocViews: DataImport, Sequencing, Annotation, Genetics, VariantAnnotation
Add SSH keys to your GitHub account. SSH keys are used to control access to accepted Bioconductor packages. See these instructions to add SSH keys to your GitHub account.
A reviewer has been assigned to your package. Learn what to expect during the review process.
IMPORTANT: Please read the instructions for setting up a push hook on your repository, or further changes to your repository will NOT trigger a new build.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Hmm. What's the best way to proceed when my package builds without errors or warnings against release dependencies, but fails in devel due to dependencies? In this case, a test/example VCF parses in release, but in devel I get the error invalid class "VCFHeader" object: 'info(VCFHeader)' must be a 3 column DataFrame with names Number, Type, Description when I call VariantAnnotation::readVcf() in the vignette.
Hi @d-cameron ,
Thanks for this submission.
Please focus on BioC devel. There have been major changes to Rsamtools and VariantAnnotation between BioC 3.8 (release) and 3.9 (devel) that could explain these differences. Also let's ignore Windows for now: there is a known Windows-specific issue with the latest Rsamtools and VariantAnnotation in devel that is still under investigation.
Best, H.
The overlap functionality could greatly benefit from using features of the Hits objects, instead of reducing to data.frames and resorting to dplyr. For example, in findBreakpointOverlaps(), comments mention that duplicated(hits) takes too much memory. Is that just because hits is a data.frame? The duplicated,Hits() method is very efficient. I think intersect(x_hits, partner_hits) would be a better approach.
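A minimal sketch of the intersect() idea, assuming both sets of hits are already expressed in the same query/subject index space (remapping partner indices is left out). The x_hits and partner_hits objects below are toy data made up for illustration, not output from the package:

```r
library(S4Vectors)

# Toy hits between 3 query breakends and 4 subject breakends (made-up data).
# x_hits: pairs whose local breakends overlap.
x_hits <- Hits(from = c(1L, 1L, 2L, 3L), to = c(1L, 2L, 2L, 4L),
               nLnode = 3L, nRnode = 4L)

# partner_hits: pairs whose partner breakends overlap, assumed to be
# already remapped to the same indices as x_hits.
partner_hits <- Hits(from = c(1L, 2L, 3L), to = c(2L, 2L, 1L),
                     nLnode = 3L, nRnode = 4L)

# A breakpoint overlaps only if both of its breakends overlap,
# i.e. the (query, subject) pair appears in both Hits objects.
breakpoint_hits <- intersect(x_hits, partner_hits)
breakpoint_hits
```

Only the pairs (1, 2) and (2, 2) occur in both inputs, so those are the pairs kept. The result stays a Hits object, which is what keeps the return type in line with findOverlaps().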
Received a valid push; starting a build. Commits are:
b67d361 version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
269c5ce remove test file bugs d813bba version bump Merge branch 'master' of https://git...
Received a valid push; starting a build. Commits are:
9494b6b version bump 0.99.3
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS, skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
@lawremi that also had the advantage of the return type matching findOverlaps(), which is a nicer design, so I've refactored that code.
Attempting to do the obvious and concatenate the two Hits objects with S4Vectors::c() fails with the error:
Error in validObject(ans) :
invalid class “SortedByQueryHits” object: 'queryHits(x)' must be sorted
Is this something I should raise an issue for? I would have expected either a zipper merge returning a SortedByQueryHits, or the return class to be demoted to Hits since the concatenated sequences are not in order.
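One way to get the merge behaviour described above by hand is to rebuild the object from its raw indices and let the constructor re-sort. This is only a sketch on toy data: merge_sorted_hits() is a hypothetical helper, not an S4Vectors function, and it assumes both inputs share the same node counts:

```r
library(S4Vectors)

# Two toy SortedByQueryHits objects over the same 3 x 2 node space.
a <- Hits(from = c(1L, 3L), to = c(1L, 1L), nLnode = 3L, nRnode = 2L,
          sort.by.query = TRUE)
b <- Hits(from = 2L, to = 2L, nLnode = 3L, nRnode = 2L,
          sort.by.query = TRUE)

# Hypothetical helper: concatenate the raw indices, then let the Hits()
# constructor re-sort by query so the validity check is satisfied.
merge_sorted_hits <- function(x, y) {
  stopifnot(nLnode(x) == nLnode(y), nRnode(x) == nRnode(y))
  Hits(from = c(queryHits(x), queryHits(y)),
       to = c(subjectHits(x), subjectHits(y)),
       nLnode = nLnode(x), nRnode = nRnode(x),
       sort.by.query = TRUE)
}

merged <- merge_sorted_hits(a, b)
merged
```

This re-sorts rather than performing a true linear-time zipper merge, so it trades some efficiency for simplicity.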
Judging by performance in profvis, neither intersect() nor duplicated() has any optimisations for the sorted nature of the SortedByQueryHits returned by findOverlaps() (which is somewhat strange given that duplicated() requires the Hits to be sorted by queryHits), so I just went with intersect() as it's actually the operation that I'm attempting to perform.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS, skipped, TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS, skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
12bb692 version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
8b6571c add documentation of S4 methods
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
09ac575 version bump
Received a valid push; starting a build. Commits are:
0cb9627 remove warnings and notes
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
a60259f remove NOTE on undefined global variables
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR, TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
e8def42 version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
3e01a82 edit gitignore file
76d630d Merge branch 'master' of https://github.com/Papenf...
6cc7c28 recompile documents with value sections
6f8daf4 ver bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT, WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
10b9899 version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "WARNINGS". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
c6a1e8c ver bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Hi @lawremi @hpages, we have made changes to the examples, vignettes and test files to address the last timeout/warning issue. The package now passes on Windows, but the check time remains > 10 min on Linux and macOS. The check time on my local Mac is ~4 min. We will try to improve this further, but in the meantime, could we proceed with the review with the timeout warning present?
Thanks, Ruining
Received a valid push; starting a build. Commits are:
966b324 version bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "TIMEOUT". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
a94652c checktime test
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
Hi @d-cameron ,
Glad to see that StructuralVariantAnnotation finally builds and passes check with no timeout or error on all platforms. I took a first look at the package and have some feedback. See below.
Best, H.
I'm surprised by your choice to use a breakpoint GRanges (i.e. a GRanges object with a partner metadata column) to represent a set of structural variants with 2 breakends. The natural choice for representing compound genome features in Bioconductor (a.k.a. disjoint genome features in bedtools terminology) is to use a GRangesList object. For example, loading a BEDPE file into a GRangesList object can simply be done with:
library(StructuralVariantAnnotation)
bedpe.file <- system.file("extdata", "gridss.bedpe", package = "StructuralVariantAnnotation")
pairs <- rtracklayer::import(bedpe.file)
names(pairs) <- mcols(pairs)$name
mcols(pairs)$name <- NULL
mcols(first(pairs)) <- mcols(second(pairs)) <- mcols(pairs)
grl <- zipup(pairs)
grl
# GRangesList object of length 2:
# $gridss2o
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | score
# <Rle> <IRanges> <Rle> | <numeric>
# [1] chr1 18992158 + | 55
# [2] chr12 84963533 - | 55
#
# $gridss39o
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | score
# [1] chr12 84350 - | 627.96
# [2] chr12 4886681 + | 627.96
#
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
It's easy to extract the 1st and 2nd genomic loci from this kind of object:
unlist(heads(grl, n=1))
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | score
# <Rle> <IRanges> <Rle> | <numeric>
# gridss2o chr1 18992158 + | 55
# gridss39o chr12 84350 - | 627.96
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
unlist(tails(grl, n=1))
# GRanges object with 2 ranges and 1 metadata column:
# seqnames ranges strand | score
# <Rle> <IRanges> <Rle> | <numeric>
# gridss2o chr12 84963533 - | 55
# gridss39o chr12 4886681 + | 627.96
# -------
# seqinfo: 2 sequences from an unspecified genome; no seqlengths
Generally speaking, it's easier to work with this kind of object than with a breakpoint GRanges. I would strongly suggest that you consider using this instead of breakpoint GRanges objects for representing structural variants with 2 breakends.
I would have expected that calling breakpointgr2bedpe() on the breakpoint GRanges returned by bedpe2breakpointgr() would work:
breakpointgr2bedpe(bedpe2breakpointgr(bedpe.file))
# Error in data.frame(chrom1 = GenomeInfoDb::seqnames(gr), start1 = start(gr) - :
# arguments imply differing number of rows: 4, 0
A common practice is to document opposite functions like bedpe2breakpointgr() and breakpointgr2bedpe() in the same man page. If for whatever reason you prefer to keep them in separate man pages, at least the 2 man pages should be cross-linked via a \seealso section in each page.
The code chunk in the Installation section shows how to install the package, not how to load it. However the text above the code chunk says:
The StructuralVariationAnnotation package can be loaded from Bioconductor as follows:
So it would need to be modified to say something like:
The StructuralVariationAnnotation package can be installed from Bioconductor as follows:
Instead of:
Details of `VCF` objects can be found by `browseVignettes("VariantAnnotation")`.
please consider saying something like:
More information about VCF objects can be found by consulting the vignettes in
the VariantAnnotation package (with `browseVignettes("VariantAnnotation")`).
Maybe it would be worth pointing the reader to some fields of interest (e.g. MATEID, SVTYPE, etc.) after displaying the vcf object. Also maybe comment on some of these fields.
Your findBreakpointOverlaps() and countBreakpointOverlaps() examples (in the "Exploring breakpoints" section) produce no overlaps so are not illustrative. Also please note that findBreakpointOverlaps() returns a Hits object, not a matrix.
Please remove the parentheses in "Alternatively, plotting package such as ggbio() provides..." (ggbio is a package, not a function).
This code:
region.gr <- gr[2:3,] %>% dplyr::mutate(end=end+5) %>% dplyr::mutate(start=start+5)
is unnecessarily complicated and hard to read. The same can be achieved with just:
region.gr <- shift(gr[2:3,], 5)
In addition to being more readable, the shift() solution is also more efficient (36x faster on my laptop):
library(microbenchmark)
gr0 <- gr[2:3]
microbenchmark(gr0 %>% dplyr::mutate(end=end+5) %>% dplyr::mutate(start=start+5), shift(gr0, 5))
# Unit: milliseconds
# expr
# gr0 %>% dplyr::mutate(end = end + 5) %>% dplyr::mutate(start = start + 5)
# shift(gr0, 5)
# min lq mean median uq max neval
# 60.207715 62.678722 64.550741 64.024463 64.970522 92.749330 100
# 1.435635 1.739248 1.761021 1.750163 1.779301 1.963865 100
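As a quick sanity check that shift() really does the same thing as moving both coordinates, here is a small sketch on made-up ranges (gr0 below is toy data, not the vignette's gr), using only GenomicRanges:

```r
library(GenomicRanges)

# Toy ranges with made-up coordinates.
gr0 <- GRanges("chr1", IRanges(start = c(10, 20), width = 3))

# One call moves start and end together.
shifted <- shift(gr0, 5)

# Manual equivalent: move the end first so that start never exceeds end,
# mirroring the order of the two mutate() calls.
manual <- gr0
end(manual) <- end(manual) + 5L
start(manual) <- start(manual) + 5L

identical(start(shifted), start(manual))
identical(end(shifted), end(manual))
```

The ordering constraint in the manual version is one more reason to prefer shift(): updating start before end would produce an invalid negative-width range for any range narrower than the offset.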
Similar problem with this code:
gr.circos <- gr %>% mutate(to.gr=GRanges(seqnames(partner(gr)),ranges(partner(gr))))
This is better done with:
gr.circos <- gr
mcols(gr.circos)$to.gr <- granges(partner(gr))
Using granges(partner(gr)) instead of GRanges(seqnames(partner(gr)),ranges(partner(gr))) also has the benefit of preserving the strand.
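The strand difference can be seen on a single made-up range (toy data, independent of the package):

```r
library(GenomicRanges)

# One toy range on the minus strand, with a metadata column.
gr <- GRanges("chr1", IRanges(100, 110), strand = "-", score = 1)

# granges() drops the metadata columns but keeps the strand.
kept <- granges(gr)

# Rebuilding from seqnames and ranges alone loses the strand,
# which then defaults to "*".
rebuilt <- GRanges(seqnames(gr), ranges(gr))

as.character(strand(kept))
as.character(strand(rebuilt))
```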
Spelling:
"from a number of popular caller" (callers)
"We can then plot the breakpoints agains referece genomes" (against, reference)
Received a valid push; starting a build. Commits are:
cb584be Ver bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
Congratulations! The package built without errors or warnings on all platforms.
Please see the build report for more details.
- I'm surprised by your choice to use a breakpoint GRanges (i.e. a GRanges object with a partner metadata column) to represent a set of structural variants with 2 breakends. The natural choice for representing compound genome features in Bioconductor (a.k.a. disjoint genome features in bedtools terminology) is to use a GRangesList object.
This is actually a core reason why I'm submitting this package: all four existing pair-of-genomic-coordinate data structures in Bioconductor have two parallel first-in-pair and second-in-pair structures of the same type (GRanges or something convertible to this). When doing SV analysis, I found these formats quite painful to deal with, as they result in a whole lot of code duplication: the vast majority of SV annotation that one actually uses is breakend-level annotation.
Fundamentally, there is no intrinsic 'first' and 'second' breakend for a breakpoint: both have equal weighting, and the choice of which breakend gets allocated to the first is arbitrary. I believe that arbitrarily grouping breakends into two different subsets is the incorrect abstraction. Two variants that disrupt TP53 with identical breakend position and orientation should not have to be treated differently just because the partner of one of them occurs 'before' TP53 and the other occurs 'after'.
The GRanges approach is the closest data structure match to an SV record in VCF BND notation. I've advocated strongly for the BND format in the GA4GH file formats working group (of which I am a member), as I've found that the 'traditional' VCF symbolic allele SVTYPEs (DEL, INS, DUP, ...) are only useful for extremely simple analysis, and as soon as you start delving into more complicated analysis, you end up converting everything to breakend-focused BND-like notation.
The second reason I have chosen this format is that it more naturally supports single breakend variants (as defined in Section 5.4.9 of the VCF specification). The data structure chosen for this package allows both breakpoint variants and single breakend variants to be combined with minimal hacks: an NA partner is way easier to deal with than missing elements in a second-in-pair data structure.
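To make the layout concrete, here is a minimal sketch of a partner-linked GRanges holding one breakpoint and one single breakend. The coordinates and names are made up, and the lookup is done with plain name indexing rather than any function from this package:

```r
library(GenomicRanges)

# One breakpoint (two breakends that name each other) plus one single
# breakend whose partner is NA. All coordinates are made up.
gr <- GRanges(
  seqnames = c("chr1", "chr12", "chr17"),
  ranges = IRanges(start = c(18992158, 84963533, 7500000), width = 1),
  strand = c("+", "-", "+"),
  partner = c("bp1_h", "bp1_o", NA_character_)
)
names(gr) <- c("bp1_o", "bp1_h", "be1")

# Every breakend is a row of the same object, so breakend-level
# annotation code only has to be written once.
is_breakpoint <- !is.na(gr$partner)

# Partner lookup is a name-based subset; single breakends are skipped.
partners <- gr[gr$partner[is_breakpoint]]
partners
```

Because the partner is just a metadata column, a single breakend needs no placeholder range, only a missing value.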
Examples of use cases that are more naturally represented in a GRanges object that I have actually implemented in my variant projects include:
It's easy to extract the 1st and 2nd genomic loci from this kind of object:
In the three years since I first developed this package, the only time I've ever wanted to extract a set of first or second genomic loci has been when having to flatten for export to a format. All the actual work has been symmetrical in that every breakend is treated equally with no intrinsic separation.
I would have expected that calling breakpointgr2bedpe() on the breakpoint GRanges returned by bedpe2breakpointgr() would work:
Compatibility with rtracklayer's notation sounds like it would be a good idea. I'll look into replacing these functions with data conversions to/from the GRangesList of pairs used by rtracklayer, which means this package won't have to worry about BEDPE import/export. Interoperability with existing Bioconductor notations is definitely a desirable feature.
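A rough sketch of what such a conversion could look like, going from a Pairs object (the class used for paired ranges in the review example) to a partner-linked GRanges. The helper pairs_to_breakpoint_gr() and the "_1"/"_2" naming scheme are hypothetical, chosen here for illustration, and the input is rebuilt by hand from the first record shown in the review rather than imported from a file:

```r
library(GenomicRanges)  # also attaches S4Vectors, which provides Pairs

# Hand-built stand-in for an imported BEDPE record.
first_gr <- GRanges("chr1", IRanges(18992158, width = 1), strand = "+")
second_gr <- GRanges("chr12", IRanges(84963533, width = 1), strand = "-")
pairs <- Pairs(first_gr, second_gr)
mcols(pairs)$name <- "gridss2o"

# Hypothetical helper: flatten the pairs into one GRanges in which each
# breakend records the name of its partner.
pairs_to_breakpoint_gr <- function(pairs) {
  a <- first(pairs)
  b <- second(pairs)
  ids <- mcols(pairs)$name
  names(a) <- paste0(ids, "_1")
  names(b) <- paste0(ids, "_2")
  a$partner <- names(b)
  b$partner <- names(a)
  c(a, b)
}

bp <- pairs_to_breakpoint_gr(pairs)
bp
```

The reverse direction would split the object back into two parallel halves, which is where the arbitrary first/second assignment has to be made.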
Your findBreakpointOverlaps() and countBreakpointOverlaps() examples (in the "Exploring breakpoints" section) produce no overlaps so are not illustrative.
I've replaced this with a simplified example of the comparison logic I used for my SV caller benchmarking paper.
Also please note that findBreakpointOverlaps() returns a Hits object, not a matrix.
We made this change due to feedback from @lawremi but didn't update all our documentation. Sorry about that.
Thanks for the review. I'm hoping to complete the changes addressing all points within the next couple of days.
Received a valid push; starting a build. Commits are:
5f9ba7d Ver bump
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "skipped, ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Received a valid push; starting a build. Commits are:
27854d0 Refactored BEDPE import/export logic; expanded vig...
f3347c7 Merge branch 'master' of https://github.com/Papenf...
43cc014 Version increment
Dear Package contributor,
This is the automated single package builder at bioconductor.org.
Your package has been built on Linux, Mac, and Windows.
On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.
Please see the build report for more details.
Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor
Confirm the following by editing each check box to '[x]'
[x] I understand that by submitting my package to Bioconductor, the package source and all review commentary are visible to the general public.
[x] I have read the Bioconductor Package Submission instructions. My package is consistent with the Bioconductor Package Guidelines.
[x] I understand that a minimum requirement for package acceptance is to pass R CMD check and R CMD BiocCheck with no ERROR or WARNINGS. Passing these checks does not result in automatic acceptance. The package will then undergo a formal review and recommendations for acceptance regarding other Bioconductor standards will be addressed.
[x] My package addresses statistical or bioinformatic issues related to the analysis and comprehension of high throughput genomic data.
[x] I am committed to the long-term maintenance of my package. This includes monitoring the support site for issues that users may have, subscribing to the bioc-devel mailing list to stay aware of developments in the Bioconductor community, responding promptly to requests for updates from the Core team in response to changes in R or underlying software.
I am familiar with the essential aspects of Bioconductor software management, including:
For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.