Bioconductor / Contributions

Contribute Packages to Bioconductor
135 stars 33 forks source link

StructuralVariantAnnotation #1079

Closed d-cameron closed 5 years ago

d-cameron commented 5 years ago

Update the following URL to point to the GitHub repository of the package you wish to submit to Bioconductor

Confirm the following by editing each check box to '[x]'

I am familiar with the essential aspects of Bioconductor software management, including:

For help with submitting your package, please subscribe and post questions to the bioc-devel mailing list.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

ad8fdef Fixed typo in sanity check; Updated README

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

hpages commented 5 years ago

Hi @d-cameron ,

Thanks for the feedback. Before you discard my suggestion of using a GRangesList object, I want to make sure that there is no misunderstanding.

About this: "all four existing pair-of-genomic-coordinate data structures in BioConductor have two parallel first-in-pair and second-in-pair structures of the same type (GRanges or something convertible to this)."

I'm not proposing to use a GRangesList object with two parallel first-in-pair and second-in-pair GRanges. What I'm proposing is to use a GRangesList object with one list element per breakpoint. In other words it's not about forming 2 groups (1 with the first breakends and 1 with the second breakends), it's about forming 1 group per breakpoint. See the grl object from my initial feedback. It has 2 list elements, 1 for the gridss2o breakpoint, and 1 for the gridss39o breakpoint. Each list element is a small GRanges object containing the 2 breakends. So the breakends are not arbitrarily grouped into two different subsets. Furthermore, the number of breakends for each breakpoint is not enforced and could be anything. This means that you can combine both breakpoint variants and single breakend variants in the same object. You could imagine storing breakpoint variants with 3 or more breakends if such thing make sense.

About your concern that "the vast majority of SV annotation that one actually uses is breakend-level annotations": Please note that GRangesList objects support top-level annotations (mcols(grl)) as well as inner annotations (mcols(grl, level="within")). This allows one to annotate at the breakpoint and/or breakend level. For example the grl object from my initial feedback is annotated at the breakend level (score inner metadata column -- even though it could be that the score should be considered a breakpoint attribute rather than a breakend attribute so would be better stored at the breakpoint level). You don't have the ability to easily annotate at the 2 levels (breakpoint and breakend) with a flat breakpoint GRanges object.

Another major advantage of the GRangesList approach is that it makes the grouping more robust. With the flat breakpoint GRanges approach it's easy for a breakend to loose its partner e.g. when the user subsets the object or accidentally removes its names.

FWIW the GRangesList container was specifically designed for representing compound genome features and was originally introduced to support the "exons grouped by gene" and "exons grouped by transcript" use cases (as returned by GenomicFeatures::exonsBy(txdb, by="gene") and GenomicFeatures::exonsBy(txdb, by="tx"), respectively). It has proven to be very convenient for this.

Hopefully this clarifies my proposal and addresses your concerns.

Best, H.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

d9edfb2 Merged pairs2breakpointgr and breakpointgr2pairs d...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

d-cameron commented 5 years ago

GRangesList container was specifically designed for representing compound genome features and was originally introduced to support the "exons grouped by gene" You could imagine storing breakpoint variants with 3 or more breakends if such thing make sense.

That actually does make quite a bit of sense as the current design does not handle breakpoints with multiple partners which are in the VCF specs (even if 0 tools use them)

The problem is that the GRangeList approach also wouldn't handle such events. If the list item has A, B, and C, there's no encoding which breakends correspond with which partners. Is it {AB, AC}, {AB, AC, BC}, or {AC, BC}? An data structure designed for exon lists doesn't quite work for this use case. There is a fundamental pairing of breakends within breakpoints and I've tried to take the middle ground of not splitting breakends into arbitrary first and second sets, but also keeping the pairing information (relatively) tightly coupled.

The other thing that is really annoying is that findOverlaps() ordinals are the list ordinals, not the row ordinals so you can't report which breakend corresponds with which breakend which breaks all the SV traversal use cases (or requires findBreakpointOverlaps() to report fundamentally different things than findOverlaps()).

Another major advantage of the GRangesList approach is that it makes the grouping more robust.

The cost of GRanges notation does come with some brittleness inherit in the design. I've added in more aggressive checks and explicit error message so users know exactly why their breakpoint GRanges is breaking. Data inconsistences/unsupported notations are cleaned up/filtered on data loading but it's are not strongly at the time of data manipulation, merely when interacting with this package.

With the flat breakpoint GRanges approach it's easy for a breakend to loose its partner e.g. when the user subsets the object

This is either a design flaw, or a feature depending on your use case. Filtering events to a subset of the genome is the typical operation where this occurs. Users will want to either retain or remove events that only have one breakend falling within the region of interest. For both GRanges and GRangesList designs, a simple filtering will result in the breakend that does not fall within the ROI to be remove but the other to be retained. The GRanges$partner approach causes subsequence breakpoint GRanges operations to throw a (now more friendly) error message. A GRangesList approach will silently convert the breakpoint into a single breakend which I think is definitely the wrong approach - we definitely should not change the type of event when filtering. In this situation I think a hard failure is better than a silent data corruption.

accidentally removes its names

Names are absolutely essential to VCF BND notation so I suspect my experience with writing an SV caller than reports in BND notation has baised my design quite heavily towards a name-centric design.

Before you discard my suggestion of using a GRangesList object You don't have the ability to easily annotate at the 2 levels (breakpoint and breakend) with a flat breakpoint GRanges object.

I agree that being able to annotate at both levels is indeed a significant feature.

My current plan of action is to try and get the (existing GRanges-centric) package to a state where (apart from your concerns with the core data structure), you're happy to pass review and accept the package by the deadline. I can see value in adding support for other notations in future releases and there's no reason why findBreakpointOverlaps() can't support Pairs and GRangesList notations in addition to the current one. In theory, mixed data structures could be supported and findBreakpointOverlaps(grl, pairs) would return the intuitive result and not an error.

Does that sound like a reasonable path forward to you?

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

846001a Added empty Description fields to Sudmunt VCF head...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

lawremi commented 5 years ago

If this problem does require a new data structure, ideally it would be a formal class and serve as infrastructure for other packages operating on structural variants.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

17b2158 Version increment

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

On one or more platforms, the build results were: "ERROR". This may mean there is a problem with the package that you need to fix. Or it may mean that there is a problem with the build system itself.

Please see the build report for more details.

d-cameron commented 5 years ago

@lawremi I experimented with a formal class GRanges early on but decided against it since there were so many use cases where I just wanted it act exactly like a GRanges object (including all the GRanges manipulation methods that return a GRanges object) that I just went with using an actual GRanges object and adding optional pairing associations on top of it.

I valued playing nicely with existing ecosystem over the rough edges of building on top of GRanges. There's nothing stopping us creating a formal GRanges-like (derived?) class if this notational style gains traction. Apart from rtracklayer BEDPE loading, there's not really much of a precedent in the BioConductor ecosystem for handles arbitrary SVs. There's a some packages (e.g instansv, chimeraviz) that handle the trivially simple events (DEL/INS/DUP/INV) that can be represented as chr:start-stop, or do some visualisation, but unlike SVs/indels, I'm not aware of any packages that can do the analysis you'd want to be able to do for arbitrary SVs (e.g. gene disruption/fusion detection, breakpoint and event classification).

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

077ea3a Rebuilt roxygen docs Added .travis.yml to build ig...

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

d-cameron commented 5 years ago

@hpages build now passing with your changes (hopefully) addressed.

lawremi commented 5 years ago

Extending GRanges a la VRanges (call it SVRanges?) might be a good idea.

d-cameron commented 5 years ago

@lawremi that would require adding hooks to everywhere that modifies a GRanges class so they also support the new class, correct? Wouldn't it be easier to modify GRanges itself to optionally track row pairings? Such as class would be useful in other contexts as well (e.g. InteractionSet).

lawremi commented 5 years ago

You can look at VRanges but basically it's not so bad, because GRanges is designed internally to be extensible. Many functions update the object without constructing a new GRanges instance.

lawremi commented 5 years ago

Actually maybe SVRanges should extend VRanges.

d-cameron commented 5 years ago

Actually maybe SVRanges should extend VRanges.

Unfortunately, most of the VRanges columns are not appropriate for arbitrary SVs so it's not a good fit. Even something as simple as equivalent of refCount and altCount are actually much more complicated for SVs (e.g. read pair support for smallish variants with either be wrong, or require a model to estimate the likelihood of a read pair coming from the ref or alt as the expected fragment size distributions overlap for small events).

Many functions update the object without constructing a new GRanges instance.

Thinking it over, extending from GRanges (or VRanges) is problematic as it violates Liskov Substitution Principle - the proposed SVRanges class would either restrict the permitted operation (thus making operations that were valid on a GRanges not valid on the derived class), or if you allow the operations, be effectively the same as the GRange implementation that I've implemented (raising errors when the GRanges violates the breakpoint pairing constraints).

hpages commented 5 years ago

@d-cameron

Yes that sounds like a reasonable path forward to me.

Some additional clarifications:

H.

lawremi commented 5 years ago

In many cases where the constraints would be broken by an exact endomorphism, one can just return the more general type. For example, intersect,GRanges,GRanges() always returns a GRanges instance, even if it receives a VRanges.

Principles like the LSP are good guides, but the ultimate test is whether the software is more useful or not. VRanges has proven empirically useful.

d-cameron commented 5 years ago

GRangesList was specifically designed for representing compound genome features With the GRangesList approach, in the situation of a list element with A, B, and C, couldn't the encoding of which breakends correspond to which partners be achieved via a breakend-level metadata column?

Which is essentially a List version for the partner implementation, which would have the same drawbacks that I have now. If we restrict ourselves to breakpoints and single breakend there's no problem and GRangesList would work well. I don't currently have any plan to support A,B,C style variants as they're only used by a) multi-mapping aware callers that don't have widespread acceptance, or b) my variant caller but for the purpose of this package, they're treated as single breakend.

Yes that sounds like a reasonable path forward to me.

We now have proposals for the package to also accept SVRanges, GRangeList, and Pairs objects. I'm not sure which (if any) need to be implemented now. Are there any blockers/required review changes that are preventing the acceptance of the package in its current form?

Thanks to both of you for the feedback and discussions.

hpages commented 5 years ago

@d-cameron If, as you said earlier, most of the VRanges columns are not appropriate for arbitrary SVs, extending it might not be suitable for your use case.

I don't think anybody suggested that the package should also accept Pairs objects.

I see that the last check by the automated single package builder was successful on all platforms. I'll take another look at the package later today and will let you know.

BTW you closed this issue. Was it intentional?

d-cameron commented 5 years ago

BTW you closed this issue. Was it intentional?

It was not.

hpages commented 5 years ago

Hi @d-cameron

Thanks for the changes.

There must have been a misunderstanding about item 2. I was only suggesting that breakpointgr2bedpe(bedpe2breakpointgr(bedpe.file)) should work. This is a reasonable expectation for functions that have symmetric names, which suggests that they perform opposite operations. I was not suggesting that you get rid of bedpe2breakpointgr() and that you introduce the breakpointgr2pairs() / pairs2breakpointgr() combo. Exposing the intermediate Pairs representation has little value. Plus now you've lost the convenience of being able to go from bedpe.file to the breakpoint GRanges object with a single function call.

A few more things:

  1. Looks like you're using tidyverse in the vignette so the package needs to be added to the Suggests field. Note that you only need to load tidyverse once in the vignette. Furthermore AFAICS you only seem to be using the ggplot2 and dplyr packages from the tidyverse. For the sake of minimizing package dependencies, it would be preferable to Suggest these 2 packages only instead of tidyverse. In the vignette it's better to load them right before you use them (i.e. right before generating the Precision-Recall and ROC curves).

  2. Load the VariantAnnotation package before using it in the vignette, not after.

  3. Use library() (or require()) consistently in the vignette, instead of a mix of both.

  4. This code in the vignette:

    breakpointgr <- breakpointRanges(vcf)
    singleBreakendgr <- breakendRanges(vcf)
    gr <- c(breakpointgr, singleBreakendgr)

    concatenates breakpointgr with empty singleBreakendgr (both GRanges object). If that's intended, it deserves a comment. Ideally, an example where this concatenation is not a no-op would be more illustrative and would have more value. In particular it would show what happens to the partner metadata column and how the breakpointgr + singleBreakendgr blend is handled downstream.

  5. "Once we know which calls match the truth set, it is relatively straight-forward to generate Precision-Recall and ROC curves for each caller."

    ggplot(as.data.frame(svgr) %>%
      dplyr::select(QUAL, caller, truth_matches) %>%
      dplyr::group_by(caller, QUAL) %>%
      dplyr::summarise(
        calls=n(),
        tp=sum(truth_matches > 0)) %>%
      dplyr::group_by(caller) %>%
      dplyr::arrange(dplyr::desc(QUAL)) %>%
      dplyr::mutate(
        cum_tp=cumsum(tp),
        cum_n=cumsum(calls),
        cum_fp=cum_n - cum_tp,
        Precision=cum_tp / cum_n,
        Recall=cum_tp/length(truth_svgr))) +
      aes(x=Recall, y=Precision, colour=caller) +
      geom_point() +
      geom_line() +
      labs(title="NA12878 chr22 CREST and HYDRA\nSudmunt 2015 truth set")

    That doesn't look straight-forward to me.

  6. Please remove the parentheses in ggplot2() (ggplot2 is a package, not a function).

H.

d-cameron commented 5 years ago

There must have been a misunderstanding about item 2. Exposing the intermediate Pairs representation has little value. Plus now you've lost the convenience of being able to go from bedpe.file to the breakpoint GRanges object with a single function call.

Replacing the direct bedpe parsing with Pairs makes BEDPE parsing consistent with VCF parsing. Read/write of VCFs is performed entirely by VariantAnnotation, and with the Pairs functions, read/write of BEDPEs is performed entirely by rtracklayer. I feel that this is a cleaner design as the package is now consistent with how IO is handled and I'm no longer providing a parallel implementation of functionality already provided by rtracklayer.

bioc-issue-bot commented 5 years ago

Received a valid push; starting a build. Commits are:

69343dd Round 2 of review feedback.

d-cameron commented 5 years ago

12, 13, 14, 16, 17

Updated

15

I've added a real world (somatic COLO829) call set to the vignette and included a section on how to handle mixed single breakend/breakpoint GRanges. An added bonus is that the circos plots actually have interesting events in them now.

d-cameron commented 5 years ago

@hpages round two of feedback has been addressed

bioc-issue-bot commented 5 years ago

Dear Package contributor,

This is the automated single package builder at bioconductor.org.

Your package has been built on Linux, Mac, and Windows.

Congratulations! The package built without errors or warnings on all platforms.

Please see the build report for more details.

hpages commented 5 years ago

Thanks Daniel

bioc-issue-bot commented 5 years ago

Your package has been accepted. It will be added to the Bioconductor Git repository and nightly builds. Additional information will be posed to this issue in the next several days.

Thank you for contributing to Bioconductor!

mtmorgan commented 5 years ago

The master branch of your GitHub repository has been added to Bioconductor's git repository.

To use the git.bioconductor.org repository, we need an 'ssh' key to associate with your github user name. If your GitHub account already has ssh public keys (https://github.com/d-cameron.keys is not empty), then no further steps are required. Otherwise, do the following:

  1. Add an SSH key to your github account
  2. Submit your SSH key to Bioconductor

See further instructions at

https://bioconductor.org/developers/how-to/git/

for working with this repository. See especially

https://bioconductor.org/developers/how-to/git/new-package-workflow/ https://bioconductor.org/developers/how-to/git/sync-existing-repositories/

to keep your GitHub and Bioconductor repositories in sync.

Your package will be included in the next nigthly 'devel' build (check-out from git at about 6 pm Eastern; build completion around 2pm Eastern the next day) at

https://bioconductor.org/checkResults/

(Builds sometimes fail, so ensure that the date stamps on the main landing page are consistent with the addition of your package). Once the package builds successfully, you package will be available for download in the 'Devel' version of Bioconductor using BiocManager::install("StructuralVariantAnnotation"). The package 'landing page' will be created at

https://bioconductor.org/packages/StructuralVariantAnnotation

If you have any questions, please contact the bioc-devel mailing list (https://stat.ethz.ch/mailman/listinfo/bioc-devel); this issue will not be monitored further.