bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Resources for SVs of normal populations in order to prioritize tumor-only variant calls #963

Open schelhorn opened 9 years ago

schelhorn commented 9 years ago

Edit: despite the title, this link list now contains all kinds of reference material for benchmarking and normal-filtering SV (including CNV) calls, both at the population level and at the single-sample (mostly cell line) level.

For tumor-only calling of SVs, it would be useful to be able to annotate putative SV calls with similar SV calls (with respect to structure, copy number, and content) from public databases of normal controls in order to flag potential false positives. This is analogous to the ExAC and 1kG reference panels for micro variants. I suggest that we use this issue to collect such resources and package them into bcbio once we feel that is warranted. I offer to start with five resources known to me - I am sure there are many more:

chapmanb commented 9 years ago

Sven-Eric; This is a brilliant idea, thank you. Adding an annotation step to the final VCFs with this information is a great next step. In addition to the resources you mention, here are some other things we currently do inside bcbio:

Thanks again for getting this conversation started.

schelhorn commented 9 years ago

Great, thanks for looking into my suggestions. Let's see when we have time for coming back to this. Perhaps we can use @brentp's vcfanno for the normal annotation of the SV result files once the metasv integration has been completed.
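To make the normal-annotation idea concrete, here is a minimal sketch of flagging calls that resemble an SV seen in normal controls. It does not use vcfanno or any bcbio code; the `SV` record, the `flag_normal_matches` helper, and the 50% reciprocal-overlap threshold are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple, Tuple


class SV(NamedTuple):
    """Hypothetical minimal SV record (not a bcbio or vcfanno type)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # exclusive
    svtype: str  # e.g. "DEL", "DUP"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def flag_normal_matches(
    calls: List[SV], normals: List[SV], min_ro: float = 0.5
) -> List[Tuple[SV, bool]]:
    """Pair each call with True if a same-type normal SV overlaps it enough.

    The 0.5 default is an assumed threshold, not a validated cutoff.
    """
    flagged = []
    for call in calls:
        seen = any(
            call.svtype == normal.svtype
            and reciprocal_overlap(call, normal) >= min_ro
            for normal in normals
        )
        flagged.append((call, seen))
    return flagged
```

The linear scan over `normals` is fine for a toy example; for a real population database an interval index (or vcfanno itself) would likely be the better choice.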

schelhorn commented 9 years ago

Another batch of resources that are of relevance in this regard are known or putative cancer SVs that should be subtracted from normal SV databases since they may be present in normal individuals but predispose to cancer. A list of these would be:

schelhorn commented 8 years ago

And a recent preprint on CNVs in low mappability regions: http://biorxiv.org/content/early/2015/12/11/034165

schelhorn commented 8 years ago

Another set of normal SVs: http://www.nature.com/articles/ncomms12989

schelhorn commented 7 years ago

Here is a set of crowdsourced gold-standard CNVs on GiaB data: http://biorxiv.org/content/early/2016/12/13/093526

ohofmann commented 7 years ago

Michael Talkowski just published their list of (complex) germline SVs: https://paperpile.com/shared/472Jf5

schelhorn commented 7 years ago

And another large study: http://biorxiv.org/content/early/2017/03/22/119461

schelhorn commented 7 years ago

ExAC CNVs: http://blog.goldenhelix.com/grudy/exac-cnvs-the-first-large-scale-public-exome-cnv-variant-set/

schelhorn commented 7 years ago

Some additional references on validating SV calls:

ohofmann commented 7 years ago

Pinging @pdiakumis to this thread (and https://github.com/chapmanb/bcbio-nextgen/issues/1592#event-1320502818) for a list of SV references. We are currently testing Manta/BPI and GRIDSS on a number of samples, including a partially validated list of SV events from COLO829; happy to compare notes, particularly for any somatic SVs.

We are also looking into comparing Manta and GRIDSS calls on a number of 10X tumor/normal WGS samples (with the 10X SV caller as the reference).
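A comparison of a caller against a reference call set could be sketched roughly as follows. This is only an illustration: the `(chrom, pos, svtype)` breakpoint tuples, the `compare_to_reference` helper, and the 100 bp matching tolerance are assumptions, not the method used for the comparison described above.

```python
from typing import List, Tuple

# Hypothetical breakpoint representation: (chrom, pos, svtype)
Breakpoint = Tuple[str, int, str]


def matches(a: Breakpoint, b: Breakpoint, tol: int = 100) -> bool:
    """Same chromosome and SV type, positions within `tol` bp (assumed)."""
    return a[0] == b[0] and a[2] == b[2] and abs(a[1] - b[1]) <= tol


def compare_to_reference(
    calls: List[Breakpoint], reference: List[Breakpoint], tol: int = 100
) -> Tuple[float, float]:
    """Return (precision, recall) of `calls` against `reference`."""
    # Calls supported by at least one reference event
    supported = sum(
        1 for c in calls if any(matches(c, r, tol) for r in reference)
    )
    # Reference events recovered by at least one call
    recovered = sum(
        1 for r in reference if any(matches(c, r, tol) for c in calls)
    )
    precision = supported / len(calls) if calls else 0.0
    recall = recovered / len(reference) if reference else 0.0
    return precision, recall
```

Running this once per caller against the same reference gives comparable precision and recall numbers, with the caveat that the result depends heavily on the chosen tolerance.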

schelhorn commented 6 years ago

SVs determined from linked read sequencing by 10x Genomics, referencing samples with known SVs of different classes: https://www.biorxiv.org/content/early/2017/12/08/231662

(...) a set of 23 samples with known balanced, unbalanced or complex SVs from either 1) the GetRm CNV Panel (unbalanced events) or 2) the Coriell general Cell Repository (balanced events). These cell lines have multiple, orthogonal assays confirming the presence of their described structural variants

And another one, for deletions: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1175/4647672

Additionally we used the short read dataset SRX652547 generated from the CHM1 cell line, which is derived from a haploid genome. It contains Illumina paired-end reads of length 101 bp sequenced at a 41-fold coverage. Interestingly, a list of variants was compiled for the same cell line by sequencing conjointly single-molecule long reads with Pacific Biosciences instrument (PacBio) at a 54-fold coverage (24). We used this list of variants to evaluate the quality of the predicted deletions.

An ensemble method with machine learning bolted on: https://www.biorxiv.org/content/biorxiv/early/2017/11/17/113498.full.pdf

Our training set included high coverage (48x, N=27) and low coverage (7x, N=2,494) WGS from the 1KGP. Training features were collected from a gold-standard SV call set on the above individuals with an estimated false discovery rate (FDR) of 1-4% (Sudmant, et al., 2015), totaling 297,131 genotypes from 11,747 unique loci.

schelhorn commented 6 years ago

Another gold-standard SV set:

http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation/

schelhorn commented 6 years ago

A new aggregated set by NCBI's dbVar - seems to be the best one so far: https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants

schelhorn commented 6 years ago

Community effort for annotating SVs for GIAB: http://www.svcurator.com/

schelhorn commented 6 years ago

Breast cancer cell line (SK-BR-3) sequenced and assembled using PacBio long-read tech, with detailed SVs: http://m.genome.cshlp.org/content/early/2018/06/28/gr.231100.117.abstract

roryk commented 5 years ago

The gnomAD-SV project has a list too: https://www.biorxiv.org/content/10.1101/578674v1

naumenko-sa commented 4 years ago

Finally, VEP has an annotator plugin for SVs that annotates with gnomAD's SVs: http://www.ensembl.info/2020/03/27/cool-stuff-the-ensembl-vep-can-do-annotating-structural-variants/

roryk commented 4 years ago

Awesome!! I wonder if Pablo wants to add this to snpEff. :)