Resources for SVs of normal populations in order to prioritize tumor-only variant calls

schelhorn commented 8 years ago

Edit: despite the title, this link list now contains all kinds of reference material for benchmarking/normal-filtering SV (including CNV) calls on a population level as well as on a single-sample (mostly cell line) level.

For tumor-only calling of SVs, it would be useful to be able to annotate putative SV calls with similar (wrt. structure, copy number, and content) SV calls from public databases of normal controls in order to flag potential false positives. This is analogous to the ExAC and 1kG reference panels for micro variants. I suggest that we use this issue to collect such resources and package them into bcbio once we feel that is warranted. I offer to start with a few resources known to me - I am sure there are many more:

Database of Genomic Variants (DGV). The objective of the Database of Genomic Variants is to provide a comprehensive summary of structural variation in the human genome. We define structural variation as genomic alterations that involve segments of DNA that are larger than 50bp. The content of the database is only representing structural variation identified in healthy control samples. Also contains 1kG variants as well as A Copy Number Variation Map of the Human Genome (Nature Reviews Genetics, 2015) data.
Database of Genomic Variants Archive is a repository that provides archiving, accessioning and distribution of publicly available genomic structural variants, in all species. It exchanges data with both dbVar and DGV. It may also contain non-normal patient variants.
The DECIPHER database contains CNVs for patients suffering from rare diseases. These may be of less interest to cancer research due to the low population frequencies of these variants.
The ENCODE blacklisted regions that show abnormal read count depths and were excluded from functional analyses in ENCODE. These regions will likely confuse read-depth based SV methods.
CopywriteR blacklisted CNV regions. These are generated as an R data object by calling the CopywriteR::preCopywriteR() function of the Bioconductor CNV caller CopywriteR. The origin of these regions is currently unknown to me.
A recent WGS study of 1,000 Japanese individuals and 250 Dutch individuals resulted in whole-genome identification of genic SVs that are part of the supplemental materials of these papers
A CNV study of 236 individual genomes from 125 human populations
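
To make the annotation idea concrete, here is a minimal sketch of how a tumor-only call could be matched against a panel of normal SVs. It assumes a simple reciprocal-overlap criterion and same-type matching; the `SV` record, the `flag_normal_matches` helper, and the 0.5 threshold are illustrative assumptions, not something bcbio or any of the databases above prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record (0-based, half-open coordinates)."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL" or "DUP"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions (0.0 if disjoint)."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def flag_normal_matches(calls, panel, min_ro=0.5):
    """Pair each call with True if a same-type panel SV overlaps it reciprocally."""
    flagged = []
    for call in calls:
        seen_in_normals = any(
            call.svtype == normal.svtype
            and reciprocal_overlap(call, normal) >= min_ro
            for normal in panel
        )
        flagged.append((call, seen_in_normals))
    return flagged


if __name__ == "__main__":
    # Made-up coordinates, purely for illustration.
    panel = [SV("1", 1000, 2000, "DEL")]
    calls = [SV("1", 1100, 2100, "DEL"), SV("1", 1900, 5000, "DEL")]
    for call, seen in flag_normal_matches(calls, panel):
        print(call, "seen in normals" if seen else "not seen in normals")
```

The linear scan over the panel is fine for a sketch; a real panel of this size would want an interval index, and matching on copy number or content would need extra fields.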

chapmanb commented 8 years ago

Sven-Eric; This is a brilliant idea, thank you. Adding an annotation step to the final VCFs with this information is a great next step. In addition to the resources you mention, here are some other things we currently do inside bcbio:

Annotate calls that overlap high depth regions. These are often collapsed repeats that result in spurious calls.
Annotate calls where either end falls into a centromere or telomere. Most callers have filters for these, but we could annotate as well if we notice them being a source of noise.
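
As a rough illustration of the second point, the sketch below labels each end of a call with any excluded regions it falls into. The region tuples, labels, and the `annotate_breakpoints` helper are hypothetical and are not how bcbio implements this; coordinates are assumed to be 0-based and half-open.

```python
def labels_at(regions, chrom, pos):
    """Return sorted labels of all regions on chrom that contain pos."""
    return sorted(
        {label for c, start, end, label in regions
         if c == chrom and start <= pos < end}
    )


def annotate_breakpoints(call, regions):
    """Label both ends of a (chrom, start, end) call.

    The end coordinate is exclusive, so the last covered base is end - 1.
    """
    chrom, start, end = call
    return {
        "start_in": labels_at(regions, chrom, start),
        "end_in": labels_at(regions, chrom, end - 1),
    }


if __name__ == "__main__":
    # Made-up region coordinates, purely for illustration.
    regions = [
        ("1", 0, 10000, "telomere"),
        ("1", 9000, 9500, "high_depth"),
        ("1", 121000000, 125000000, "centromere"),
    ]
    print(annotate_breakpoints(("1", 9200, 50000), regions))
```

Returning labels rather than dropping calls keeps this an annotation, so downstream filtering can decide how much weight to give each flag.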

Thanks again for getting this conversation started.

schelhorn commented 8 years ago

Great, thanks for looking into my suggestions. Let's see when we have time for coming back to this. Perhaps we can use @brentp's vcfanno for the normal annotation of the SV result files once the metasv integration has been completed.

schelhorn commented 8 years ago

Another batch of resources that are of relevance in this regard are known or putative cancer SVs that should be subtracted from normal SV databases since they may be present in normal individuals but predispose to cancer. A list of these would be:

Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer (part of CGAP). The information in the Mitelman Database of Chromosome Aberrations and Gene Fusions in Cancer relates chromosomal aberrations to tumor characteristics, based either on individual cases or associations. All the data have been manually culled from the literature by Felix Mitelman, Bertil Johansson, and Fredrik Mertens.

schelhorn commented 8 years ago

And a recent preprint on CNVs in low mappability regions: http://biorxiv.org/content/early/2015/12/11/034165

schelhorn commented 7 years ago

Another set of normal SVs: http://www.nature.com/articles/ncomms12989

schelhorn commented 7 years ago

Here is a set of crowdsourced gold-standard CNVs on GIAB data: http://biorxiv.org/content/early/2016/12/13/093526

ohofmann commented 7 years ago

Michael Talkowski just published their list of (complex) germline SVs: https://paperpile.com/shared/472Jf5

schelhorn commented 7 years ago

And another large study: http://biorxiv.org/content/early/2017/03/22/119461

schelhorn commented 6 years ago

ExAC CNVs: http://blog.goldenhelix.com/grudy/exac-cnvs-the-first-large-scale-public-exome-cnv-variant-set/

schelhorn commented 6 years ago

Some additional references on validating SV calls:

ohofmann commented 6 years ago

Pinging @pdiakumis to this thread (and https://github.com/chapmanb/bcbio-nextgen/issues/1592#event-1320502818) for a list of SV references. We are currently testing Manta/BPI and GRIDSS on a number of samples including a partially validated list of SV events from COLO829; happy to compare notes, particularly for any somatic SVs.

We are also looking into comparing Manta and GRIDSS calls on a number of 10X tumor/normal WGS samples (with the 10X SV caller as the reference).

schelhorn commented 6 years ago

SVs determined from linked read sequencing by 10x Genomics, referencing samples with known SVs of different classes: https://www.biorxiv.org/content/early/2017/12/08/231662

(...) a set of 23 samples with known balanced, unbalanced or complex SVs from either 1) the GetRm CNV Panel (unbalanced events) or 2) the Coriell general Cell Repository (balanced events). These cell lines have multiple, orthogonal assays confirming the presence of their described structural variants

And another one, for deletions: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1175/4647672

Additionally we used the short read dataset SRX652547 generated from the CHM1 cell line, which is derived from a haploid genome. It contains Illumina paired-end reads of length 101 bp sequenced at a 41-fold coverage. Interestingly, a list of variants was compiled for the same cell line by sequencing conjointly single-molecule long reads with Pacific Biosciences instrument (PacBio) at a 54-fold coverage (24). We used this list of variants to evaluate the quality of the predicted deletions.

An ensemble method with machine learning bolted on: https://www.biorxiv.org/content/biorxiv/early/2017/11/17/113498.full.pdf

Our training set included high coverage (48x, N=27) and low coverage (7x, N=2,494) WGS from the 1KGP. Training features were collected from a gold-standard SV call set on the above individuals with an estimated false discovery rate (FDR) of 1-4% (Sudmant, et al., 2015), totaling 297,131 genotypes from 11,747 unique loci.

schelhorn commented 6 years ago

Another gold-standard SV set:

http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation/

schelhorn commented 6 years ago

A new aggregated set by NCBI's dbVar - seems to be the best one so far: https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants

schelhorn commented 6 years ago

Community effort for annotating SVs for GIAB: http://www.svcurator.com/

schelhorn commented 6 years ago

Breast cancer cell line (SK-BR-3) sequenced and assembled using PacBio long-read tech, with detailed SVs: http://m.genome.cshlp.org/content/early/2018/06/28/gr.231100.117.abstract

roryk commented 4 years ago

The gnomAD-SV project has a list too: https://www.biorxiv.org/content/10.1101/578674v1

naumenko-sa commented 4 years ago

Finally, VEP has an annotator plugin for SVs and annotates with gnomAD's SVs: http://www.ensembl.info/2020/03/27/cool-stuff-the-ensembl-vep-can-do-annotating-structural-variants/

roryk commented 4 years ago

Awesome!! I wonder if Pablo wants to add this to snpEff. :)

bcbio / bcbio-nextgen

Resources for SVs of normal populations in order to prioritize tumor-only variant calls #963