Open schelhorn opened 9 years ago
Sven-Eric; This is a brilliant idea, thank you. Adding an annotation step to the final VCFs with this information is a great next step. In addition to the resources you mention, some other things we currently do inside bcbio:
Thanks again for getting this conversation started.
Great, thanks for looking into my suggestions. Let's see when we have time to come back to this. Perhaps we can use @brentp's vcfanno for the normal annotation of the SV result files once the metasv integration has been completed.
Another batch of resources that are of relevance in this regard are known or putative cancer SVs that should be subtracted from normal SV databases since they may be present in normal individuals but predispose to cancer. A list of these would be:
And a recent preprint on CNVs in low-mappability regions: http://biorxiv.org/content/early/2015/12/11/034165
Another set of normal SVs: http://www.nature.com/articles/ncomms12989
Here is a set of crowdsourced gold-standard CNVs on GiaB data: http://biorxiv.org/content/early/2016/12/13/093526
Michael Talkowski just published their list of (complex) germline SVs: https://paperpile.com/shared/472Jf5
And another large study: http://biorxiv.org/content/early/2017/03/22/119461
Some additional references on validating SV calls:
Pinging @pdiakumis to this thread (and https://github.com/chapmanb/bcbio-nextgen/issues/1592#event-1320502818) for a list of SV references. We are currently testing Manta/BPI and GRIDSS on a number of samples, including a partially validated list of SV events from COLO829; happy to compare notes, particularly for any somatic SVs.
We are also looking into comparing Manta and GRIDSS calls on a number of 10X tumor/normal WGS samples (with the 10X SV caller as the reference).
SVs determined from linked read sequencing by 10x Genomics, referencing samples with known SVs of different classes: https://www.biorxiv.org/content/early/2017/12/08/231662
(...) a set of 23 samples with known balanced, unbalanced or complex SVs from either 1) the GetRm CNV Panel (unbalanced events) or 2) the Coriell general Cell Repository (balanced events). These cell lines have multiple, orthogonal assays confirming the presence of their described structural variants
And another one, for deletions: https://academic.oup.com/nar/advance-article/doi/10.1093/nar/gkx1175/4647672
Additionally we used the short read dataset SRX652547 generated from the CHM1 cell line, which is derived from a haploid genome. It contains Illumina paired-end reads of length 101 bp sequenced at a 41-fold coverage. Interestingly, a list of variants was compiled for the same cell line by sequencing conjointly single-molecule long reads with Pacific Biosciences instrument (PacBio) at a 54-fold coverage (24). We used this list of variants to evaluate the quality of the predicted deletions.
An ensemble method with machine learning bolted on: https://www.biorxiv.org/content/biorxiv/early/2017/11/17/113498.full.pdf
Our training set included high coverage (48x, N=27) and low coverage (7x, N=2,494) WGS from the 1KGP. Training features were collected from a gold-standard SV call set on the above individuals with an estimated false discovery rate (FDR) of 1-4% (Sudmant, et al., 2015), totaling 297,131 genotypes from 11,747 unique loci.
Another gold-standard SV set:
http://eichlerlab.gs.washington.edu/publications/chm1-structural-variation/
A new aggregated set by NCBI's dbVar - seems to be the best one so far: https://github.com/ncbi/dbvar/tree/master/Structural_Variant_Sets/Nonredundant_Structural_Variants
Community effort for annotating SVs for GIAB: http://www.svcurator.com/
Breast cancer cell line (SK-BR-3) sequenced and assembled using PacBio long-read tech, with detailed SVs: http://m.genome.cshlp.org/content/early/2018/06/28/gr.231100.117.abstract
The gnomAD-SV project has a list too: https://www.biorxiv.org/content/10.1101/578674v1
Finally, VEP has an annotator plugin for SVs and annotates with gnomAD's SVs: http://www.ensembl.info/2020/03/27/cool-stuff-the-ensembl-vep-can-do-annotating-structural-variants/
Awesome!! I wonder if Pablo wants to add this to snpEff. :)
Edit: despite the title, this link list now contains all kinds of reference material for benchmarking/normal-filtering SV (including CNV) calls on a population level as well as on a single-sample (mostly cell line) level.
For tumor-only calling of SVs, it would be useful to be able to annotate putative called SVs with similar (wrt. structure, copy number, and content) SV calls from public databases of normal controls in order to flag potential false positives. This is analogous to the ExAC and 1kG reference panels for micro variants. I suggest that we could use this issue to collect such resources and package them into bcbio once we feel that is warranted. I offer to start with five resources known to me - I am sure there are many more:
CopywriteR::preCopywriteR() function. The origin of these regions is currently unknown to me.
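To make the normal-panel idea above a bit more concrete, here is a minimal Python sketch of flagging called SVs that resemble an entry in a panel of normal SVs. Everything in it is an assumption for illustration, not something bcbio does: the `SV` record, the function names, the use of reciprocal overlap as the similarity measure, and the 50% default threshold are all hypothetical choices. It only compares type and coordinates, so it ignores copy number and content, and it does not handle breakend-style events.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record (not a bcbio or VCF data structure)."""
    chrom: str
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    svtype: str  # e.g. "DEL", "DUP"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if disjoint."""
    if a.chrom != b.chrom:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def flag_known_normals(
    calls: Iterable[SV], normals: List[SV], min_ro: float = 0.5
) -> List[Tuple[SV, bool]]:
    """Pair each call with True if a same-type normal SV overlaps it enough.

    min_ro=0.5 is an assumed threshold, chosen only for illustration.
    """
    flagged = []
    for call in calls:
        hit = any(
            call.svtype == normal.svtype
            and reciprocal_overlap(call, normal) >= min_ro
            for normal in normals
        )
        flagged.append((call, hit))
    return flagged
```

This is a quadratic scan, which is fine for a sketch; a real annotation step over the larger resources in this thread would presumably want an interval index or an existing tool such as vcfanno.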