dantaki / SV2

Support Vector Structural Variation Genotyper

reasonable prefiltering of CNV calls #24

Closed furbelows closed 5 years ago

furbelows commented 5 years ago

Hi Dan, nice program.

I'm trying this out on a large collection of WGS samples called using Manta. I just ran some test cases and everything is working ok.

I just had a question: in your experience, what are some reasonable pre-filtering criteria for Manta-based CNV calls? There are quite a few of them, and many of them (especially large ones) tend to fail the "eyeball" test.

In your Science paper, did you do any prefiltering of calls before you ran your samples through SV2?

This is ~60x.

dantaki commented 5 years ago

Thanks for the interest in my work. The supplement of the Science paper details the SV filters we applied. These filters (at least the segmental duplication filter) are also typically used for microarray analyses of CNVs.

As a preliminary filtering step, SVs were removed from the consensus callset if they overlapped by more than 66% with centromeres, segmental duplications, regions with low mappability with 100bp reads, or regions subject to somatic V(D)J recombination (parts of antibody and T-cell receptor genes). SVs called by Manta or Lumpy were filtered if they had one or both breakpoints overlapping one of these regions. Regions used for filtering can be found in our previous publication [Brandler, Antaki, Gujral AJHG 2016].

You can find BED files of these features on the UCSC Table Browser in the reference build of your choice. The low-mappability regions used in the publication were derived from the DAC Blacklist, but I would now recommend using the UMAP track (k24 would be a more stringent filter, while k100 would be more lenient).
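For anyone who wants to prototype the overlap rule described above, here is a minimal Python sketch. It is not the code used for the paper. It assumes BED-style, 0-based half-open coordinates, and the function names (`merge_intervals`, `masked_fraction`, `breakpoint_in_mask`, `keep_sv`) and the toy intervals are made up for illustration. The mask would be the union of whatever BED tracks you download.

```python
from collections import defaultdict


def merge_intervals(regions):
    """Merge overlapping (chrom, start, end) intervals, per chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    merged = {}
    for chrom, ivs in by_chrom.items():
        ivs.sort()
        out = [list(ivs[0])]
        for s, e in ivs[1:]:
            if s <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1][1] = max(out[-1][1], e)
            else:
                out.append([s, e])
        merged[chrom] = [tuple(iv) for iv in out]
    return merged


def masked_fraction(sv, mask):
    """Fraction of the SV's length covered by the merged mask."""
    chrom, start, end = sv
    length = end - start
    if length <= 0:
        return 0.0
    covered = sum(
        max(0, min(end, e) - max(start, s)) for s, e in mask.get(chrom, [])
    )
    return covered / length


def breakpoint_in_mask(sv, mask):
    """True if the first or last base of the SV falls inside the mask."""
    chrom, start, end = sv
    return any(
        s <= pos < e for pos in (start, end - 1) for s, e in mask.get(chrom, [])
    )


def keep_sv(sv, mask, max_fraction=0.66, check_breakpoints=False):
    """Apply the >66% overlap rule, plus an optional breakpoint rule."""
    if masked_fraction(sv, mask) > max_fraction:
        return False
    if check_breakpoints and breakpoint_in_mask(sv, mask):
        return False
    return True


# Toy mask and calls (hypothetical coordinates, not real annotation).
mask = merge_intervals([("chr1", 100, 200), ("chr1", 150, 300)])
calls = [("chr1", 0, 1000), ("chr1", 90, 310), ("chr1", 250, 1000)]
kept = [sv for sv in calls if keep_sv(sv, mask)]
kept_strict = [sv for sv in calls if keep_sv(sv, mask, check_breakpoints=True)]
```

Setting `check_breakpoints=True` mimics the stricter rule for Manta/Lumpy calls, where a single breakpoint inside a masked region is enough to drop the call. For a large callset an interval tree or a tool that works directly on BED files would be faster than this linear scan.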

You can also remove SVs that are extremely large; LUMPY and Manta tend to call SVs that are near the size of the chromosome, I think due to repetitive telomeric sequence. Many of these SV calls would be lethal (monosomies/trisomies of chr1, for example), so you can remove those, depending on the context of your study. In ASD, we don't really expect germline SVs to be greater than 25-30Mb, so we typically remove SVs larger than that. Larger SVs also take SV2 longer to process, so keep that in mind if time is of value to you.

I hope this helped. If you would like more information on how to process SVs in WGS, I would check out our publications, Sudmant 2015 Nature, and the most recent 1000 Genomes SV analysis on bioRxiv (Chaisson 2017).
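The size cutoff mentioned above is simple to apply before running SV2. A minimal sketch, assuming calls are `(chrom, start, end)` tuples in BED-style coordinates; the 25 Mb default is just the lower end of the 25-30Mb range given above, and `within_size_limit` is a hypothetical helper name, so adjust both to your study:

```python
MAX_SV_LENGTH = 25_000_000  # assumed cutoff: lower end of the 25-30Mb range


def within_size_limit(sv, max_length=MAX_SV_LENGTH):
    """True if the SV spans no more than max_length bases."""
    _chrom, start, end = sv
    return (end - start) <= max_length


# Toy calls (hypothetical coordinates).
calls = [
    ("chr1", 1_000_000, 1_050_000),  # 50 kb deletion-sized call
    ("chr1", 10_000, 240_000_000),   # near chromosome-sized call
]
size_filtered = [sv for sv in calls if within_size_limit(sv)]
```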