Mangul-Lab-USC / benchmarking_SV

Updated figures for "A benchmarking of WGS-based structural variant callers" paper
MIT License

Account for microhomology and CIPOS #8

Open d-cameron opened 4 years ago

d-cameron commented 4 years ago

Many germline events have microhomology at the breakpoint. When matching an event, this should be taken into account, as there are many ways to report the same event in VCF. Figure 9 of the VCF specification outlines this.
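To make the ambiguity concrete, here is a minimal sketch of how far a deletion can slide along the reference while still producing the same alternate sequence. The function name `deletion_shift_range` and the 0-based, half-open coordinates are assumptions made for this illustration; they are not taken from any caller or from the benchmarking code.

```python
from typing import Tuple


def deletion_shift_range(ref: str, start: int, end: int) -> Tuple[int, int]:
    """Return (left, right): how many bases the deletion of ref[start:end]
    can be shifted left or right and still yield the same alternate sequence.

    Coordinates are 0-based and half-open (an assumption of this sketch).
    """
    if not 0 <= start < end <= len(ref):
        raise ValueError("expected 0 <= start < end <= len(ref)")

    # Shifting left by one base is equivalent when the base that enters the
    # deleted window on the left equals the base that leaves it on the right.
    left = 0
    while start - left - 1 >= 0 and ref[start - left - 1] == ref[end - left - 1]:
        left += 1

    # Shifting right by one base is equivalent under the mirrored condition.
    right = 0
    while end + right < len(ref) and ref[start + right] == ref[end + right]:
        right += 1

    return left, right


# Toy reference with an AC repeat: deleting any one AC unit gives the same result.
ref = "GGACACACTT"
left, right = deletion_shift_range(ref, 2, 4)
```

For this toy reference the deletion of the first `AC` unit can be placed anywhere within the repeat, so a truth set and a caller could legitimately report different positions for the same event.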

Some callers report CIPOS (and, in the case of GRIDSS, the non-standard HOMPOS field), which is what the caller itself thinks the extent of the homology is, but others do not.

Properly matching equivalent variants can be very messy. For example, in the HG002 truth set, a SINE duplication within a SINE repeat (ref=SINE-SINE-SINE, var=SINE-SINE-SINE-SINE) is reported as an INS event after the 3rd SINE, but the short read callers report it as a DUP of the first SINE. These calls both result in the same sequence, but they're 600bp away from each other! Events such as these are a bit extreme and difficult to handle, but a basic check of homology and/or respecting the CIPOS reported by the variant caller will result in more accurate benchmarking results.
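As a rough sketch of what "respecting the CIPOS" could look like when comparing a call to a truth event: treat each breakpoint as an interval rather than a point and require the intervals to overlap. The function names and the `margin` parameter are hypothetical, and CIPOS is assumed to have already been parsed from the VCF INFO field into a `(low, high)` pair of offsets relative to POS (for example `CIPOS=-10,0` becomes `(-10, 0)`).

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]


def breakpoint_interval(pos: int, cipos: Optional[Interval] = None,
                        margin: int = 0) -> Interval:
    """Expand a breakpoint position into the interval it may occupy.

    cipos is a (low, high) offset pair relative to pos; a missing CIPOS is
    treated as (0, 0). margin is an extra symmetric tolerance in bp.
    """
    low, high = cipos if cipos is not None else (0, 0)
    return pos + low - margin, pos + high + margin


def breakpoints_match(truth_pos: int, call_pos: int,
                      truth_cipos: Optional[Interval] = None,
                      call_cipos: Optional[Interval] = None,
                      margin: int = 0) -> bool:
    """True when the truth and call breakpoint intervals overlap."""
    t_low, t_high = breakpoint_interval(truth_pos, truth_cipos, margin)
    c_low, c_high = breakpoint_interval(call_pos, call_cipos)
    return t_low <= c_high and c_low <= t_high


# Made-up positions: the call sits 8 bp from the truth breakpoint.
exact = breakpoints_match(1000, 1008)                       # no tolerance
with_cipos = breakpoints_match(1000, 1008, call_cipos=(-10, 0))
```

This only covers callers that report CIPOS. For callers that do not, the homology would have to be derived from the reference sequence, and cases like the 600bp SINE example would still need sequence-level comparison.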

The delta between the caller's event length and the actual event length for TPs is a good indicator of how well an SV caller gets the length correct. You should find that, in contrast to the overall average lengths, the per-event lengths predicted by BreakDancer do not match the actual event lengths very closely - something that has the potential to change someone's choice of caller.
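A small sketch of the per-event length delta, using made-up numbers, shows why it can disagree with a comparison of average lengths. The function `summarize_length_deltas` and its input layout (a list of `(caller_length, truth_length)` pairs for matched TPs) are assumptions for illustration only.

```python
from statistics import mean, median
from typing import Dict, Iterable, Tuple


def summarize_length_deltas(pairs: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarize length agreement for matched true positives.

    pairs holds (caller_length, truth_length) per TP; signs are ignored so
    that deletions reported with negative SVLEN compare on magnitude.
    """
    pairs = [(abs(c), abs(t)) for c, t in pairs]
    if not pairs:
        raise ValueError("no true positives to summarize")
    deltas = [c - t for c, t in pairs]
    return {
        "mean_caller_length": mean(c for c, _ in pairs),
        "mean_truth_length": mean(t for _, t in pairs),
        "mean_delta": mean(deltas),
        "median_abs_delta": median(abs(d) for d in deltas),
    }


# Made-up example: one event called 100 bp too short, one 100 bp too long.
summary = summarize_length_deltas([(100, 200), (300, 200)])
```

In this toy case the mean caller length and the mean truth length are identical, yet every individual call is off by 100 bp, which only the absolute per-event delta reveals.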

smangul1 commented 4 years ago

Thanks for bringing this up!

Varuni, let's discuss this in more detail and address it in the next release of the data analysis.

Serghei