brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0
237 stars 21 forks source link

SU change after SVtyper #96

Open lee039 opened 5 years ago

lee039 commented 5 years ago

Hi, could you please help me understanding how SU is counted in Lumpy and SVtyper?

I run smoove with the following steps:

  1. samblaster -> 2.smoove call -> 3.smoove merge -> 4.smoove genotype ->5.smoove paste

I was inspecting output file from "smoove paste" step to see what is the minimum SU to call SVs. In my data, the minimum seems like SU=4 (i.e. PE=3, SR=1, or PE=4, SR=0).

The example is:

chr6 37085977 7483 N 812.44 . SVTYPE=DEL;SVLEN=-707;END=37086684;STRANDS=+-:4;IMPRECISE;CIPOS=-30,705;CIEND=-619,29;CIPOS95=-5,243;CIEND95=-203,7;SU=4;PE=4;SR=0;GCF=0.514124;SNAME=sample_1:1671;ALG=PROD;AN=6;AC=3 GT:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB:DHFC:DHFFC:DHBFC:DHSP 0/1:157:812.44:-82,-1,-17:71:36:35:35:34:13:0:7:22:27:0.49:0:0:0:44 0/1:86:86.53:-18,-10,-50:73:61:11:61:10:23:0:0:38:10:0.14:0.461538:0.428571:0.571429:21 0/1:200:341.46:-39,-5,-40:78:58:19:57:19:20:0:3:37:16:0.25:0.407407:0.354839:0.478261:26

Then, I checked this site in the individual vcf files from "2.smoove call" step. This is the site from sample_1: chr6 37085977 1671 N 812.44 . SVTYPE=DEL;SVLEN=-707;END=37086684;STRANDS=+-:4;IMPRECISE;CIPOS=-30,705;CIEND=-619,29;CIPOS95=-5,243;CIEND95=-203,7;SU=4;PE=4;SR=0;PRPOS=(too long to paste all the PRPOS values, so I skip) AC=1;AN=2;GCF=0.514124 GT:GQ:SQ:GL:DP:RO:AO:QR:QA:RS:AS:ASC:RP:AP:AB:DHFC:DHFFC:DHBFC:DHSP 0/1:157:812.44:-82,-1,-17:71:36:35:35:34:13:0:7:22:27:0.49:0:0:0:44 The other two samples do not have this site called.

My questions are: 1. This site was detected by one sample based on SU=4 (PE=4, SR=0) and AP was 22. To my understanding, in an individual vcf file, generated from "smoove call", PE and AP should be the same or the difference should not be this large. Or am I getting something wrong here?

2. For the site I showed above, SVTyper gave 0/1 to the other two samples as well. Yet SU is still SU=4 in the population vcf. Then how does SVtyper genotyped without collecting any SU from the two samples?

  1. When I look at the site using igv, giving disc.bams and split.bams as input, I see the following: image So, actually sample 2 and sample 3 seem to have PEs that support this site. I am puzzled why these reads are not counted by SVtyper.

Thanks a lot in advance for your comments! :)

Lim

brentp commented 5 years ago

If I remember correctly, SU is from the original lumpy call from one of the samples. So it's not informative after genotyping and merging. SVTyper and lumpy will count slightly differently. Usually these numbers are close, but after merging (and recalculating the confidence intervals) the numbers can change.