fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License
354 stars 47 forks source link

sv merge problem #127

Open hygine opened 4 years ago

hygine commented 4 years ago

merge sv: the following questions were found: 1) All_samples_gt.vcf: 15 2561237 del0022944sur, one sample showing 1/1: na: 47: 0,15:+-:del:15_2561237-15_2561242, the other sample showing: 1/1: na: 47: 0,18:+-:del:15_2561237-15_2561242, The merge result shows DEL occurred at 15: 2561237-2561422. But R01.vcf shows 15 2561238 end = 2561279, i.e. DEL occurs at 15: 2561238-2561279; R02.vcf shows that 15 25612378 end = 2561422, i.e. del occurred at 15: 25612378-2561422; Then why only one of the two samples, R02, was retained after merge? This section of R01 and R02 has only a 1bp intersection. And when outputting, is it better to directly output the coordinates originally detected in the 2 samples? Given a large interval after merge, START END, is it feasible to output the original detected coordinate interval for each subsequent sample genotype?

2) In All_samples_gt.vcf: 15 25172622 del0022941sur, showing 1/1: na: 37: 2,12:+-:del:15_25172622-15_25172659 0/1: na: 37: 6,8:+-:del:15_25172622-15_25172659 That is, DEL, R01 homozygous and R02 heterozygous occurred in chr15:25172622-25172659. R01.vcf shows 15 25172622 end = 2517 2659, R01 is 0/1:4:12 R02.vcf shows that DEL and INS occurred at the two positions of 15 251116992 and 15 25137689, while there is no SV in the range of CHR 15: 25100000-25200000 for the rest, then How to get 15 25172622 from All_samples_gt.vcf and homozygous and heterozygous DEL occurred in 2 samples respectively?

@fritzsedlazeck @lindenb @wdecoster

hygine commented 4 years ago

发现的疑问点如下: 1)All_samples_gt.vcf中 15 25612378 DEL0022944SUR,显示1/1:NA:47:0,15:+-:DEL:15_25612378-15_25612422 1/1:NA:47:0,18:+-:DEL:15_25612378-15_25612422, 即15:25612378-25612422发生 DEL; R01.vcf 中显示15 25612328 END=25612379,即15:25612328-25612379 发生DEL ; R02.vcf 中显示 15 25612378 END=25612422,即15:25612378-25612422 发生DEL ; 那么merge后为什么只保留了2个样本中其中的一个,即R02的位置坐标?R01和R02的这段区域只有1bp交集。而且输出的时候是不是将2个样本中原始检测的坐标直接输出会比较好呢?给一个merge后大的区间,START END,在后面的每个样本基因型时输出各自原来检测到的坐标区间是否可行? 2)All_samples_gt.vcf中 15 25172622 DEL0022941SUR,显示 1/1:NA:37:2,12:+-:DEL:15_25172622-15_25172659 0/1:NA:37:6,8:+-:DEL:15_25172622-15_25172659 即chr15:25172622-25172659发生了DEL,R01纯合,R02杂合; R01.vcf 中显示 15 25172622 END=25172659,R01为0/1:4:12 R02.vcf 中显示 15 25116992 15 25137689 这2个位置发生了DEL 和 INS,其他没有chr15:25100000-:25200000范围内的SV,那么 All_samples_gt.vcf中是如何得到 15 25172622分别在2个样本中发生了纯合和杂合DEL呢?

fritzsedlazeck commented 4 years ago

Hi, the merge combines two overlapping SV to one entry and notes the merge in the GT fileds. Can you post a part of the VCF file? This is a bit hard to follow for me. Thanks Fritz

biozzq commented 4 years ago

Dear @fritzsedlazeck

Follow another question related to this title. When performing SV merging (one sample called using different tools), does SURVIVOR take the GT into consideration? I found it didn't (see following example). Do we need to filter out those genotypically inconsistent regions?

1 264391 DUP00000021 A <DUP> 3508 PASS SUPP=2;SUPP_VEC=101;SVLEN=265;SVTYPE=DUP;SVMETHOD=SURVIVOR1.0.7;CHR2=1;END=264712;CIPOS=0,156;CIEND=0,43;STRANDS=-+ GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:208:0,6:-+:3508:DUP:8:NA:NA:1_264547-1_264755 ./.:NaN:0:0,0:--:NaN:NaN:NaN:NAN:NAN:NAN 0/0:NA:321:411,0:-+:198:DUP:DUP00000021:NA:NA:1_264391-1_264712

One more question, do you have some suggested tools when merging SVs from different samples?

Sincerely, Zhuqing

hygine commented 4 years ago

Hi, the merge combines two overlapping SV to one entry and notes the merge in the GT fileds. Can you post a part of the VCF file? This is a bit hard to follow for me. Thanks Fritz

ok, I have send the file to your email , thanks to check the problem. @fritzsedlazeck

fritzsedlazeck commented 4 years ago

Dear @biozzq , no SURVIVOR does not take the GT into account as many tools often dont report the GT.

The example you show here is such a case. You have 1 program calling it a het (0/1) one program does not infer it (./.) and one program labels it as homozygous reference (0/0).

We are using SURVIVOR for inter and intra sample merges. We recently used it to merge multiple thousands of human genomes together.

Hope that helps. Fritz

fritzsedlazeck commented 4 years ago

Dear @hygine , thanks for sending the sample.

@1: Seems to work for me. Maybe try to pull the newest code version and try again: 15 25612378 7916 N <DEL> . PASS **SUPP=2**;SUPP_VEC=11;SVLEN=-44;S

@2: I am not 100% sure I understand your point. I think however that I also cannot reproduce this: 15 25137689 7914 N <INS> . PASS SUPP=2;SUPP_VEC=11;SVLEN=88;SVTYPE=INS;SVMETHOD=SURVIVOR1.0.7;CHR2=15;END=25137779;CIPOS=0,20;CIEND=0,20;STRANDS=+- GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:93:20,27:+-:.:INS:6796:NA:NA:15_25137709-15_25137794 0/1:NA:88:21,16:+-:.:INS:7914:NA:N 15 25172622 6797 N <DEL> . PASS SUPP=1;SUPP_VEC=10;SVLEN=-37;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;CHR2=15;END=25172659;CIPOS=0,0;CIEND=0,0;STRANDS=+- GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:37:4,12:+-:.:DEL:6797:NA:NA:15_25172622-15_25172659 ./.:NaN:0:0,0:--:NaN:NaN:NaN:NAN:N

So the del is reported only to have 1 support and the ins was found in both samples, which seems to be the correct result.

Please let me know if I overlooked something or if this is resolved with the newest code.

Thanks Fritz

biozzq commented 4 years ago

Dear @fritzsedlazeck,

Thank you. After you merge thousands of genomes, which software do you use to call the genotypes? Maybe you will never do genotype again as the input VCF file for each sample already contains genotype information.

Sincerely, Zhuqing

fritzsedlazeck commented 4 years ago

Dear @biozzq We typically dont do genotyping often. There are tools that you can use but most of them are quite optimized for their own SV Caller: https://academic.oup.com/gigascience/article/8/9/giz110/5565134

We recently also published this: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1909-7

Its still not very push button! Hope that helps Fritz

hygine commented 4 years ago

Dear @fritzsedlazeck

@1: always have problem, merge result: 15 25612378 7916 N . PASS SUPP=2;SUPP_VEC=11;SVLEN=-44;SVTYPE=DEL;SVMETHOD=SURVIVOR1.0.7;CHR2=15;END=25612422;CIPOS=-50,0;CIEND=-43,0;STRANDS=+- GT:PSV:LN:DR:ST:QV:TY:ID:RAL:AAL:CO 0/1:NA:51:12,15:+-:.:DEL:6799:NA:NA:15_25612328-15_25612379 1/1:NA:44:1,18:+-:.:DEL:7916:NA:NA:15_25612378-15_25612422

but R01.vcf result: DEL:6799:NA:NA:15_25612328-15_25612379 but R02.vcf result: DEL:7916:NA:NA:15_25612378-15_25612422 why merge result: DEL:7916;15_25612378-15_25612422

seems like only save the R02.vcf's SV result? and why?

fritzsedlazeck commented 4 years ago

What was your parameter? If you put in 1000 for the distance than this result is fine. The start and stop are within 1000bp of each other. The resulting entry is chosen to be the 2nd one and in the genotype fields you have the results from R01 + R02 listed.

SURVIVOR always chooses one to represent the other to make the entry a valid entry in the VCF file. hope that helps Fritz

hygine commented 4 years ago

I suggest that if two SVs has overlap, we can set a overlap parameter to merge the overlap SVs. Such as if we set the overlap parameter (30 bp), if the two SVs overlap more than 30bp so these SV can be merged one SV, and the start position is the smallest position of two SV's start position, the end position is the biggest position of two SV's end position. If two SVs overlap less than 30bp, these SVs are the independent SV. How do you think that? @fritzsedlazeck

fritzsedlazeck commented 4 years ago

I am sorry but this is against what we saw in many studies. E.g. if you have a population, there is a difference in the allele if there is e.g. a deletion or duplication partially overlapping (e.g. 30bp). Previously the field has looked at 50% reciprocal overlaps. This has been done for a while due to (in my opinion) the inaccuracy of breakpoints from CNV callers and that they mainly looked at very larger events (multiple hundred kb - Mbp).

Our approach is based on our experience across several population studies or multi sample/ platform studies, where we want to be more precise in what allele is what.

Thus, this is not what SURVIVOR merge is doing and simply wont in the future.

Thanks Fritz

yangyxt commented 2 years ago

Hi, why can't the users have a parameter specifying the reciprocal overlap fraction threshold of two SV events for them to be deemed as referring to the same one?

Also, I wonder whether SURVIVOR uses graph theory clique to determine the global optimum merging set. It is too fast to seemingly adopt this strategy...

Anyway, thank you for making this tool. At least now merging hundreds of samples' of SV records are computationally plausible.

fritzsedlazeck commented 2 years ago

Thanks. yeah no graph behind it. the parameter is not well tested and adjusted. That is why I didn't activate it.

Cheers Fritz