J35P312 / SVDB

structural variant database software
MIT License
38 stars 16 forks source link

Merging BNDs in different chromosomes #55

Closed fgvieira closed 2 years ago

fgvieira commented 2 years ago

I am trying to merge some VCFs with SVDB v2.6.4, but I am having some trouble with BNDs.

If I have a VCF with two BNDs:

chr3    195691426       MantaBND:34332:0:1:0:0:0:0      G       G]chr5:1636168] 626     PASS    BND_DEPTH=90;CIPOS=0,12;HOMLEN=12;HOMSEQ=ATTCAGTCCTGG;MATEID=MantaBND:34332:0:1:0:0:0:1;MATE_BND_DEPTH=63;SVTYPE=BND;POS=195691426    GT:FT:GQ:PL:PR:SR       0/1:PASS:626:676,0,999:81,1:59,31
chr5    1636156 MantaBND:34332:0:1:0:0:0:1      T       T]chr3:195691438]       626     PASS    BND_DEPTH=63;CIPOS=0,12;HOMLEN=12;HOMSEQ=CCAGGACTGAAT;MATEID=MantaBND:34332:0:1:0:0:0:0;MATE_BND_DEPTH=90;SVTYPE=BND;POS=1636156      GT:FT:GQ:PL:PR:SR       0/1:PASS:626:676,0,999:81,1:59,31

and merge it with itself:

svdb --merge --vcf test_manta.vcf.gz:truth test_manta.vcf.gz:query --priority truth,query --pass_only --no_intra --overlap 0.1 --bnd_distance 1000 --ins_distance 50

I get:

chr3    195691426       MantaBND:34332:0:1:0:0:0:0:truth|MantaBND:34332:0:1:0:0:0:0:query|MantaBND:34332:0:1:0:0:0:1:query      G       G]chr5:1636168] 626     PASS    BND_DEPTH=90;CIPOS=0,12;HOMLEN=12;HOMSEQ=ATTCAGTCCTGG;MATEID=MantaBND:34332:0:1:0:0:0:1;MATE_BND_DEPTH=63;SVTYPE=BND;POS=195691426;VARID=MantaBND:34332:0:1:0:0:0:0:query|MantaBND:34332:0:1:0:0:0:1:query;set=Intersection;FOUNDBY=2;truth_CHROM=MantaBND_34332_0_1_0_0_0_0|chr3;query_CHROM=MantaBND_34332_0_1_0_0_0_0|chr3,MantaBND_34332_0_1_0_0_0_1|chr5;truth_POS=MantaBND_34332_0_1_0_0_0_0|195691426;query_POS=MantaBND_34332_0_1_0_0_0_0|195691426,MantaBND_34332_0_1_0_0_0_1|1636156;truth_QUAL=MantaBND_34332_0_1_0_0_0_0|626;query_QUAL=MantaBND_34332_0_1_0_0_0_0|626,MantaBND_34332_0_1_0_0_0_1|626;truth_FILTERS=MantaBND_34332_0_1_0_0_0_0|PASS;query_FILTERS=MantaBND_34332_0_1_0_0_0_0|PASS,MantaBND_34332_0_1_0_0_0_1|PASS;truth_SAMPLE=MantaBND_34332_0_1_0_0_0_0|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31;query_SAMPLE=MantaBND_34332_0_1_0_0_0_0|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31,MantaBND_34332_0_1_0_0_0_1|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31;truth_INFO=MantaBND_34332_0_1_0_0_0_0|BND_DEPTH:90|CIPOS:0:12|HOMLEN:12|HOMSEQ:ATTCAGTCCTGG|MATE_BND_DEPTH:63|SVTYPE:BND|POS:195691426;query_INFO=MantaBND_34332_0_1_0_0_0_0|BND_DEPTH:90|CIPOS:0:12|HOMLEN:12|HOMSEQ:ATTCAGTCCTGG|MATE_BND_DEPTH:63|SVTYPE:BND|POS:195691426,MantaBND_34332_0_1_0_0_0_1|BND_DEPTH:63|CIPOS:0:12|HOMLEN:12|HOMSEQ:CCAGGACTGAAT|MATE_BND_DEPTH:90|SVTYPE:BND|POS:1636156;svdb_origin=truth|query  GT:FT:GQ:PL:PR:SR       0/1:PASS:626:676,0,999:81,1:59,31
chr5    1636156 MantaBND:34332:0:1:0:0:0:1      T       T]chr3:195691438]       626     PASS    BND_DEPTH=63;CIPOS=0,12;HOMLEN=12;HOMSEQ=CCAGGACTGAAT;MATEID=MantaBND:34332:0:1:0:0:0:0;MATE_BND_DEPTH=90;SVTYPE=BND;POS=1636156;set=truth;FOUNDBY=1;truth_CHROM=MantaBND_34332_0_1_0_0_0_1|chr5;truth_POS=MantaBND_34332_0_1_0_0_0_1|1636156;truth_QUAL=MantaBND_34332_0_1_0_0_0_1|626;truth_FILTERS=MantaBND_34332_0_1_0_0_0_1|PASS;truth_SAMPLE=MantaBND_34332_0_1_0_0_0_1|NA24385-NA24385_fda_truthv2-Normal_Blood_noinfo-WGS_v1_noinfo_noinfo-210628_GIABprecisionFDA_GIABFDA99-precisionFDA_noinfo_WGSvalidation-S1_001|GT:0/1|FT:PASS|GQ:626|PL:676:0:999|PR:81:1|SR:59:31;truth_INFO=MantaBND_34332_0_1_0_0_0_1|BND_DEPTH:63|CIPOS:0:12|HOMLEN:12|HOMSEQ:CCAGGACTGAAT|MATE_BND_DEPTH:90|SVTYPE:BND|POS:1636156;svdb_origin=truth  GT:FT:GQ:PL:PR:SR     0/1:PASS:626:676,0,999:81,1:59,31

It seems it is merging three BNDs, one of them on another chr(!):

And leaving MantaBND:34332:0:1:0:0:0:1:truth unmerged. Why is it not merging MantaBND:34332:0:1:0:0:0:0:truth with MantaBND:34332:0:1:0:0:0:0:query, and MantaBND:34332:0:1:0:0:0:1:truth with MantaBND:34332:0:1:0:0:0:1:query?

J35P312 commented 2 years ago

Hello! Sometimes SVDB will have a peculiar behaviour!

This is caused by the "--no_intra" option, i.e SVDB is not allowed to merge variants within the same vcf, only between different vcf files.

test_manta.vcf.gz:truth

has the highest priority, and contains

chr3 195691426 MantaBND:34332:0:1:0:0:0:0 G G]chr5:1636168] 626 PASS BND_DEPTH=90;CIPOS=0,12;HOMLEN=12;HOMSEQ=ATTCAGTCCTGG;MATEID=MantaBND:34332:0:1:0:0:0:1;MATE_BND_DEPTH=63;SVTYPE=BND;POS=195691426 GT:FT:GQ:PL:PR:SR 0/1:PASS:626:676,0,999:81,1:59,31 chr5 1636156 MantaBND:34332:0:1:0:0:0:1 T T]chr3:195691438] 626 PASS BND_DEPTH=63;CIPOS=0,12;HOMLEN=12;HOMSEQ=CCAGGACTGAAT;MATEID=MantaBND:34332:0:1:0:0:0:0;MATE_BND_DEPTH=90;SVTYPE=BND;POS=1636156 GT:FT:GQ:PL:PR:SR 0/1:PASS:626:676,0,999:81,1:59,31

Hence, those two variants will never be merged.

The merging is done through a greedy approach:

SVDB will start with the first variant (MantaBND:34332:0:1:0:0:0:0) in the file with highest priority (test_manta.vcf.gz:truth), and compare it to variants in test_manta.vcf.gz. Both variants are similar according to the BND distance, and so, MantaBND:34332:0:1:0:0:0:0 will consume both variants in test_manta.vcf.gz:query; leaving nothing for MantaBND:34332:0:1:0:0:0:1.

I see you are doing this for benchmarking: I suggest you either use the SVDB query, then you can use the OCC tag to check for presence of absence in thruthset, or you merge the variants within truthset and query first, to make sure that there are not many copies of the "same" variant.

Good luck, and feel free to ask if you have other questions! //Jesper

fgvieira commented 2 years ago

Hi and thanks for the quick reply. However, when you say:

SVDB will start with the first variant (MantaBND:34332:0:1:0:0:0:0) in the file with highest priority (test_manta.vcf.gz:truth), and compare it to variants in test_manta.vcf.gz. Both variants are similar according to the BND distance [...]

How are both variants similar if they are on different chromosomes? SVDB will start with the first variant in truth and compare it to the variants in query. The first one matches (since they are the same), but the second variant is in another chromosome! Why does it also get merged?

J35P312 commented 2 years ago

No worries, thanks for your patience =P.

SVDB stores each variant as coordinates: (chrA,posA,chrB,posB). Internally, SVDB will sort the coordinates such that chrA < chrB, and if chrA == chrB, then posA < posB.

Hence

chr3 195691426 MantaBND:34332:0:1:0:0:0:0 G G]chr5:1636168]

will become (chr3,195691426,chr5,1636168)

and

chr5 1636156 MantaBND:34332:0:1:0:0:0:1 T T]chr3:195691438]

(chr3,195691438,chr5,1636156)

have a nice weekend! //jesper