Clinical-Genomics / scout

VCF visualization interface
https://clinical-genomics.github.io/scout
BSD 3-Clause "New" or "Revised" License

Many SV variants from a VCF produced by SVDB are recognized as the same variant and not loaded #3492

Open northwestwitch opened 2 years ago

northwestwitch commented 2 years ago

For instance, if you have translocations from chrA:posA to both chrB:posB and chrC:posC, only the translocation from chrA:posA to chrB:posB is loaded as a variant.

Of course, this happens more often in cancer cases than in RD cases.

northwestwitch commented 2 years ago

Example provided by the cancer team:

14  106531322   SV_4005_1   N   ]14:107010501]N .   PASS    SVTYPE=BND;REGIONA=106531322,106531405;REGIONB=107010346,107010501;LFA=17,0;LFB=17,0;LTE=6,0;CTG=.  GT:cn:COV:DR:SR:LQ:RR:RD    0/1:2:30,27.162624934793953,41:6:0:0.0,0.0:0,26:5,33
14  106531322   SV_4006_1   N   ]14:106712012]N .   PASS    SVTYPE=BND;REGIONA=106531322,106531378;REGIONB=106712012,106712150;LFA=15,0;LFB=15,0;LTE=8,0;CTG=CCACATAATCTAAGTGGGACCTCAGCATTGAGCATTCATGGACATAAATGTGCGAATGATAGACACTGTGGACTGCTGGAGAGTGGAGGGAGGGGGTGATGGAATCTGGATTCCAAACCTCAGCATCACTCAATAATCCCATGTGACAAGTCCACACATATGCCCTCTGTATCTGAATGAAAACTTGAAATTAAATAAAAATCCTTATGTGAGAGCTGACTGGAAGCACCAAAGAGGACACTTGTTGTGGAGATTGACCTGCTCCTCATCCTAACTTAGGTGCTGGAGACAAATGTGTGCACATATGTC  GT:cn:COV:DR:SR:LQ:RR:RD    0/1:2:27,24.460027662517287,46:8:0:0.0,0.0:0,28:5,50
14  106712012   SV_4006_2   N   N[14:106531322[ .   PASS    SVTYPE=BND;REGIONA=106531322,106531378;REGIONB=106712012,106712150;LFA=15,0;LFB=15,0;LTE=8,0;CTG=CCACATAATCTAAGTGGGACCTCAGCATTGAGCATTCATGGACATAAATGTGCGAATGATAGACACTGTGGACTGCTGGAGAGTGGAGGGAGGGGGTGATGGAATCTGGATTCCAAACCTCAGCATCACTCAATAATCCCATGTGACAAGTCCACACATATGCCCTCTGTATCTGAATGAAAACTTGAAATTAAATAAAAATCCTTATGTGAGAGCTGACTGGAAGCACCAAAGAGGACACTTGTTGTGGAGATTGACCTGCTCCTCATCCTAACTTAGGTGCTGGAGACAAATGTGTGCACATATGTC  GT:cn:COV:DR:SR:LQ:RR:RD    0/1:2:27,24.460027662517287,46:8:0:0.0,0.0:0,28:5,50
14  107010501   SV_4005_2   N   N[14:106531322[ .   PASS    SVTYPE=BND;REGIONA=106531322,106531405;REGIONB=107010346,107010501;LFA=17,0;LFB=17,0;LTE=6,0;CTG=.  GT:cn:COV:DR:SR:LQ:RR:RD    0/1:2:30,27.162624934793953,41:6:0:0.0,0.0:0,26:5,33
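
To make the example easier to read, here is a minimal sketch that extracts the mate coordinates from the BND ALT strings above. The helper names (`parse_bnd_mate`, `mates_by_start`) are hypothetical and not part of Scout; the regex only covers the bracketed breakend notation from the VCF specification.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple

# Breakend ALT forms in the VCF spec: t[p[, t]p], ]p]t, [p[t  (p is "chrom:pos").
BND_ALT = re.compile(
    r"^(?P<pre>[^\[\]]*)(?P<bracket>[\[\]])"
    r"(?P<chrom>[^\[\]]+):(?P<pos>\d+)"
    r"(?P=bracket)(?P<post>[^\[\]]*)$"
)


class Mate(NamedTuple):
    chrom: str
    pos: int


def parse_bnd_mate(alt: str) -> Optional[Mate]:
    """Return the mate breakend encoded in a BND ALT, or None if ALT is not a BND."""
    match = BND_ALT.match(alt)
    if match is None:
        return None
    return Mate(match.group("chrom"), int(match.group("pos")))


# (CHROM, POS, ID, REF, ALT) copied from the four example records.
EXAMPLE = [
    ("14", 106531322, "SV_4005_1", "N", "]14:107010501]N"),
    ("14", 106531322, "SV_4006_1", "N", "]14:106712012]N"),
    ("14", 106712012, "SV_4006_2", "N", "N[14:106531322["),
    ("14", 107010501, "SV_4005_2", "N", "N[14:106531322["),
]


def mates_by_start(records) -> Dict[Tuple[str, int], List[Optional[Mate]]]:
    """Group mate breakends by the (CHROM, POS) of the record that carries them."""
    grouped: Dict[Tuple[str, int], List[Optional[Mate]]] = {}
    for chrom, pos, _id, _ref, alt in records:
        grouped.setdefault((chrom, pos), []).append(parse_bnd_mate(alt))
    return grouped
```

Grouping this way shows that 14:106531322 carries two different mates (14:107010501 and 14:106712012), which is the multiple-endpoint situation described in the first comment.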

northwestwitch commented 2 years ago

Examples: https://scout-stage.scilifelab.se/cust059/G1A1471p10_Balsamic and https://scout-stage.scilifelab.se/cust083/KMP-00064T-20191996305 . See also my comment from testing the loading of these variants: https://github.com/Clinical-Genomics/scout/pull/3491#issuecomment-1173480360

dnil commented 5 days ago

Right, this situation will become untenable at some point. A suggestion would be to start adding a unique object index to the structural variants, and possibly to add an extra step of detailed checking when an insert fails the uniqueness check on the current _id, where we also look at the other fields.
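
A rough sketch of what that extra checking step could look like, under stated assumptions: `InMemoryCollection` is a hypothetical stand-in for a collection with a unique index on `_id` (a real database would raise its own duplicate-key error instead of `KeyError`), and the field names in `DETAIL_FIELDS` are illustrative, not necessarily the ones stored by Scout.

```python
from typing import Dict

# Hypothetical fields to compare when two documents end up with the same _id.
DETAIL_FIELDS = ("chromosome", "position", "reference", "alternative", "end_chrom", "end")


class InMemoryCollection:
    """Stand-in for a database collection with a unique index on _id."""

    def __init__(self) -> None:
        self.docs: Dict[str, dict] = {}

    def insert(self, doc: dict) -> None:
        if doc["_id"] in self.docs:
            raise KeyError(doc["_id"])  # mimics a duplicate-key failure
        self.docs[doc["_id"]] = doc


def load_variant(coll: InMemoryCollection, doc: dict) -> str:
    """Insert a variant; on an _id clash, check the other fields before giving up."""
    try:
        coll.insert(doc)
        return "inserted"
    except KeyError:
        existing = coll.docs[doc["_id"]]
        same = all(existing.get(field) == doc.get(field) for field in DETAIL_FIELDS)
        # "duplicate": genuinely the same variant; "id_collision": a distinct
        # variant that was dropped only because its _id is not specific enough.
        return "duplicate" if same else "id_collision"
```

With something like this, loading could at least report how many distinct variants are being skipped because of id collisions, rather than silently dropping them.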

dnil commented 5 days ago

So we remember it: for the SNV-SV overlap, @northwestwitch had the idea to add additional callers to the first variant if the variants turn out to be similar enough. This would be excellent there, but of course would not solve the issue of BNDs with multiple different endpoints.
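
A minimal sketch of that idea, purely as an illustration. What counts as "similar enough" is not defined in this thread, so the criterion below (same chromosome, start positions within `max_pos_diff` bases) is an assumption, as are the function and field names.

```python
def similar_enough(existing: dict, incoming: dict, max_pos_diff: int = 10) -> bool:
    """Assumed similarity rule: same chromosome and nearby start positions."""
    return (
        existing["chromosome"] == incoming["chromosome"]
        and abs(existing["position"] - incoming["position"]) <= max_pos_diff
    )


def add_callers_if_similar(existing: dict, incoming: dict, max_pos_diff: int = 10) -> bool:
    """Append the incoming variant's callers to the first variant if they match.

    Returns True when the callers were merged, False when the variants differ
    too much and the incoming one has to be handled separately.
    """
    if not similar_enough(existing, incoming, max_pos_diff):
        return False
    for caller in incoming.get("callers", []):
        if caller not in existing["callers"]:
            existing["callers"].append(caller)
    return True
```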

dnil commented 5 days ago

I currently do not understand, though, why we wouldn't get unique ids from the example above, although I know it can happen with some callers. The (ALT, REF) field pairs all look unique?

northwestwitch commented 5 days ago

Unless we just use the chrom - start positions in the parsing
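
To illustrate that hypothesis with the four example records: a small sketch comparing ids built from chrom and start position only against ids that also include REF and ALT. The `make_id` helper and the choice of md5 are assumptions for the illustration, not a statement of how Scout builds its ids.

```python
import hashlib

# (CHROM, POS, REF, ALT) copied from the four example records.
RECORDS = [
    ("14", 106531322, "N", "]14:107010501]N"),
    ("14", 106531322, "N", "]14:106712012]N"),
    ("14", 106712012, "N", "N[14:106531322["),
    ("14", 107010501, "N", "N[14:106531322["),
]


def make_id(*parts) -> str:
    """Hypothetical id: md5 over the underscore-joined parts."""
    joined = "_".join(str(part) for part in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Ids from chrom and start position only: the two records at 14:106531322 collide.
ids_pos_only = {make_id(chrom, pos) for chrom, pos, _ref, _alt in RECORDS}

# Ids that also include REF and ALT: every record gets its own id.
ids_with_alleles = {make_id(chrom, pos, ref, alt) for chrom, pos, ref, alt in RECORDS}

print(len(ids_pos_only), len(ids_with_alleles))  # prints "3 4"
```

So if only chrom and start went into the id, one of the two BNDs starting at 14:106531322 would be lost, whereas including ALT keeps all four apart.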

dnil commented 5 days ago

It should be this, also for SVs?! https://github.com/Clinical-Genomics/scout/blob/6086937b4cb0da38fe52234a813fd14996d79a7e/scout/parse/variant/variant.py#L70

Let us check the cyvcf2 parsing when time allows. We could either check some existing case with some decent callers and see how the ALTs look in the db, or simply print debug output while parsing a small test.

dnil commented 1 day ago

Without having tested the variants at hand: in general, at least, parsing ALTs via cyvcf2 into the db seems ok. See the alternative for this BND:

[Image: Screenshot 2024-10-21 at 13 48 07]