Open Thatguy027 opened 6 years ago
Hello there! That's a tricky issue, it usually happens if the start position of the SV is the same, but the end position differs greatly. I have an example here (cnvnator calls from two individuals):
1 24257001 CNVnator_del_34 N . PASS END=24363000;SVTYPE=DEL;SVLEN=-106000
1 24257001 CNVnator_del_18 N . PASS END=24319000;SVTYPE=DEL;SVLEN=-62000;
These two will not be merged, because their END position differs too much. You may avoid a few of these cases by changing the --overlap and --bnd_distance parameters, but there will always be a risk of getting a few of these. Some --priority flag option sounds like a good idea, I will think about it! In essence we would need some nice way of representing these SV as multiallelic sites.
I just made a fix to SVDB, if your vcf files have "contig entries" in the vcf-header, the chromosomes should be sorted correctly! You are correct, otherwise SVDB will sort the chromosomes by lexiographic order.
Thanks! Happy to hear that the tool works for you! It would be great to know if you find any bugs or ideas of new features! Regards //Jesper
Bump on this one due to a recent MIP issue, although the use case was somewhat unexpected. I could perhaps phrase this as a question: should the docs perhaps state that SVDB merge require input to be split (and normalised)? I have a feeling it won't really handle many-way merging of multi-allelic input entries confidently?
Hi, @J35P312
We also experience a similar problem
We observed that overlapping variants are not merged when trying to merge sequence resolved and non-sequence resolved variants from Truvari.
I think that this is what happens
posB=posA+int(description["SVLEN"])
, but for DEL case SVLEN is negativeposA
and posB
being interchanged.Intuitively, I would change posB=posA+int(description["SVLEN"])
-> posB=posA+abs(int(description["SVLEN"]))
, but maybe you could have a look at it?
Maybe also add support for the cases when sequence resolved variants have neither END nor SVLEN?
Woopsiedaisy! And thanks for your suggestions!
could you send me the variants you are trying to merge? Then I will look in to this!
//JEsper
Thank you for looking into this!
I have tried to make a minimal test set consisting of
I have been able to get the expected overlap by either
Hello! I have had a look, thanks for the test data! I notice you are correct, I have edited the readvcf function to this:
posB=posA+abs(int(description["SVLEN"]))
and now the variants merge nicely!
1 95690510 MantaDEL:manta_95690510|DEL00000389:delly_95690510 AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT A994 PASS SVTYPE=DEL;SVLEN=-319;END=95690830;VARID=DEL00000389:delly_95690510;set=Intersection GT 1/1
Feel free to give this a try! //JEsper
Thank you!
Hi again,
It seems that also INS is affected by a similar issue
posB=END
internally instead of defaulting to posB=POS+abs(SVLEN)
in readVCF.py
posB=POS+abs(SVLEN)
as default (and posB=END
if there is no SVLEN
), the variants will mergeEND=POS+abs(SVLEN)
, which does not work for INS. Again this does not matter when using SVDB for querying, but even after the suggested fix it could potentially lead to some confusion for people investigating the VCF manually.Attached are tree INSes that only merge if I use the suggested fix. I have tested this
Hello! Sorry for the delay! I have had a look, and implemented separate treatment of the INS variants. now PosB=PosA (i.e all ins will have lenght 0 in the reference); in essence, the insertions will be treated as single points. I also added a new parameter "--ins_distance" allowing the insertions to be clustered with a different distance than bnd_distance.
these changes are found in the branch "insertion_distance". Feel free to try it!
To me this solves the issue of END=POS+abs(SVLEN), insertions not being merged properly etc in a good enough way. Ideally, we should merge based on sequence similarity, but this is tricky since different callers report insertions in such different formats.
according to these rules, all the insertions in the test data get merged.
1 95690510 MantaINS:test_ins1|MantaINS:test_ins2|MantaINS:test_ins3 A AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT 994 PASS END=95690510;SVTYPE=INS;SVLEN=319;VARID=MantaINS:test_ins2|MantaINS:test_ins3;set=Intersection GT 1/1 1/1 1/1
Do you agree with these rules? Feel free to comment! I will do some tests on mobile element insertions etc, probably I will merge the insertion_distance by the end of the week!
Thanks!
I think that this solution is a good first approximation. If we are really afraid of over-estimating the frequency, we can use a very small value such as --ins_distance=0
.
Hi there,
I just tested this tool for combining the outputs of Delly, Manta, and TIDDIT. It seems to work great!
I'm not necessarily sure this is an issue and you probably already know -
Here is a command I ran to merge three VCFs for one sample :
svdb --merge --pass_only --no_var --vcf ${dellysv}:delly ${mantasv}:manta ${tidditsv}:tiddit --priority delly,manta,tiddit
Note that I added
--pass_only
and--no_var
after I noticed that when two softwares identify different variants at the same position two lines are output. I'm not sure that I noticed any difference when these flags were there vs when they weren't. From what I can tell, duplicate position lines break downstream processing with bcftools... errors on first occurrence of duplicate record. I haven't tested what would happen if I ranbcftools norm -m +any
on thesvdb --merge
output prior to additional processing, because that is typically used for multi-allelic sites from what I understand.I got around the issue with a "take first line if the next line has the same CHROM POS" awk command, but wanted to let you know. Maybe, the
--priority
flag can have that built in if the user requests it?Another minor thing I noticed is that my chromosome order got rearranged after the
--merge
command - From I,II,III,IV,V,X,MtDNA to I,II,II,IV,MtDNA,V,X, I am assuming this happens because the output sorts the chromosomes by alpha-numeric?Apart from that, this tool seems great! I'll let you know if I notice anything else.
Best, S