J35P312 / SVDB

structural variant database software
MIT License
37 stars 16 forks source link

Double entries from merge command #16

Open Thatguy027 opened 6 years ago

Thatguy027 commented 6 years ago

Hi there,

I just tested this tool for combining the outputs of Delly, Manta, and TIDDIT. It seems to work great!

I'm not necessarily sure this is an issue and you probably already know -

Here is a command I ran to merge three VCFs for one sample :

svdb --merge --pass_only --no_var --vcf ${dellysv}:delly ${mantasv}:manta ${tidditsv}:tiddit --priority delly,manta,tiddit

Note that I added --pass_only and --no_var after I noticed that when two softwares identify different variants at the same position two lines are output. I'm not sure that I noticed any difference when these flags were there vs when they weren't. From what I can tell, duplicate position lines break downstream processing with bcftools... errors on first occurrence of duplicate record. I haven't tested what would happen if I ran bcftools norm -m +any on the svdb --merge output prior to additional processing, because that is typically used for multi-allelic sites from what I understand.

I got around the issue with a "take first line if the next line has the same CHROM POS" awk command, but wanted to let you know. Maybe, the --priority flag can have that built in if the user requests it?

Another minor thing I noticed is that my chromosome order got rearranged after the --merge command - From I,II,III,IV,V,X,MtDNA to I,II,II,IV,MtDNA,V,X, I am assuming this happens because the output sorts the chromosomes by alpha-numeric?

Apart from that, this tool seems great! I'll let you know if I notice anything else.

Best, S

J35P312 commented 6 years ago

Hello there! That's a tricky issue, it usually happens if the start position of the SV is the same, but the end position differs greatly. I have an example here (cnvnator calls from two individuals):

1 24257001 CNVnator_del_34 N . PASS END=24363000;SVTYPE=DEL;SVLEN=-106000 1 24257001 CNVnator_del_18 N . PASS END=24319000;SVTYPE=DEL;SVLEN=-62000;

These two will not be merged, because their END position differs too much. You may avoid a few of these cases by changing the --overlap and --bnd_distance parameters, but there will always be a risk of getting a few of these. Some --priority flag option sounds like a good idea, I will think about it! In essence we would need some nice way of representing these SV as multiallelic sites.

I just made a fix to SVDB, if your vcf files have "contig entries" in the vcf-header, the chromosomes should be sorted correctly! You are correct, otherwise SVDB will sort the chromosomes by lexiographic order.

Thanks! Happy to hear that the tool works for you! It would be great to know if you find any bugs or ideas of new features! Regards //Jesper

dnil commented 4 years ago

Bump on this one due to a recent MIP issue, although the use case was somewhat unexpected. I could perhaps phrase this as a question: should the docs perhaps state that SVDB merge require input to be split (and normalised)? I have a feeling it won't really handle many-way merging of multi-allelic input entries confidently?

sjurug commented 4 years ago

Hi, @J35P312

We also experience a similar problem

We observed that overlapping variants are not merged when trying to merge sequence resolved and non-sequence resolved variants from Truvari.

I think that this is what happens

Intuitively, I would change posB=posA+int(description["SVLEN"]) -> posB=posA+abs(int(description["SVLEN"])), but maybe you could have a look at it? Maybe also add support for the cases when sequence resolved variants have neither END nor SVLEN?

J35P312 commented 4 years ago

Woopsiedaisy! And thanks for your suggestions!

could you send me the variants you are trying to merge? Then I will look in to this!

//JEsper

sjurug commented 4 years ago

Thank you for looking into this!

I have tried to make a minimal test set consisting of

I have been able to get the expected overlap by either

J35P312 commented 4 years ago

Hello! I have had a look, thanks for the test data! I notice you are correct, I have edited the readvcf function to this:

posB=posA+abs(int(description["SVLEN"]))

and now the variants merge nicely!

1 95690510 MantaDEL:manta_95690510|DEL00000389:delly_95690510 AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT A994 PASS SVTYPE=DEL;SVLEN=-319;END=95690830;VARID=DEL00000389:delly_95690510;set=Intersection GT 1/1

Feel free to give this a try! //JEsper

sjurug commented 4 years ago

Thank you!

sjurug commented 2 years ago

Hi again,

It seems that also INS is affected by a similar issue

Attached are tree INSes that only merge if I use the suggested fix. I have tested this

svdb_16_testdata_ins.tar.gz

J35P312 commented 2 years ago

Hello! Sorry for the delay! I have had a look, and implemented separate treatment of the INS variants. now PosB=PosA (i.e all ins will have lenght 0 in the reference); in essence, the insertions will be treated as single points. I also added a new parameter "--ins_distance" allowing the insertions to be clustered with a different distance than bnd_distance.

these changes are found in the branch "insertion_distance". Feel free to try it!

To me this solves the issue of END=POS+abs(SVLEN), insertions not being merged properly etc in a good enough way. Ideally, we should merge based on sequence similarity, but this is tricky since different callers report insertions in such different formats.

according to these rules, all the insertions in the test data get merged.

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG002 HG003 HG004

1 95690510 MantaINS:test_ins1|MantaINS:test_ins2|MantaINS:test_ins3 A AAAATAAATATAATTGGCCGGGCGCGGTGGCTCACGCCTGTAATCCCAGCACTTTGGGAGGCCGAGGCGGGCGGATCACGAGGTCAGGAGATCGAGACCATCCCGGCTAAAAAACGGTGAAACCCCGTCTCTACTAAAAATACAAAAAATTAGCCGGGCGTAGTGGCGGGCGCCTGTAGTCCCAGCTACTCGGGAGGCTGAGGCAGGAGAATGGCGTGAACCCGGGAGGCGGAGCTTGCAGTGAGCCGAGATCCCGCCACTGCACTCCAGCCTGGGCGACAGAGCGAGACTCCGTCTCAAAAAAAAAAAAAAAAATAAAT 994 PASS END=95690510;SVTYPE=INS;SVLEN=319;VARID=MantaINS:test_ins2|MantaINS:test_ins3;set=Intersection GT 1/1 1/1 1/1

Do you agree with these rules? Feel free to comment! I will do some tests on mobile element insertions etc, probably I will merge the insertion_distance by the end of the week!

sjurug commented 2 years ago

Thanks! I think that this solution is a good first approximation. If we are really afraid of over-estimating the frequency, we can use a very small value such as --ins_distance=0.