eldariont / svim-asm

Structural Variant Identification Method using Genome Assemblies
GNU General Public License v3.0

Low number of interspersed duplications? #6

Open GuillaumeHolley opened 3 years ago

GuillaumeHolley commented 3 years ago

Hi,

I have recently aligned the CHM13 v1.0 assembly from the T2T consortium to the reference genome GRCh38.p13. I subsequently tried to detect SVs from the CHM13 alignment by using the following svim-asm command:

svim-asm haploid --max-sv-size 1000000 --sample CHM13 --reference_gap_tolerance 1000 --reference_overlap_tolerance 1000 --query_gap_tolerance 2000 --query_overlap_tolerance 2000 ./svim_chm13 chm13.bam hg38_p13.fa

This command results in only 4 interspersed duplications, which seems quite low to me given the completeness of CHM13. I am also a little confused by these duplications: the first 3 of them end at the same location on GRCh38.p13 but have very different lengths (all of them are PASS). Does this mean that these are 3 different duplications occurring at different places in CHM13, or is it the same duplication for which svim-asm cannot tell which of the 3 lengths is the correct one?

Thank you for the help and the great software!

Guillaume

eldariont commented 3 years ago

Hi Guillaume,

thanks for reporting this issue and sorry for my late reply.

Interspersed duplications in SVIM-asm are defined by a) a source region and b) an insertion location where the additional copy of the source region has been inserted. By default, interspersed duplications in the output VCF have SVTYPE=DUP and specify only the source region (defined by POS and END in the VCF record). The three overlapping duplications that you observe in your results are probably different duplications of a similar region that have been inserted at different places in CHM13. As a small example, take ref=ABCDEFG and assembly=ABCDCEFCG: the C in the ref has been duplicated twice in the assembly, at two different insertion locations.
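The toy example can be written out as a minimal Python sketch. The `InterspersedDup` class and the `apply_duplications` helper are hypothetical names used only for illustration; they are not part of SVIM-asm, and the 0-based coordinates are an assumption of this sketch rather than the VCF convention. It shows how two duplications can share the same source region while differing only in their insertion location:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterspersedDup:
    """Hypothetical record for illustration only (not SVIM-asm's data model)."""
    source_start: int  # 0-based, inclusive, on the reference
    source_end: int    # 0-based, exclusive, on the reference
    insert_pos: int    # 0-based reference index before which the copy is inserted


def apply_duplications(ref, dups):
    """Insert a copy of each source region at its insertion position."""
    result = ref
    # Apply insertions right-to-left so that earlier insertions
    # do not shift the coordinates of later ones.
    for d in sorted(dups, key=lambda d: d.insert_pos, reverse=True):
        copy = ref[d.source_start:d.source_end]
        result = result[:d.insert_pos] + copy + result[d.insert_pos:]
    return result


ref = "ABCDEFG"
# The same source region (the "C" at index 2) is inserted at two different places.
dups = [
    InterspersedDup(source_start=2, source_end=3, insert_pos=4),  # between D and E
    InterspersedDup(source_start=2, source_end=3, insert_pos=6),  # between F and G
]
assembly = apply_duplications(ref, dups)

# Reporting only the source region makes both events look identical.
source_regions = {(d.source_start, d.source_end) for d in dups}
print(assembly, source_regions)
```

Because only the source region is reported, both events collapse to the same interval here, even though they are distinct insertions in the assembly.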

Like you, I am very surprised that SVIM-asm detects only 4 interspersed duplications between CHM13 and GRCh38. I have a few potential explanations and have started my own quick analyses to confirm them. I will let you know as soon as I have found out what is going on.

Cheers

David