Wrong SVLEN values - Githubissues

ricardovialle commented 5 years ago

Dear David,

I'm running SVIM 1.2.0 on some PACBIO samples and I noticed that on a few occasions the SVLEN field is showing wrong values. For example, this tandem duplication should be ~486kb but SVLEN is showing 972kb.

17 44165263 svim.DUP_TAN.511 N <DUP:TANDEM> 39 not_fully_covered SVTYPE=DUP:TANDEM;END=44651666;SVLEN=972806;SUPPORT=32;STD_SPAN=307.46;STD_POS=150.1 GT:DP:AD ./.:.:.,.

For other samples the SVLEN looks fine. Am I missing something?

Thanks!

eldariont commented 5 years ago

Dear Ricardo,

first of all, thanks for reporting this issue. It seems as if SVIM is detecting a tandem duplication with 2 additional copies here. The VCF record expresses that the region 17: 44165263-44651666 with a length of 486kb is duplicated and that 972kb of sequence (twice the length of the original region) are inserted at the end position 44651666. So if we call the 486kb region B and its neighboring regions A and C, then the reference region ABC gets replaced by ABBBC. So far about the interpretation of the VCF record.

But there is something weird going on here. Tandem duplications are quite a special class of SVs and SVIM detects them from split reads with overlapping alignment segments. Let's say we have a read ABCDEFGHIJK and the second part overlaps the first part when aligned on the reference:

REF  ----------------
ALN1             ABC
ALN2       DEFGHIJK

Tandem duplications with copy number 2 (as reported in your record) will look like this:

REF  ----------------
ALN1           ABC
ALN2       DEFGHIJ
ALN3       K

The weird thing is that your record represents a huge duplication that cannot possibly be covered by a single PacBio read. I would like to look into this in more detail but am at a conference currently. I will hopefully have a look next week.

Cheers David

eldariont commented 5 years ago

Hi Ricardo,

sorry for not getting back to this earlier. Does my last post make sense to you and if yes, do you observe wrong SVLEN fields only for tandem duplications or also for other SV classes?

To better understand what is going on in the tandem duplication, it would help a lot if I could have a look at the read alignments in this region. Would it be possible for you to send me a subset of your read alignments, e.g. using samtools view -b <BAM> 17:44165263-44700000 which grabs only alignments from this region? Alternatively, you can run SVIM with the option --read_names which adds the name of the read from which the tandem duplication was called in the VCF output. Then you can have a closer look at the alignments of this particular read to check what might be peculiar about it.

Best David

ricardovialle commented 5 years ago

Hi David,

Thanks a lot for clarifying that. I think that these results make sense, this is a region with large duplications and different copies across the population. I see these "wrong" SVLEN fields only for two tandem duplications (that one with 486kb and another with 268kb, see below). I'm not seeing for other classes of SVs.

17 44165263 svim.DUP_TAN.511 N <DUP:TANDEM> 39 not_fully_covered SVTYPE=DUP:TANDEM;END=44651666;SVLEN=972806;SUPPORT=32;STD_SPAN=307.46;STD_POS=150.1 GT:DP:AD ./.:.:.,. 17 44165276 svim.DUP_TAN.512 N <DUP:TANDEM> 31 not_fully_covered SVTYPE=DUP:TANDEM;END=44433770;SVLEN=536988;SUPPORT=25;STD_SPAN=247.65;STD_POS=126.97 GT:DP:AD ./.:.:.,.

I'm sending you the alignments for this region by email in case you want to take a look. Also, I'm wondering if these cases could be represented as copy numbers (FORMAT:CN) instead of summing the length on the SVLEN field?

Thank you very much for your support!

Cheers, Ricardo

eldariont commented 4 years ago

Hi Ricardo, I finally had time to look at the alignments you sent me. Sorry again for the delay.

468kb duplication For the 468kb duplication, SVIM finds 31 reads that support one additional copy of 286kb. One read (...14454/25928_38829), however, is split in four consecutive pieces A, B, C and D by ngmlr and aligned like this:

17:44165kb        17:44651kb
---B--->          ---A--->
<---C---          <---D---

So both the breaks between A and B and between C and D look like evidence for another duplication copy. Because there are two such pieces of evidence from the same read, SVIM detects a copy number of 2 (it takes the maximum copy number across all reads detecting this duplication). The problem is that both breaks are on opposite DNA strands (A and B on + and C and D on -) which does not make much sense. I should add a filter for this kind of special case to prevent joining of duplication signatures from opposite strands.

Taking all the evidence, I would say it's pretty clear that both loci (17:44651kb and 17:44165kb) are joined together in this genome which is usually caused by a tandem duplication. Do you see an elevated read coverage in this region compared to the surrounding region? There is no good evidence of a second copy in the alignments so I think that the region 17: 44165263-44651666 is only duplicated once.

268kb For the 268kb duplication, SVIM finds 24 reads that support one additional copy of 268kb. Again, one read (...70363/13232_25647) is split in four consecutive pieces in a similar fashion as above so the same applies here.

On your other questions:

Yes, it would be good to output the copy number in the CN field. I will put this on my to-do list.
I think that default parameters with --max_sv_size set to your liking should be fine. Keep in mind, however, that --max_sv_size also serves as a way of filtering unrealistic events such as huge deletions from split reads mapping far apart. Setting this parameter to 10Mb will help you detect large SVs but might also produce some false positives.

I hope this clears up a bit what might be behind the tandem duplication calls you get :)

Cheers David

eldariont / svim

Wrong SVLEN values #24