Using SVIM for the detection of large somatic SVs

hr283 commented 5 years ago

Hi David,

I'm trying to call large (>10kb) somatic structural variants in a tumour sample for which I have ~80X coverage of ONT reads. From previous analysis conducted on this sample, I know there are at least 45 deletions, 10 inversions and 6 tandem duplications, all in the 10kb - 100Mb length range. The allele frequency of somatic variants is about 30% in this sample.

I have run SVIM (v0.5.0) with the following parameters:

svim alignment $WORKSPACE $BAM --sample $SAMPLE --min_mapq 1 --min_sv_size 10000 --max_sv_size 100000000

For the somatic deletions, the recall is not bad: 32 are detected. However only one of the inversions and none of the duplications are included in the results vcf.

I was wondering whether you had an suggestions for tweaking the parameters to improve sensitivity for the detection of large, somatic SVs?

Many thanks, Hannah

eldariont commented 5 years ago

Hi Hannah,

thanks for your message. I'm not completely sure about the reasons for the bad recall but I'll try to give you a few hints and ask a few questions.

Do you know the genomic locations of the SVs you are expecting to find? If yes, it would be interesting to look at the alignments in the region and the SVs called by SVIM there. Can we see read alignments that clearly indicate the missed event or are the alignments hard to interpret?
Do you apply a filtering (e.g. by score) before checking for the SVs you are expecting to find?
Tandem duplications can be detected only when different segments of the same read are aligned overlapping each other on the reference (see schema below). It depends on the aligner whether it splits the reads into such segments or aligns them in one go with a long insertion in the middle. In the second case, tandem duplications will be called as novel insertions. Do you have the chance to look at the alignments over one of the missed duplications? Are the reads split into segments or do they contain an insertion at one of the breakpoints?

REF  -----------------------------
SEG1 --------------->
SEG2          ------------------>
DUP           =======

Another problem could be the --min_sv_size parameter. For duplications, the SV size represents the size of the duplicated element. So if there is an element of 5kp that is duplicated 5 times (total inserted sequence length = 25kb) it will not be called because the element size is below your threshold. I would recommend to use the default value (40bp) for this parameter and to filter for size in post-processing.
Inversions can be detected only when different segments of the same read are aligned in different orientations (see schema below). Again, it would be nice to look at the alignments in one of the regions you are interested in. Are there gaps between the read segments on the read? Or overlaps? Both is possible with Nanopore data due to the relatively high error rate. SVIM has two parameters, --segment_gap_tolerance and --segment_overlap_tolerance that define how much gap or overlap is tolerated. You could try whether setting both to 20bp improves the results.

REF  -----------------------------
SEG1 -------->
SEG2                <----------
INV           =================

I hope that these points are of any help. Yesterday, I also released version 1.0.0 which among other features such as genotyping expands the help texts of the parameters when calling svim --help.

Cheers David

hr283 commented 5 years ago

Hi David,

Thanks very much for your helpful reply. In response to your questions:

Yes, I do know the genomic locations. I have examined the ngmlr alignments in IGV at all of the locations and in all cases the alignments seem to clearly support the event, though in a couple of cases the allele frequency/read support is low.
I had filtered based on q5, but then I went back to the original un-filtered vcf to check whether some of the SVs had been detected but with low quality. All 6 of the duplications were still absent from this vcf, while the remaining 9/10 inversions had been detected but with QUAL score of 0 (see point 5).
Since the tandem duplications I am looking at are large, and the average read length for this sample is only 5-6kb, the aligner has split the reads but on the most part they don't cover the whole duplication. E.g:

REF     -----------------------------
R1SEG1            --->
R1SEG2         --->
DUP            =======

How would these be called by SVIM? I have looked to see whether there are insertions of any type reported by SVIM near the known genomic locations but haven't found any.

That's a helpful point about SV size, thank you. But I believe (from looking at read alignments and copy number profiles), that each of the particular DUPs I am interested in are duplications of an element longer than 10kb. I have already generated calls with --min_sv_size = 50, so I checked for any sign of the duplications being reported in this vcf too and couldn't find them.
Adjusting the --segment_gap_tolerance and --segment_overlap_tolerance seems to have had some effect in terms of increasing read support values for the inversions of interest, however the qual scores for 9/10 of them are still 0. The remaining call has a good quality score and high read support. I have copied below the vcf fields from ALT onwards for the one inversion that passed filtering and two that have good read support but QUAL of 0.
```
<INV>   26      PASS    SVTYPE=INV;END=89185669;SUPPORT=52;STD_SPAN=5.553116608932158;STD_POS=4.109961044686265
<INV>   0       q5      SVTYPE=INV;END=33764220;SUPPORT=26;STD_SPAN=6.622572124066415;STD_POS=2.366106831582534
<INV>   0       q5      SVTYPE=INV;END=64948134;SUPPORT=31;STD_SPAN=20.28766244079953;STD_POS=7.1185431033308 
```
How is the QUAL score calculated for inversions - do you have any idea why it might be so low for these large inversions?

Thank you for updating the help text. I will download the latest release as I would be interested in knowing the genotyping information.

Best wishes, Hannah

eldariont commented 5 years ago

Hi Hannah,

thanks for answering all these questions. I think I know now what might be going on :)

Tandem duplications To call a tandem duplication from your example in point 3, SVIM requires R1SEG1.end > R1SEG2.start (which is true in your case) but also R1SEG2.end > R1SEG1.start. This second requirement is probably the problem because your duplications are so large. I added this second requirement to distinguish tandem duplications from other translocations. Only when the second requirement is met, the read is a clear evidence of a tandem duplication because it covers the same reference region twice. In your case, the reads only support a jump from the end of the duplication to its start. Nevertheless it would be better for large duplications to soften the second requirement. I just implemented this change in a separate branch of this repository. Could you reinstall SVIM with pip like this and rerun it on your data:

git clone https://github.com/eldariont/svim.git
cd svim
git checkout large_tandups
pip3 install --upgrade --user .

Inversions For inversions, SVIM distinguishes four different evidence types:

REF -------------------------------------
INV           =============
EV1     --1-->       <--2--
EV2     <-1---       ---1->
EV3           --2-->       <--1--
EV4           <-1---       ---2->

EV1 and EV2 support the left breakpoint of the inversion and EV3 and EV4 support the right breakpoint. SVIM computes the score of an inversion basically with the formula of min((EV1+EV2), (EV3+EV4)). Could it be that your reads supporting the inversion are all covering only one of the breakpoints? In that case, the score would be 0 despite the high support.

Again, thank you very much for reporting these issues. This kind of feedback is super valuable for me to learn about limitations of my tool and to improve it :)

Best, David

hr283 commented 5 years ago

Hi David,

Sorry for being so slow to get back to you - I was on annual leave for a while and then busy with other projects.

I have run the large_tandups branch on my data and it does now return all the large tandem duplications I was expecting, plus a few more which I will need to assess in IGV. It seems that the end position of the duplication is reported in both the 'POS' column and the END info field - is this by design?

On inversions, if could well be that the reads supporting the inversions are all covering only one of the breakpoints. From memory, some of the inversions have large deletions (>1kb) both sides of them. I will try to get back to you with some more concrete examples.

Thanks again, and glad to be of help!

Hannah

eldariont commented 5 years ago

Hi Hannah,

no worries, I'm glad the detection of tandem duplications works better now. Yes, the POS and END fields are equal by design in an attempt to follow the VCF specification. It says that "For precise variants, END is POS + length of REF allele−1". If the tandem duplication is represented as an insertion, the REF allele is N (N being whatever base comes in front of the insertion) and the ALT allele is N + the inserted sequence. In that case, POS=END. Representing tandem duplications as insertions makes it easier to compare against gold standard sets which distinguish only deletions and insertions. For other use cases, however, it's probably more natural to have the duplicated region as REF allele in which case END=POS+length of duplicated piece. I might need to add a parameter that allows the user to choose between both representations.

I'm looking forward to seeing the inversion examples. David

hr283 commented 5 years ago

Hi David,

Thanks for the explanation of the VCF specification. The other call sets I am using (e.g. Illumina+Manta) had reported END=(POS+length of duplicated piece), and when comparing the VCFs by eye I had been scanning the POS column to see whether the same duplications had been reported, so I initially didn't spot that the duplications were there in the new SVIM calls. However, I can see that it makes just as much sense to report these as insertions occurring at the end of the duplicated section.

I've looked back over the inversions. The one case that was called by SVIM with QUAL>0 is the only 'straightforward' inversion in my data set. The others are part of complex SVs, mostly with deletions either side, for example:

REF --------------------------------------
INV            =============
DEL       =====             =======
EV1 --1-->            <--2--
EV2 <-1---            ---1->
EV3             --2-->             <--1--
EV4             <-1---             ---2->

The deleted sections range in size a lot, from ~100 bases to ~800kb. In one case there is also a deletion within a larger inserted segment (as well as the deletions either side).

I hope that information is helpful.

Best, Hannah

eldariont commented 5 years ago

Hi Hannah,

thanks for looking at the inversions again. The examples made it a lot clearer to me why SVIM had problems with these inversions. I'm thinking whether it would make sense for SVIM to merge overlapping inversion calls (like those from EV1-2 and EV3-4 in your example). Maybe I will add that functionality soon.

By the way: Just a few minutes ago (1d0a831) I added a command-line parameter that let's the user choose how duplications are represented in the VCF. By default, SVIM will now output tandem and interspersed duplications with SVTYPE=DUP and POS and END will point to the start and end of the duplicated region, respectively, as you had expected. When called with --duplications_as_insertions, however, duplications are output as SVTYPE=INS and POS and END will both point to the insertion locus.

I hope that makes working with SVIM more convenient to you. Best, David

eldariont commented 5 years ago

Hi Hannah, I hope it's okay if I close this issue for now. Feel free to reopen if you observe any other problems with SVIM in your dataset. Best David

eldariont / svim

Using SVIM for the detection of large somatic SVs #14