Long complex variant mis calling and wrong depth information

tahuh commented 4 years ago

Hi.

I found a certain variant from our sample and concluded it was originated from other species which was called using VarDict.

chr19 33302104 CCGCCGCCGCCGCCGCCCGTGGGGCCCA TTGCCGCCTCCTGCGGGGGCCG

The alignment is shown below (brackets are the mutated loci bases)

REF - [CC]GCCGCC[G]CC   GC[CGCCCGT]GGGG[C]CC[A]
ALT - [TT]GCCGCC[T]CC[T]GC         GGGG[G]CC[G]

First two Cs are substituted to 2 Ts
Substitution of G>T
T insertion
CGCCCGT deletion (7mers)
Substitution of C>G
Substitution of A>G

To generate BAM file , FASTQ reads generated from contamination source genome using this code with approximately 30X coverage was mapped to human genome.

Using above BAM file, I inspected the exactly same region as IGV image shown below

IGV image confirmed that above variant is true variant originated from contamination source so I looked into the VCF generated by VarDict using same command.

command

VarDict -G ref.fa -q 0 -r 0 -Q 0 -f 0.0 -N sample -b in.bam -c 1 -S 2 -E 3 -G 4 target.bed | teststrandbias.R | var2vcf_valid.pl -N sample -E -f 0.0 > contam.vcf

But the variant generated by VarDict at the same genomic coordinate using this BAM file is differ from the previous one

chr19 33302104 . CC TT 117 NM5.25 SAMPLE=bovine;TYPE=Complex;DP=9;VD=9;AF=1;BIAS=0:0;REFBIAS=0:0;VARBIAS=9:0;PMEAN=41.7;PSTD=1;QUAL=37;QSTD=0;SBF=1;ODDRATIO=0;MQ=60;SN=18;HIAF=1.0000;ADJAF=0;SHIFT3=0;MSI=6;MSILEN=3;NM=6.7;HICNT=9;HICOV=9;LSEQ=CGCCCGGGTAGTCAAAGTCG;RSEQ=GCCGCCGCCGCCGCCCGTGG;DUPRATE=0;SPLITREAD=0;SPANPAIR=0 GT:DP:VD:AD:AF:RD:ALD 1/1:9:9:0,9:1:0,0:9,0

Also there is a problem of read depth calculation. As seen in VCF record, read depth (9) is smaller than actual read depth (18). To overcome the internal VarDict's read filtration by (-m) parameter, I set this to 150 and got below which is now over estimates read depths (25 vs 18)

chr19 33302104 . CC TT 154 NM5.25 SAMPLE=bovine;TYPE=Complex;DP=25;VD=18;AF=0.72;BIAS=0:1;REFBIAS=0:0;VARBIAS=18:0;PMEAN=51.5;PSTD=1;QUAL=37;QSTD=0;SBF=1;ODDRATIO=0;MQ=59.8;SN=36;HIAF=0.5455;ADJAF=0;SHIFT3=0;MSI=6;MSILEN=3;NM=8.5;HICNT=18;HICOV=33;LSEQ=CGCCCGGGTAGTCAAAGTCG;RSEQ=GCCGCCGCCGCCGCCCGTGG;DUPRATE=0;SPLITREAD=0;SPANPAIR=0 GT:DP:VD:AD:AF:RD:ALD 1/0:25:18:0,18:0.72:0,0:18,0

Is this issue(depth problem) is related to this issue?

TL;DR

Why VarDict cannot call the same variant?
Why VarDict's estimation of read depth is differ from the exact read depth?

Thanks in advance.

PolinaBevad commented 4 years ago

Hi @tahuh ,

Sorry that you have the issue!

What versions of VarDict were used for processing? This issue isn't related to #264, as there was a problem with SV (depths for them are calculated in a pretty complicated manner). Can you please try to run this region with -p option, does the depth stays wrong then? This option will output all variants (including "reference") and disable filtering.

And is it possible to share the slice of BAM file that you used, so I can analyze the problem more deeply?

Thank you.

tahuh commented 4 years ago

Hi @PolinaBevad

Sorry for my late reply.

The version I'm currently using is 1.7.0 With -p option, same result for depth is generated

I'm attatching BAM file below for the region above that I mentioned.

File contents
- BAM file
- Index file

BAM file and index are created command below

samtools view -bhT reference.fa -o bovine.target.bam sample.bam
samtools index bovine.target.bam

Thanks.

Desktop.zip

PolinaBevad commented 4 years ago

Hi @tahuh,

Thank you for the data!

The complex variant appears, but I think it is filtered out by var2vcf_valid.pl script in your run. VarDict finds two variants on position 33302104 and output by default only the one with the highest AF. Try to add -A option to output all variants on position, i.e.: ... | var2vcf_valid.pl -N sample -E -f 0.0 -A

You must get both variants: CC->TT and CCGCCGCCGCCGCCGCCCGTGGGGCCCA -> TTGCCGCCTCCTGCGGGGGCCG. I will also analyze what the problem is with depth. I see some problems with local realignment (when we combine variant), and I will try to find out what is happening.

PolinaBevad commented 4 years ago

@tahuh , I found what is the reason.

Some reads in the BAM file have soft-clipped bases. VarDict doesn't skip them, but tries to use them when combine long complex variants if S-bases are high-quality. Also we use them in SNV on the end of the reads.

If you add soft-clips in IGV (Preferences-Alignments-Show soft-clipped bases), you can see, that depth is higher because of these S bases:

So I don't think that everything goes wrong. The controversial part can be that we don't increase variant depth for CC->TT, but that's the way how VarDict deals with S-bases.

tahuh commented 4 years ago

@PolinaBevad Thanks for the reply! Now I understood the depth issue but for the variant mentioned above, which should I trust as true variant?

PolinaBevad commented 4 years ago

Hi @tahuh, I think yes, because these soft-clips correspond to the reference and they are all of good quality. Some of them were misaligned by aligner (reads with S-bases at the top on the previous screenshot), but VarDict realigned them.

tahuh commented 4 years ago

Hi @PolinaBevad

Thanks for kind reply :)

My last question is why VarDict gave different variant results for exactly same loci?

My first result was chr19 33302104 CCGCCGCCGCCGCCGCCCGTGGGGCCCA TTGCCGCCTCCTGCGGGGGCCG

The secone one is chr19 33302104 CC TT

I can get both of variants if I use -A option as you mentioned above but I'm a bit curious why I cannot replicate the result with same command?

It will be complicated for downstream analysis to judge if the first one is correct variant or second one is correct when one gets the both variant.

PolinaBevad commented 4 years ago

Hi @tahuh , Sorry for the late answer!

Do you mean that you run VarDict twice with the same command and the results were different? I It can be so only if AF is the same between variants on one position so their order can be changed, but it mustn't be so for these variants. Are you sure that other parameters in command line are the same?

tahuh commented 4 years ago

Hi @PolinaBevad

First I have to make the first question of this thread be clear.

The longer variant was observed when I was look into human genome sequencing data and found a source of contamination originated from bovine using BLAST/BLAT.

To remove reads carrying variants from bovine sequence, I created a simulated reads from bovine genome and mapped this onto human reference and called vardict to list all the possible bovine variants compared to human.

After this I ran vardict exactly the same.

The longer variant was called as complex variant in "a single VCF record line". At the second run, longer variant was separated into several parts(several VCF record lines) and called as shown in the first question.

Does this problem can be arose because of the different situation? If so, could you briefly explain the complex variant calling algorithm of vardict? I read paper but the description was not sufficient enough to understand the situation above.

In my opinion, whether the mixture was 100% or not, the variant caller should find the exactly the same complex variant block.

Consider one sequenced a tissue of certain type of tumor at the same region with different tumor purity such as 10%, 30% and 100%. If the variant caller's algorithm depends on the mixture amount(in this case tumor purity), then above three experiment variant call result might be different.

Can this happen in current algorithm?

Thanks

AstraZeneca-NGS / VarDictJava

Long complex variant mis calling and wrong depth information #282