abiswas-odu / Disco

Multi-threaded Distributed Memory Overlap-Layout-Consensus (OLC) Metagenome Assembler
GNU General Public License v3.0
24 stars 1 forks source link

callvariants.sh interpreting very high frequency alleles as low frequency alleles making them absent from VCFs #13

Open OwenDonohoe opened 2 years ago

OwenDonohoe commented 2 years ago

Hi

From looking at my alignment in IGV I can see that there is a 3bp insertion in almost every read at a specific locus (see pic attached)

IGV_screenshot_05-06-22

out of 775 bases at this position 658 are inserts

When I run the following command, both it and many other very high frequency variants are absent from the resulting vcf

callvariants.sh in=Sample7_ST-J1_test.sam ref=../Genomes/ST-J1.fasta minallelefraction=0.50001 rarity=0.50001 minreadmapq=20 minscore=5 out=Sample7_min_frac0.5_Mapq20.vcf overwrite=T

I only observe them when I use the clearfilters option

callvariants.sh in=Sample7_ST-J1_test.sam ref=../Genomes/ST-J1.fasta minallelefraction=0.50001 rarity=0.50001 clearfilters minreadmapq=20 minscore=5 out=Sample7_min_frac0.5_Mapq20_Readq5_CF.vcf overwrite=T

When I look at the allele frequency for these variants from the second output I can see that callvariants.sh has estimated the frequencies of many variants to be much lower that they actually are, and I suspect that this is why they are missing form the first output

Here is the output for this deletion variant from the second command with cleared filters, AF=0.0194;RAF=0.0212, which is much lower than expected.

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample7_ST-J1_test

NC_019495.1 289042 . A ACAC 5.21 PASS SN=0;STA=289042;STO=289042;TYP=INS;R1P=0;R1M=7;R2P=0;R2M=5;AD=12;DP=618;MCOV=-1;PPC=12;AF=0.0194;RAF=0.0212;LS=1999;MQS=720;MQM=60;BQS=435;BQM=38;EDS=525;EDM=98;IDS=11749;IDM=986;NVC=0;FLG=0;CED=601;HMP=0;SB=0.0703 GT:DP:AD:AF:RAF:NVC:FLG:SB:SC:PF 0:618:12:0.0194:0.0212:0:0:0.0703:5.21:PASS

I do not fully understand why I am seeing such low frequency for almost all of my high frequency variants.

I tried changing minreadmapq=0

callvariants.sh in=Sample7_ST-J1_test.sam ref=../Genomes/ST-J1.fasta minallelefraction=0.50001 rarity=0.50001 minreadmapq=0 minscore=5 out=Sample7_min_frac0.5_Mapq0.vcf overwrite=T

But without "clearfilters" this variant still does not appear in the VCF. The only way I can get it to appear in the VCF without using "clearfilters" is by setting rarity sufficiently low to capture rare variants, but it seems odd that I have to do this to detect what are actually very common variants

Its a viral genome so its small (290kb), so its very easy to see that most of the variants that are supported by a high fraction of the the reads are for some reason absent from the VCF, evidently due to them being interpreted as only being supported by a low fraction of reads. Any advice as to why this is happening would be greatly appreciated. I'm using BBMap v38.90