freebayes / freebayes

Bayesian haplotype-based genetic polymorphism discovery and genotyping.
http://arxiv.org/abs/1207.3907
MIT License
790 stars 265 forks source link

Incongruencies in GT assignment #793

Open Virginia-b opened 7 months ago

Virginia-b commented 7 months ago

I am using Freebayes (v1.3.6) for SNP calling analysis on eggplant population data. While scrutinizing the VCF file, it has come to my attention the heterozygous assignation in some cases. This was the command used to perform the SNP calling:

/usr/bin/freebayes -b all_parents_mapped_BWA_v3_20X_dedup.bam -f /data/users/vbarfon/Genomes/Solgenomics_2019_08_30/Eggplant_V3_Chromosomes.fa -v multithreadfreebayes/subfiles/all_parents_mapped_BWA_v3_20X_dedup_SMEL3Ch01_0_136534347.vcf -r SMEL3Ch01:0-136534347 --limit-coverage 800 --min-coverage 10 -m 20 -q 20

In this case, a homozygous genotype was assigned to sample5, supported by 11 reference reads and 1 alternate read, resulting in an alternate allele frequency of 1/12=0.083. On the other hand, sample8 exhibited a heterozygous genotype, supported by 9 reference reads and 1 alternate read, with an alternate allele frequency of 1/10=0.1. Despite the similarity in alternate allele frequencies between samples, Freebayes assigns different genotypes at the same loci.

/SMEL3Ch01 96806 . TTTGCC CTGGCG 47.5168 . AB=0.185185;ABP=49.4959;AC=3;AF=0.1875;AN=16;AO=12;CIGAR=1X1M1X2M1X;DP=131;DPB=131;DPRA=1.18636;EPP=3.73412;EPPR=3.90444;GTI=1;LEN=6;MEANALT=1;MQM=35.5833;MQMR=60;NS=8;NUMALT=1;ODDS=4.88556;PAIRED=1;PAIREDR=0.915966;PAO=0;PQA=0;PQR=0;PRO=0;QA=465;QR=4692;RO=119;RPL=10;RPP=14.5915;RPPR=4.48836;RPR=2;RUN=1;SAF=9;SAP=9.52472;SAR=3;SRF=58;SRP=3.17453;SRR=61;TYPE=complex GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:31.4607:17:13,4:13:520:4:159:-6.03904,0,-42.0287 0/0:101.192:22:22,0:22:872:0:0:0,-6.62266,-78.7923 0/0:71.0896:12:12,0:12:460:0:0:0,-3.61236,-41.7443 0/0:63.2764:21:20,1:20:792:1:38:0,-2.73405,-68.0036 0/0:51.1134:12:11,1:11:437:1:40:0,-1.51775,-37.5857 0/1:41.2006:27:22,5:22:851:5:187:-7.81267,0,-68.7726 0/0:65.069:10:10,0:10:401:0:0:0,-3.0103,-36.4467 0/1:0.0326615:10:9,1:9:359:1:41:-0.735817,0,-29.6608

Moreover, at other loci, sample6, supported by 11 reference reads and 1 alternate read, exhibits a heterozygous genotype instead of a homozygous one, as exhibited in the previous case by sample5.

/SMEL3Ch01 82135 . A G 885.116 . AB=0.0833333;ABP=21.1059;AC=5;AF=0.3125;AN=16;AO=30;CIGAR=1X;DP=107;DPB=107;DPRA=1.03535;EPP=10.2485;EPPR=11.1604;GTI=1;LEN=1;MEANALT=1;MQM=60;MQMR=59.6104;NS=8;NUMALT=1;ODDS=0.82119;PAIRED=1;PAIREDR=0.922078;PAO=0;PQA=0;PQR=0;PRO=0;QA=1174;QR=2974;RO=77;RPL=19;RPP=7.64277;RPPR=9.35551;RPR=11;RUN=1;SAF=11;SAP=7.64277;SAR=19;SRF=35;SRP=4.39215;SRR=42;TYPE=snp GT:GQ:DP:AD:RO:QR:AO:QA:GL 1/1:30.4148:14:0,14:0:0:14:551:-49.9224,-4.21442,0 0/0:52.3018:15:15,0:15:570:0:0:0,-4.51545,-51.6258 1/1:33.4251:15:0,15:0:0:15:582:-52.7037,-4.51545,0 0/0:40.2606:11:11,0:11:418:0:0:0,-3.31133,-37.9553 0/0:40.2606:11:11,0:11:434:0:0:0,-3.31133,-39.4105 0/1:1.58028:12:11,1:11:398:1:41:-0.482207,0,-32.3809 0/0:58.3224:17:17,0:17:679:0:0:0,-5.11751,-61.4334 0/0:43.2709:12:12,0:12:475:0:0:0,-3.61236,-43.0969

This performance was also observed at higher coverages. For example, in this case, sample1, sample2, sample4, and sample7 carry a heterozygous genotype when their alternate allele frequencies are 4/45=0.088, 4/27=0.148, 6/45=0.133, 3/24=0.125, respectively.

/SMEL3Ch01 96344 . T C 30.2059 . AB=0.120567;ABP=179.331;AC=4;AF=0.25;AN=16;AO=24;CIGAR=1X;DP=242;DPB=242;DPRA=0;EPP=4.45795;EPPR=3.82085;GTI=2;LEN=1;MEANALT=1.125;MQM=45.4583;MQMR=54.682;NS=8;NUMALT=1;ODDS=5.42383;PAIRED=1;PAIREDR=0.949309;PAO=0;PQA=0;PQR=0;PRO=0;QA=966;QR=8706;RO=217;RPL=21;RPP=32.3252;RPPR=3.10036;RPR=3;RUN=1;SAF=15;SAP=6.26751;SAR=9;SRF=114;SRP=4.22112;SRR=103;TYPE=snp GT:GQ:DP:AD:RO:QR:AO:QA:GL 0/1:0.000458462:45:41,4:41:1610:4:164:-1.33939,0,-127.693 0/1:22.7339:27:23,4:23:927:4:164:-6.37746,0,-71.7496 0/0:82.5237:17:16,1:16:643:1:32:0,-1.94407,-52.6745 0/1:29.6502:45:38,6:38:1525:6:246:-8.28092,0,-119.131 0/0:64.3246:27:25,2:25:1013:2:82:0,-0.614318,-82.214 0/0:66.6283:32:29,3:29:1168:3:114:0,-0.956644,-93.4502 0/1:0.0107044:24:21,3:21:853:3:123:-2.78843,0,-68.7308 0/0:96.5417:25:24,1:24:967:1:41:0,-3.63818,-80.9723

Summarizing, those results showed that Freebayes assigns genotypes randomly (two different sites supported by the same number of reads have different genotypes) and it assigns heterozygous genotypes even when the alternate allele frequency is too low, expecting homozygous genotypes at those sites. How can I deal with this issue? The VCF file's head is attached.

data_20X_head.txt

Virginia-b commented 7 months ago

It was also asked for help in Biostars: https://www.biostars.org/p/9591810/