DecodeGenetics / graphtyper

Population-scale genotyping using pangenome graphs
http://dx.doi.org/10.1038/ng.3964
MIT License
167 stars 20 forks source link

INFO/AAScore using graphtyper 2.7.4 is missing #113

Closed jjfarrell closed 1 year ago

jjfarrell commented 2 years ago

When running Graphtyper 2.7, the info field output included the AAScore field along with the other QC fields. However, when we ran the newer 2.74, the AAScore was not found in the INFO field even though it was described in the VCF Header. Any suggestions with this?

zgrep Vers adsp_13710_pcrfree.graphtyper.chr1.vcf.gz

graphtyperVersion=2.7.4

bcftools_concatVersion=1.10.2+htslib-1.10.2

[farrell@scc-wr3 workarea.ont_manta.adsp_13710_pcrfree]$ zcat adsp_13710_pcrfree.graphtyper.chr1.vcf.gz|grep -v ^#|head|cut -f1-8

chr1    10001   chr1:10001:UG   N       <DUP:SVSIZE=196:AGGREGATED>     3233292 LowPratio       ABHet=0.5848;ABHom=0.9995;AC=27342;AF=0.9972;AN=27420;END=10197;MaxAAS=43;MaxAASR=1;MaxAltPP=0;NHet=32;NHomAlt=13655;NHomRef=23;NUM_MERGED_SVS=3;PASS_AC=0;PASS_AN=0;PASS_ratio=0;QD=23.33;RELATED_SV_ID=1;RefLen=1;SEQ=AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCC;SVLEN=196;SVMODEL=AGGREGATED;SVSIZE=196;SVTYPE=DUP;SV_ID=0;SeqDepth=316675;VarType=UG
chr1    10001   chr1:10001:UG.0 N       <DUP:SVSIZE=196:BREAKPOINT1>    36613   LowPratio       ABHet=0.3333;ABHom=0.986;AC=2808;AF=0.7883;AN=3562;END=10197;MaxAAS=4;MaxAASR=1;MaxAltPP=4;NHet=20;NHomAlt=1394;NHomRef=12296;NUM_MERGED_SVS=3;PASS_AC=0;PASS_AN=0;PASS_ratio=0;QD=21.88;RELATED_SV_ID=1;RefLen=1;SEQ=AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCC;SVLEN=196;SVMODEL=BREAKPOINT1;SVSIZE=196;SVTYPE=DUP;SV_ID=0;SeqDepth=2285;VarType=UG
chr1    10001   chr1:10001:UG.1 N       <DUP:SVSIZE=196:BREAKPOINT2>    3106    LowPratio       ABHet=0.2033;ABHom=0.9994;AC=231;AF=0.008739;AN=26434;END=10197;MaxAAS=3;MaxAASR=1;MaxAltPP=3;NHet=159;NHomAlt=36;NHomRef=13515;NUM_MERGED_SVS=3;PASS_AC=0;PASS_AN=0;PASS_ratio=0;QD=14.58;RELATED_SV_ID=0;RefLen=1;SEQ=AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCC;SVLEN=196;SVMODEL=BREAKPOINT2;SVSIZE=196;SVTYPE=DUP;SV_ID=1;SeqDepth=50612;VarType=UG
chr1    10001   chr1:10001:UG.2 N       <DUP:SVSIZE=196:COVERAGE>       3235545 PASS    ABHet=0.6227;ABHom=0.9994;AC=27373;AF=0.9983;AN=27420;END=10197;MaxAAS=43;MaxAASR=1;MaxAltPP=0;NHet=47;NHomAlt=13663;NHomRef=0;NUM_MERGED_SVS=3;PASS_AC=27127;PASS_AN=27142;PASS_ratio=0.9899;QD=23.31;RELATED_SV_ID=1;RefLen=1;SEQ=AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACCCTAACCCTAACCCTAACCTAACCCTAACCCTAACCC;SVLEN=196;SVMODEL=COVERAGE;SVSIZE=196;SVTYPE=DUP;SV_ID=0;SeqDepth=316924;VarType=UG
chr1    10111   chr1:10111:DG   N       <DEL:SVSIZE=66:AGGREGATED>      731     LowPratio       ABHet=0.4095;ABHom=0.9999;AC=33;AF=0.001204;AN=27420;END=10177;MaxAAS=9;MaxAASR=1;MaxAltPP=9;NHet=9;NHomAlt=12;NHomRef=13689;NUM_MERGED_SVS=1;PASS_AC=0;PASS_AN=0;PASS_ratio=0;QD=8.807;RefLen=1;SVLEN=66;SVMODEL=AGGREGATED;SVSIZE=66;SVTYPE=DEL;SV_ID=2;SeqDepth=415129;VarType=DG
chr1    10111   chr1:10111:DG.0 N       <DEL:SVSIZE=66:BREAKPOINT>      251590  LowPratio       ABHet=0.611;ABHom=0.9987;AC=16742;AF=0.9624;AN=17396;END=10177;MaxAAS=9;MaxAASR=1;MaxAltPP=9;NHet=336;NHomAlt=8203;NHomRef=5171;NUM_MERGED_SVS=1;PASS_AC=3;PASS_AN=6;PASS_ratio=0.0003449;QD=13.97;RefLen=1;SVLEN=66;SVMODEL=BREAKPOINT;SVSIZE=66;SVTYPE=DEL;SV_ID=2;SeqDepth=18862;VarType=DG
chr1    10111   chr1:10111:DG.1 N       <DEL:SVSIZE=66:COVERAGE>        124     LowQD;LowPratio ABHet=0.3592;ABHom=0.9998;AC=11;AF=0.0004012;AN=27420;END=10177;MaxAAS=7;MaxAASR=0.4667;MaxAltPP=0;NHet=11;NHomAlt=0;NHomRef=13699;NUM_MERGED_SVS=1;PASS_AC=0;PASS_AN=26414;PASS_ratio=0.9633;QD=2.431;RefLen=1;SVLEN=66;SVMODEL=COVERAGE;SVSIZE=66;SVTYPE=DEL;SV_ID=2;SeqDepth=415257;VarType=DG
chr1    10344   chr1:10344:DG   N       <DEL:SVSIZE=146:AGGREGATED>     19983   LowQD;LowPratio ABHet=0.369;ABHom=0.9895;AC=544;AF=0.01991;AN=27324;END=10490;MaxAAS=30;MaxAASR=0.7857;MaxAltPP=0;NHet=542;NHomAlt=1;NHomRef=13167;NUM_MERGED_SVS=1;PASS_AC=0;PASS_AN=0;PASS_ratio=0;QD=4.467;RefLen=1;SVLEN=146;SVMODEL=AGGREGATED;SVSIZE=146;SVTYPE=DEL;SV_ID=3;SeqDepth=375179;VarType=DG
chr1    10344   chr1:10344:DG.0 N       <DEL:SVSIZE=146:BREAKPOINT>     0       LowQD;LowQUAL;LowPratio ABHet=-1;ABHom=1;AC=0;AF=0;AN=906;END=10490;MaxAAS=0;MaxAASR=0;MaxAltPP=0;NHet=0;NHomAlt=0;NHomRef=13710;NUM_MERGED_SVS=1;PASS_AC=0;PASS_AN=0;PASS_ratio=0;QD=0;RefLen=1;SVLEN=146;SVMODEL=BREAKPOINT;SVSIZE=146;SVTYPE=DEL;SV_ID=3;SeqDepth=619;VarType=DG
chr1    10344   chr1:10344:DG.1 N       <DEL:SVSIZE=146:COVERAGE>       19986   LowQD   ABHet=0.3688;ABHom=0.9887;AC=545;AF=0.01988;AN=27420;END=10490;MaxAAS=30;MaxAASR=0.7857;MaxAltPP=0;NHet=543;NHomAlt=1;NHomRef=13166;NUM_MERGED_SVS=1;PASS_AC=261;PASS_AN=25714;PASS_ratio=0.9378;QD=4.462;RefLen=1;SVLEN=146;SVMODEL=COVERAGE;SVSIZE=146;SVTYPE=DEL;SV_ID=3;SeqDepth=376397;VarType=DG
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20211111
##source=Graphtyper
##graphtyperVersion=2.7
##graphtyperGitBranch=master
##graphtyperSHA1=ffb1ab2bea804973f197784783707dd46d7d5087
##contig=<ID=chr1,length=248956422>
##contig=<ID=chr2,length=242193529>
##contig=<ID=chr3,length=198295559>
[farrell@scc-wr3 workarea.manta.adsp_pcr_free_1]$ zcat   adsp_pcr_free_1.graphtyper.chr1.graphtyper.chr1.vcf.gz|cut -f1-8|grep -v ^#|head
gzip: adsp_pcr_free_1.graphtyper.chr1.graphtyper.chr1.vcf.gz: No such file or directory
[farrell@scc-wr3 workarea.manta.adsp_pcr_free_1]$ zcat   adsp_pcr_free_1.graphtyper.chr1.vcf.gz|cut -f1-8|grep -v ^#|head
chr1    54719   chr1:54719:UG   N       <DUP:SVSIZE=55:AGGREGATED>      860885  PASS    AAScore=0.2428;ABHet=0.5053;ABHetMulti=0.5053,0.4947;ABHom=0.9977;ABHomMulti=0.9995,0.9976;AC=8787;AF=0.9584;AN=9168;CR=0;CRal=0,0;CRalt=0;END=54720;LOGF=0.9797;MMal=383422,17016;MMalt=0.0126195;MQ=34;MQSal=155211233,2510500;MQalt=4;MQsquared=162917433;MaxAAS=93;MaxAASR=1;MaxAltPP=1;NHet=57;NHomAlt=4365;NHomRef=252;NUM_MERGED_SVS=956;PASS_AC=6187;PASS_AN=6448;PASS_ratio=0.7033;QD=20.04;QDalt=20.04;RELATED_SV_ID=1;RefLen=1;SB=0.4796;SBAlt=0.363;SBF=30338,253;SBF1=16583,140;SBF2=13755,113;SBR=32747,444;SBR1=16548,222;SBR2=16199,222;SDal=1317798,34678;SDalt=0.257181;SEQ=TCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTTTCTT;SVLEN=55;SVMODEL=AGGREGATED;SVSIZE=55;SVTYPE=DUP;SV_ID=0;SeqDepth=141743;VarType=UG
chr1    54719   chr1:54719:UG.0 N       <DUP:SVSIZE=55:BREAKPOINT1>     2676    LowABHet;LowQD  AAScore=0.8248;ABHet=0.1229;ABHetMulti=0.1229,0.8771;ABHom=0.9943;ABHomMulti=0.9943,0.5;AC=206;AF=0.02282;AN=9026;CR=0;CRal=0,0;CRalt=0;END=54720;LOGF=0.002512;MMal=383422,17016;MMalt=2.44132;MQ=50
jjfarrell commented 2 years ago

Looking back at the documentation/code, it looks like the Graphtyper AAScore was originally intended just for SNPs and INDELs and therefore AASCore was intently dropped for SVs in recent versions. The training for the logistic regression model used in AAScore appears based on the GIAB SNV and INDELs. Did I get that right?

To help with SV QC, could a AAScore be developed for the SVs using the GIAB SVs as a training set also? Maybe consider a model for each variant type (SNV, DEL, DUP, INS) to increase the accuracy of the AAScore. For example, GATK pipeline runs two VQSR models trained on SNVs and small INDELs separately with different QC metrics.