DecodeGenetics / graphtyper

Population-scale genotyping using pangenome graphs
http://dx.doi.org/10.1038/ng.3964
MIT License
166 stars 20 forks source link

better explanation of graphtyper output VCF #104

Open edg1983 opened 2 years ago

edg1983 commented 2 years ago

Hello,

Can you please provide some more details on the format of graphtyper output VCF? I'm interested essentially in 2 points:

  1. Variants definitions. Take the following example: Given my merged sites input from SVIMMER containing these variants
    chr2    122612415       .       T       [chr2:122612659[T       0       .       SVTYPE=BND;MATEID=MantaBND:101955:0:0:5:0:0:1;CIPOS=0,4;HOMLEN=4;HOMSEQ=TGGC;BND_DEPTH=15;MATE_BND_DEPTH=12;NUM_MERGED_SVS=1;STDDEV_POS=0,0
    chr2    122612415       .       TTGGCAATAGCGAGGTAGTGGTAACAGTGGGCTTTGGGCAAGACCCAGTTCTGTGCTGGCCTCTGGTCTGACCCAGCACAGTCCCACTGGTGGTAGCCACAGGGGTGCTTCTGTCACTTTATCCTCAGCTCCAGGCAGCCCAGCAGAGAGAAGGAGATTCCATTTGTTTGGGAAAAAGTAAGAGAACCAGTGTCTCTGCCAGGTAATCTAGAGAATTTTTCCAGATTTTATCCAAGACCACCAAGGTCTTGGGCAAGAACCACAGCATTACTGGGCTTGGAGTGCTCCCTAATGCAGAAATAACTTAGATCACAACATCCAAGACTTTTTTTTTTTTTTTTTTTTGCTATC   T       0       .       END=122612765;SVTYPE=DEL;SVLEN=-350;CIGAR=1M350D;CIPOS=0,3;HOMLEN=3;HOMSEQ=TGG;NUM_MERGED_SVS=1;STDDEV_POS=0,0

The graphtyper output VCF reports

chr2    122612415       chr2:122612415:DG       N       <DEL:SVSIZE=350:AGGREGATED>     1419    PASS    ABHet=0.4462;ABHom=1;AC=11;AF=0.1667;AN=66;END=122612765;MaxAAS=14;MaxAASR=0.5417;MaxAltPP=14;NHet=11;NHomAlt=0;NHomRef=22;NUM_MERGED_SVS=1;PASS_AC=4;PASS_AN=42;PASS_ratio=0.6364;QD=18.13;RefLen=1;SVLEN=350;SVMODEL=AGGREGATED;SVSIZE=350;SVTYPE=DEL;SV_ID=2;SeqDepth=622;VarType=DG;GCF=0.467236
chr2    122612415       chr2:122612415:DG.0     N       <DEL:SVSIZE=350:BREAKPOINT>     1497    PASS    ABHet=0.4293;ABHom=1;AC=12;AF=0.1818;AN=66;END=122612765;MaxAAS=14;MaxAASR=0.7143;MaxAltPP=14;NHet=12;NHomAlt=0;NHomRef=21;NUM_MERGED_SVS=1;PASS_AC=9;PASS_AN=56;PASS_ratio=0.8485;QD=18.89;RefLen=1;SVLEN=350;SVMODEL=BREAKPOINT;SVSIZE=350;SVTYPE=DEL;SV_ID=2;SeqDepth=604;VarType=DG;GCF=0.467236
chr2    122612415       chr2:122612415:DG.1     N       <DEL:SVSIZE=350:COVERAGE>       228     LowQD   ABHet=0.375;ABHom=0.9427;AC=8;AF=0.1212;AN=66;END=122612765;MaxAAS=10;MaxAASR=0.5714;MaxAltPP=0;NHet=8;NHomAlt=0;NHomRef=25;NUM_MERGED_SVS=1;PASS_AC=4;PASS_AN=46;PASS_ratio=0.697;QD=4;RefLen=1;SVLEN=350;SVMODEL=COVERAGE;SVSIZE=350;SVTYPE=DEL;SV_ID=2;SeqDepth=588;VarType=DG;GCF=0.467236
chr2    122612415       chr2:122612415:OG       T       [chr2:122612659[T       678     PASS    ABHet=0.2952;ABHom=1;AC=12;AF=0.1818;AN=66;END=122612415;MaxAAS=10;MaxAASR=0.7143;NHet=12;NHomAlt=0;NHomRef=21;NUM_MERGED_SVS=1;PASS_AC=7;PASS_AN=52;PASS_ratio=0.7879;QD=13.84;RefLen=1;SVMODEL=AGGREGATED;SVTYPE=BND;SV_ID=1;SeqDepth=565;VarType=OG;GCF=0.506173

Can you please explain the meaning of the 3 DEL lines? I think they represent the same call, but using 3 different models for genotyping: AGGREGATED, BREAKPOINT and COVERAGE. This create issues downstream since I have essentially redundant calls. Can you provide any suggestion on how to use / combine these 3 outputs? If I want to obtain a clean list of variants, which of the 3 do you suggest to use?

  1. FILTER tags in the FORMAT fields For filters added to the FORMAT field, the VCF header do not describe the meaning of the various FILTERN tags. Can you provide more details on the meaning of FILTER1, FILTER2, etc. ?
yz42 commented 2 years ago

I have the same question regarding the different "FAILN" tags. So I did a quick check on my VCF file, It seems "FAILN" were assigned based on the GQ value, here is what I found:

PASS: 99 >= GQ > 40 FAIL1: 40 >= GQ > 21 FAIL2: 21 >= GQ > 0 FAIL3: GQ = 0

The author also mentioned in their paper that the high confidence filtration used in graphtyper would bring a big loss of true positives, so they didn't set this filtration as default. It might be more reasonable to filter genotypes based on a certain GQ threshold, instead of filtering out all "failed' records.

I am not sure if the GQ value is the only rule to assign different FAILN tags.

Hope that helps.

jingydz commented 1 year ago

$ bcftools concat test.lst.genotype.vcf/chr21/*.vcf.gz | grep -v ^# | grep -v "LowQUAL" |grep -v "LowQD" |less -SN ... 235 chr21 46237437 chr21:46237437:DG N 255 PASS ABHet=-1;ABHom=1;AC=2;AF=0.5;AN=4;END=46238759;MaxAAS=22;MaxAASR=1;MaxAltPP=0;NHet=0;NHomAlt=1;NHomRef=1;NUM_MERGED_SVS=1;PASS_AC=2;PASS_AN=2;PASS_ratio= 236 chr21 46237437 chr21:46237437:DG.0 N 255 PASS ABHet=-1;ABHom=1;AC=2;AF=0.5;AN=4;END=46238759;MaxAAS=25;MaxAASR=1;MaxAltPP=25;NHet=0;NHomAlt=1;NHomRef=1;NUM_MERGED_SVS=1;PASS_AC=2;PASS_AN=4;PASS_ratio 237 chr21 46237437 chr21:46237437:DG.1 N 273 PASS ABHet=0.3125;ABHom=1;AC=3;AF=0.75;AN=4;END=46238759;MaxAAS=22;MaxAASR=1;MaxAltPP=0;NHet=1;NHomAlt=1;NHomRef=0;NUM_MERGED_SVS=1;PASS_AC=2;PASS_AN=2;PASS_r

What if all three lines are passes?