broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.67k stars 586 forks source link

Request created from: Monomorphic sites after GenotypeGVCFs --include-non-variant-sites #8030

Open GATKSupportTeam opened 1 year ago

GATKSupportTeam commented 1 year ago

This request was created from a contribution made by Diana Robledo on June 23, 2022 10:56 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/6808932798363-Monomorphic-sites-after-GenotypeGVCFs-include-non-variant-sites

--

Hello,

I am using GATKv4.2.6.1 and GATK best practices. I performed joint genotyping of a multi-sample GVCF with GenotypeGVCFs. Because I am doing a population genetic analysis I am very interested in obtaining high confidence monomorphic sites, so I included the option --include-non-variant-sites. In the output VCF, however, I find that there are 3 types of monomorphic sites, for example:

CHROM                POS  ID   REF               ALT   QUAL     FILTER

HiC_scaffold_493    961    .    A                     .        .              .

HiC_scaffold_493    962    .    ATCTCCCC    .        7.65        LowQual

HiC_scaffold_493    963    .    T                     .        180.56    .

I am not sure what the differences between those 3 types of monomorphic sites are. I tracked down those positions in the input GVCF and they look like this:

CHROM                     POS  ID     REF                    ALT                     QUAL     FILTER

HiC_scaffold_493        961     .       A                        <NON\_REF>        .            . 

HiC_scaffold_493        962     .       ATCTCCCC       A,<NON\_REF>     .            .

HiC_scaffold_493        963     .       T                         *,<NON\_REF>     .            .

I assume that in the GVCF, position 962 had some evidence of the presence of an alternative allele (A) but it was so poor (QUAL < 30) that it was discarded and the position was deemed as monomorphic in the VCF (LowQual). But what about position 963? There was some evidence of a deletion (*) as alternative allele in the GVCF but it got discarded in the VCF despite QUAL = 180.56?

Also, why does position 961 has no QUAL score at all? In fact, these are results from a small scaffold with 1,000 bp, of which 789 monomorphic sites have no QUAL score at all (like position 961).

This might be a rookie question but any help would be much appreciated!

Diana

(created from Zendesk ticket #289449)
gz#289449

axeljen commented 1 year ago

Hi,

I found a similar issue in my vcf files, copying in my question from the gatk forum:

"I came across this thread as I was looking for a solution to a similar problem. I also have a vcf file generated with HaplotypeCaller and GenotypeGVCFs, version 4.2.0.0. In some positions, there is overlap between the reference alleles, much like at position 962-963 in Diana's example, e.g. position 123651 and onwards below.

NC_041772.1 123647 . A . 186.74 . NC_041772.1 123648 . T . 37.61 . NC_041772.1 123649 . T . 185.4 . NC_041772.1 123650 . T . 185.93 . NC_041772.1 123651 . GGCTCTGC . 60.4 . NC_041772.1 123652 . G . 185.19 . NC_041772.1 123653 . C . 188.83 . NC_041772.1 123654 . T . 51.31 . NC_041772.1 123655 . C . 187.97 . NC_041772.1 123656 . T . 185.13 .

I don't have the combined gvcf file stored, but looking at the sample-specific gvcfs I can confirm that the long ref allele at pos 123651 is present only in a single sample's gvcfs file, as a potential deletion (?) supported by a single read if I understand this correctly, and thus rightfully not considered true when genotyped:

NC_041772.1 123651 . GGCTCTGC G, 0 . BaseQRankSum=-2.652;DP=18;ExcessHet=3.0103;MLEAC=0,0;MLEAF=0,0;MQRankSum=0;RAW_MQandDP=64800,18;ReadPosRankSum=-1.351 GT:AD:DP:GQ:PGT:PID:PL:PS:SB 0|0:17,1,0:18:43:0|1:123651_GGCTCTGC_G:0,43,754,51,757,765:123651:8,9,0,1

I apologize if I'm missing something obvious here, and if this is not at all related. But I can't seem to find any reasonable meaning of this ref allele representation, so if this is indeed something real and meaningful I'd be super grateful for some help understanding it. If not, I'm happy to provide any additional info that could help resolving this."

axeljen commented 1 year ago

And as an addition to the above case, I have found this problematic representation of alleles also in variable sites in newly generated vcf files (gatk version 4.3.0.0): NC_041772.1 40006060 . GAC G,AAC NC_041772.1 40006061 . A T,G,C, NC_041772.1 40006062 . C T,