Open GATKSupportTeam opened 2 years ago
Hi,
I found a similar issue in my vcf files, copying in my question from the gatk forum:
"I came across this thread as I was looking for a solution to a similar problem. I also have a vcf file generated with HaplotypeCaller and GenotypeGVCFs, version 4.2.0.0. In some positions, there is overlap between the reference alleles, much like at position 962-963 in Diana's example, e.g. position 123651 and onwards below.
NC_041772.1 123647 . A . 186.74 . NC_041772.1 123648 . T . 37.61 . NC_041772.1 123649 . T . 185.4 . NC_041772.1 123650 . T . 185.93 . NC_041772.1 123651 . GGCTCTGC . 60.4 . NC_041772.1 123652 . G . 185.19 . NC_041772.1 123653 . C . 188.83 . NC_041772.1 123654 . T . 51.31 . NC_041772.1 123655 . C . 187.97 . NC_041772.1 123656 . T . 185.13 .
I don't have the combined gvcf file stored, but looking at the sample-specific gvcfs I can confirm that the long ref allele at pos 123651 is present only in a single sample's gvcfs file, as a potential deletion (?) supported by a single read if I understand this correctly, and thus rightfully not considered true when genotyped:
NC_041772.1 123651 . GGCTCTGC G,
I apologize if I'm missing something obvious here, and if this is not at all related. But I can't seem to find any reasonable meaning of this ref allele representation, so if this is indeed something real and meaningful I'd be super grateful for some help understanding it. If not, I'm happy to provide any additional info that could help resolving this."
And as an addition to the above case, I have found this problematic representation of alleles also in variable sites in newly generated vcf files (gatk version 4.3.0.0): NC_041772.1 40006060 . GAC G,AAC NC_041772.1 40006061 . A T,G,C, NC_041772.1 40006062 . C T,
This request was created from a contribution made by Diana Robledo on June 23, 2022 10:56 UTC.
Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/6808932798363-Monomorphic-sites-after-GenotypeGVCFs-include-non-variant-sites
--
Hello,
I am using GATKv4.2.6.1 and GATK best practices. I performed joint genotyping of a multi-sample GVCF with GenotypeGVCFs. Because I am doing a population genetic analysis I am very interested in obtaining high confidence monomorphic sites, so I included the option --include-non-variant-sites. In the output VCF, however, I find that there are 3 types of monomorphic sites, for example:
CHROM POS ID REF ALT QUAL FILTER
HiC_scaffold_493 961 . A . . .
HiC_scaffold_493 962 . ATCTCCCC . 7.65 LowQual
HiC_scaffold_493 963 . T . 180.56 .
I am not sure what the differences between those 3 types of monomorphic sites are. I tracked down those positions in the input GVCF and they look like this:
CHROM POS ID REF ALT QUAL FILTER
HiC_scaffold_493 961 . A <NON\_REF> . .
HiC_scaffold_493 962 . ATCTCCCC A,<NON\_REF> . .
HiC_scaffold_493 963 . T *,<NON\_REF> . .
I assume that in the GVCF, position 962 had some evidence of the presence of an alternative allele (A) but it was so poor (QUAL < 30) that it was discarded and the position was deemed as monomorphic in the VCF (LowQual). But what about position 963? There was some evidence of a deletion (*) as alternative allele in the GVCF but it got discarded in the VCF despite QUAL = 180.56?
Also, why does position 961 has no QUAL score at all? In fact, these are results from a small scaffold with 1,000 bp, of which 789 monomorphic sites have no QUAL score at all (like position 961).
This might be a rookie question but any help would be much appreciated!
Diana
(created from Zendesk ticket #289449)
gz#289449