Open derekca opened 1 year ago
Bcftools setGT plugin is probably the best way of fixing this issue currently.
bcftools plugin setGT -h
About: Sets genotypes. The target genotypes can be specified as:
./. .. completely missing ("." or "./.", depending on ploidy)
./x .. partially missing (e.g., "./0" or ".|1" but not "./.")
. .. partially or completely missing
a .. all genotypes
b .. heterozygous genotypes failing two-tailed binomial test (example below)
q .. select genotypes using -i/-e options
and the new genotype can be one of:
. .. missing ("." or "./.", keeps ploidy)
0 .. reference allele (e.g. 0/0 or 0, keeps ploidy)
c:GT .. custom genotype (e.g. 0/0, 0, 0/1, m/M, overrides ploidy)
m .. minor (the second most common) allele (e.g. 1/1 or 1, keeps ploidy)
M .. major allele (e.g. 1/1 or 1, keeps ploidy)
p .. phase genotype (0/1 becomes 0|1)
u .. unphase genotype and sort by allele (1|0 becomes 0/1)
Usage: bcftools +setGT [General Options] -- [Plugin Options]
Options:
run "bcftools plugin" for a list of common options
Plugin options:
-e, --exclude <expr> Exclude a genotype if true (requires -t q)
-i, --include <expr> include a genotype if true (requires -t q)
-n, --new-gt <type> Genotypes to set, see above
-t, --target-gt <type> Genotypes to change, see above
Example:
# set missing genotypes ("./.") to phased ref genotypes ("0|0")
bcftools +setGT in.vcf -- -t . -n 0p
# set missing genotypes with DP>0 and GQ>20 to ref genotypes ("0/0")
bcftools +setGT in.vcf -- -t q -n 0 -i 'GT="." && FMT/DP>0 && GQ>20'
# set partially missing genotypes to completely missing
bcftools +setGT in.vcf -- -t ./x -n .
# set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3
bcftools +setGT in.vcf -- -t "b:AD<1e-3" -n 0
# force unphased heterozygous genotype if binom.test(nAlt,nRef+nAlt,0.5)>0.1
bcftools +setGT in.vcf -- -t ./x -n c:'m/M'
I was always wondering if GATK will have a plugin interface where people can code their own using groovy, kotlin, javascript or python plugins to extend some of the functionality where developers may not reach immediately. Personally I use htsjdk extensively (and sometimes pysam) to code a new personal tool each time I need something that I cannot find exactly what I look for. But a generic gatk plugin interface would be really useful and may provide means to extend the community support.
Hey y'all! I think this will be a welcome addition. I did want to point out that to me it also makes sense to force things that have PL=0,0,0 (i.e. a flat genotype likelihood) to also be called ./. in preference to calling it 0/0 for downstream uses. I encountered plenty of cases where DP>0 but PL=0,0,0, so I check for both of those when setting genotypes to ./. in my workflows:
(I also find it useful to count up the number of missing genotypes afterward.)
I try to make a pipeline for GenotypeGVCFs and bcftools
gatk --java-options "-Xmx2048m -XX:ParallelGCThreads=1 -Dsamjdk.compression_level=0" \
GenotypeGVCFs \
-R ./myreference.fa \
-V gendb://mydatabase/ \
-O /dev/stdout | \
bcftools +setGT -Oz -o resutl.vcf.gz - -- -t q -n . -i 'FORMAT/DP=0 | SMPL_MAX(FORMAT/PL)=0'
If you want efficiency - output a new file format, don't output a file labelled VCF that violates the specs
I'm not a GATK user, so feel free to ignore this, but my suggestion, FWIW, is to call your default thing something other than a VCF file, as it breaks the spec. The spec says "if a call cannot be made for a sample at a given locus '.' should be specified".
If you want to claim that it's not that a call cannot be made, you're deliberately making a REF genotype call when you are sure you don't want the user to treat it as a REF call....you could I suppose put something in the FILTER column, which will be fun in a multisample VCF
OR make a major change to the GATK version number and announce loudly that you are breaking the spec and changing policy
Just another comment requesting that either this change be reverted or an option to output a VCF that follows the spec be added. Thank you!
Chiming in here -- the default output for GenotypeGVCFs should be a VCF that follows the spec. An important consideration here is that many users/pipelines usually only perform site level filtering. Genotype level filtering is rarely performed. This causes problems in cases where most genotypes are of high quality at a site and a small number are missing but called as "0/0". Those 0/0 calls would then slip by. Coding missing data as 0/0 also breaks common site level filters such as vcftools' --max-missing, which relies on missing sites being coded per the spec.
I would like to point out that aside from breaking tools and making more work for the many users who want to use the GT field and who will actually be aware of this change and its implications, many users are not particularly sophisticated (no knock on them, in the immortal words of Tim Robinson, "not everybody knows how to do everything") and so problems arising from this are likely to be caught, if at all at the manuscript review level. Reviewer comments to the effect of "go back to your genotype calls, fix them, and then redo everything downstream" are going to be awful for everyone involved to deal with. And this is going to happen a lot.
Just another message reiterating that a VCF should follow the specs of a VCF. Default for missing data should be ./. . Having the default as 0/0, apart from being nonsensical, is likely to generate a lot of downstream errors in many papers.
In addition to breaking compatibility and giving users a new way to blow their feet, using DP=0 to indicate a missing genotype doesn't generally work. Note that with imputation, we can confidently infer a 0/0 genotype even if there are no reads. Given VCFs from a variety of sources, a generic VCF processing tool can't tell whether 0/0 is a non-ALT genotype or missing just based on DP=0.
PS: I understand reverting the change will cause more confusions in the short term, but it will save us a lot more troubles in the long term.
This change was originally made to allow our joint calling pipeline to scale to handle the hundreds of thousands of samples in the AllOfUs project. I agree that it's problematic and confusing for downstream users. Since AllOfUs recently switched back to using the ./. convention in their callsets in response to similar complaints, we are going to revisit this decision in GATK GenotypeGVCFs and strongly consider reverting back to ./. as well in a future release.
This breaks polygenic score pipelines, which have 0/0 without DP (imputed sites). Maybe just revert and include switching 0/0 to ./. as an option
I support the multiple voices raised to revert the change, and not use 0/0 to represent missing genotypes. It breaks the VCF spec to my understanding, and it would be a long-term recipe for confusion and chaos for tools and users.
A great point raised by lh3 and others is that DP=0 cannot be taken to to mean missing genotype, because it is possible to have a high-quality genotype even with no reads (i.e. through imputation).
yet another message to support reverting to good old ./. for missing genotypes and 0/0 for homozygous reference.
Echoing comments above for ./.
missing genotype format. Many commonly used downstream filters expect this format.
Echoing comments above for ./. missing genotype format. Many commonly used downstream filters expect this format.
Still hoping to see a reversal to the old behavior for no-calls!
@mufernando We've merged a series of patches to revert to the previous no-call behavior. These will be part of the next release (GATK 4.6), due soon.
Hi @graphenn,
Could you please explain/justify the use of SMPL_MAX in this line? This would be very helpful, thanks in advance!
'FORMAT/DP=0 | SMPL_MAX(FORMAT/PL)=0'
That code checks whether DP is 0 or the maximum PL value is 0. If any of the conditions is satisfied then plugin will assign nocall ./. to that site.
By overwhelming popular demand, we've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0
Feature request
Tool(s) or class(es) involved
GenotypeGVCFs as of GATK version 4.2.3.0 and newer.
Description
As of GATK version 4.2.3.0, GenotypeGVCFs began representing missing genotypes in the hom-ref genotype
0/0
representation instead of the standard./.
format. I wrote about it on the GATK blog, so I understand the rationale, as well as the fact that users can still determine the missing genotypes because the FORMATDP
field will be0
.However, users in the comments still express frustration in using GenotypeGVCFs outputs because they don't play nice with other tools (like PLINK, for example).
For those users, I am requesting a conversion tool that goes through a GenotypeGVCFs output VCF, and whenever it finds a hom-ref genotype with
DP=0
, it replaces it with the standard./.
representation.Alternately, this could possibly be implemented as an option within GenotypeGVCFs.