broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.68k stars 587 forks source link

GenotypeGVCFs: Add ability to convert current (0/0) genotype format to standard (./.) format #8328

Open derekca opened 1 year ago

derekca commented 1 year ago

Feature request

Tool(s) or class(es) involved

GenotypeGVCFs as of GATK version 4.2.3.0 and newer.

Description

As of GATK version 4.2.3.0, GenotypeGVCFs began representing missing genotypes in the hom-ref genotype 0/0 representation instead of the standard ./. format. I wrote about it on the GATK blog, so I understand the rationale, as well as the fact that users can still determine the missing genotypes because the FORMAT DP field will be 0.

However, users in the comments still express frustration in using GenotypeGVCFs outputs because they don't play nice with other tools (like PLINK, for example).

For those users, I am requesting a conversion tool that goes through a GenotypeGVCFs output VCF, and whenever it finds a hom-ref genotype with DP=0, it replaces it with the standard ./. representation.

Alternately, this could possibly be implemented as an option within GenotypeGVCFs.

gokalpcelik commented 1 year ago

Bcftools setGT plugin is probably the best way of fixing this issue currently.

bcftools plugin setGT -h      
About: Sets genotypes. The target genotypes can be specified as:
           ./.  .. completely missing ("." or "./.", depending on ploidy)
           ./x  .. partially missing (e.g., "./0" or ".|1" but not "./.")
           .    .. partially or completely missing
           a    .. all genotypes
           b    .. heterozygous genotypes failing two-tailed binomial test (example below)
           q    .. select genotypes using -i/-e options
       and the new genotype can be one of:
           .    .. missing ("." or "./.", keeps ploidy)
           0    .. reference allele (e.g. 0/0 or 0, keeps ploidy)
           c:GT .. custom genotype (e.g. 0/0, 0, 0/1, m/M, overrides ploidy)
           m    .. minor (the second most common) allele (e.g. 1/1 or 1, keeps ploidy)
           M    .. major allele (e.g. 1/1 or 1, keeps ploidy)
           p    .. phase genotype (0/1 becomes 0|1)
           u    .. unphase genotype and sort by allele (1|0 becomes 0/1)
Usage: bcftools +setGT [General Options] -- [Plugin Options]
Options:
   run "bcftools plugin" for a list of common options

Plugin options:
   -e, --exclude <expr>        Exclude a genotype if true (requires -t q)
   -i, --include <expr>        include a genotype if true (requires -t q)
   -n, --new-gt <type>         Genotypes to set, see above
   -t, --target-gt <type>      Genotypes to change, see above

Example:
   # set missing genotypes ("./.") to phased ref genotypes ("0|0")
   bcftools +setGT in.vcf -- -t . -n 0p

   # set missing genotypes with DP>0 and GQ>20 to ref genotypes ("0/0")
   bcftools +setGT in.vcf -- -t q -n 0 -i 'GT="." && FMT/DP>0 && GQ>20'

   # set partially missing genotypes to completely missing
   bcftools +setGT in.vcf -- -t ./x -n .

   # set heterozygous genotypes to 0/0 if binom.test(nAlt,nRef+nAlt,0.5)<1e-3
   bcftools +setGT in.vcf -- -t "b:AD<1e-3" -n 0

   # force unphased heterozygous genotype if binom.test(nAlt,nRef+nAlt,0.5)>0.1
   bcftools +setGT in.vcf -- -t ./x -n c:'m/M'

I was always wondering if GATK will have a plugin interface where people can code their own using groovy, kotlin, javascript or python plugins to extend some of the functionality where developers may not reach immediately. Personally I use htsjdk extensively (and sometimes pysam) to code a new personal tool each time I need something that I cannot find exactly what I look for. But a generic gatk plugin interface would be really useful and may provide means to extend the community support.

eriqande commented 1 year ago

Hey y'all! I think this will be a welcome addition. I did want to point out that to me it also makes sense to force things that have PL=0,0,0 (i.e. a flat genotype likelihood) to also be called ./. in preference to calling it 0/0 for downstream uses. I encountered plenty of cases where DP>0 but PL=0,0,0, so I check for both of those when setting genotypes to ./. in my workflows:

https://github.com/eriqande/mega-non-model-wgs-snakeflow/blob/2e9218e9211fa140bb11b59cb154181b7dc5f205/workflow/rules/calling.smk#L258-L259

(I also find it useful to count up the number of missing genotypes afterward.)

graphenn commented 1 year ago

I try to make a pipeline for GenotypeGVCFs and bcftools

gatk --java-options "-Xmx2048m -XX:ParallelGCThreads=1  -Dsamjdk.compression_level=0" \
    GenotypeGVCFs \
    -R ./myreference.fa \
    -V gendb://mydatabase/ \
    -O /dev/stdout | \
    bcftools +setGT -Oz -o resutl.vcf.gz - -- -t q -n . -i 'FORMAT/DP=0 | SMPL_MAX(FORMAT/PL)=0'
davmlaw commented 7 months ago

If you want efficiency - output a new file format, don't output a file labelled VCF that violates the specs

iqbal-lab commented 7 months ago

I'm not a GATK user, so feel free to ignore this, but my suggestion, FWIW, is to call your default thing something other than a VCF file, as it breaks the spec. The spec says "if a call cannot be made for a sample at a given locus '.' should be specified".

If you want to claim that it's not that a call cannot be made, you're deliberately making a REF genotype call when you are sure you don't want the user to treat it as a REF call....you could I suppose put something in the FILTER column, which will be fun in a multisample VCF

OR make a major change to the GATK version number and announce loudly that you are breaking the spec and changing policy

davidecarlson commented 7 months ago

Just another comment requesting that either this change be reverted or an option to output a VCF that follows the spec be added. Thank you!

ksamuk commented 7 months ago

Chiming in here -- the default output for GenotypeGVCFs should be a VCF that follows the spec. An important consideration here is that many users/pipelines usually only perform site level filtering. Genotype level filtering is rarely performed. This causes problems in cases where most genotypes are of high quality at a site and a small number are missing but called as "0/0". Those 0/0 calls would then slip by. Coding missing data as 0/0 also breaks common site level filters such as vcftools' --max-missing, which relies on missing sites being coded per the spec.

nreid commented 7 months ago

I would like to point out that aside from breaking tools and making more work for the many users who want to use the GT field and who will actually be aware of this change and its implications, many users are not particularly sophisticated (no knock on them, in the immortal words of Tim Robinson, "not everybody knows how to do everything") and so problems arising from this are likely to be caught, if at all at the manuscript review level. Reviewer comments to the effect of "go back to your genotype calls, fix them, and then redo everything downstream" are going to be awful for everyone involved to deal with. And this is going to happen a lot.

Nopaoli commented 7 months ago

Just another message reiterating that a VCF should follow the specs of a VCF. Default for missing data should be ./. . Having the default as 0/0, apart from being nonsensical, is likely to generate a lot of downstream errors in many papers.

lh3 commented 7 months ago

In addition to breaking compatibility and giving users a new way to blow their feet, using DP=0 to indicate a missing genotype doesn't generally work. Note that with imputation, we can confidently infer a 0/0 genotype even if there are no reads. Given VCFs from a variety of sources, a generic VCF processing tool can't tell whether 0/0 is a non-ALT genotype or missing just based on DP=0.

PS: I understand reverting the change will cause more confusions in the short term, but it will save us a lot more troubles in the long term.

droazen commented 7 months ago

This change was originally made to allow our joint calling pipeline to scale to handle the hundreds of thousands of samples in the AllOfUs project. I agree that it's problematic and confusing for downstream users. Since AllOfUs recently switched back to using the ./. convention in their callsets in response to similar complaints, we are going to revisit this decision in GATK GenotypeGVCFs and strongly consider reverting back to ./. as well in a future release.

Fiwx commented 7 months ago

This breaks polygenic score pipelines, which have 0/0 without DP (imputed sites). Maybe just revert and include switching 0/0 to ./. as an option

bergstand commented 7 months ago

I support the multiple voices raised to revert the change, and not use 0/0 to represent missing genotypes. It breaks the VCF spec to my understanding, and it would be a long-term recipe for confusion and chaos for tools and users.

A great point raised by lh3 and others is that DP=0 cannot be taken to to mean missing genotype, because it is possible to have a high-quality genotype even with no reads (i.e. through imputation).

grenaud commented 7 months ago

yet another message to support reverting to good old ./. for missing genotypes and 0/0 for homozygous reference.

hlnorth commented 7 months ago

Echoing comments above for ./. missing genotype format. Many commonly used downstream filters expect this format.

xiekunwhy commented 6 months ago

Echoing comments above for ./. missing genotype format. Many commonly used downstream filters expect this format.

mufernando commented 3 months ago

Still hoping to see a reversal to the old behavior for no-calls!

droazen commented 3 months ago

@mufernando We've merged a series of patches to revert to the previous no-call behavior. These will be part of the next release (GATK 4.6), due soon.

saudat-bio-code commented 3 months ago

Hi @graphenn,

Could you please explain/justify the use of SMPL_MAX in this line? This would be very helpful, thanks in advance!

'FORMAT/DP=0 | SMPL_MAX(FORMAT/PL)=0'

gokalpcelik commented 3 months ago

That code checks whether DP is 0 or the maximum PL value is 0. If any of the conditions is satisfied then plugin will assign nocall ./. to that site.

davmlaw commented 1 month ago

Release 4.6.0.0

By overwhelming popular demand, we've switched back to using the standard ./. representation for no-calls in GenotypeGVCFs and GenomicsDB instead of 0/0 with DP=0