broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk
Other
1.67k stars 586 forks source link

AS_FilterStatus uses an incorrect number of fields for some records #6857

Open DonFreed opened 3 years ago

DonFreed commented 3 years ago

Bug Report

Affected tool

Mutect2

Affected versions

Description

Mutect2’s header defines AS_FilterStatus as follows:

##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Filter status for each allele, as assessed by ApplyRecalibration. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">

AS_FilterStatus uses the pipe character | for per-allele concatenation and a comma , for filter concatenation. This causes records to have an incorrect number of values at sites with multiple filters or multiple alleles. Some examples:

chr1    826950  .       G       T       .       clustered_events;contamination;map_qual;strand_bias    AS_FilterStatus=map_qual,strand_bias,contamination;AS_SB_TABLE=86,101|7,0;DP=199;ECNT=3;GERMQ=93;MBQ=35,34;MFRL=193,211;MMQ=33,27;MPOS=6;NALOD=1.28;NLOD=5.42;POPAF=1.39;ROQ=80;TLOD=8.79       GT:AD:AF:DP:F1R2:F2R1:SB        0/0:36,0:0.05:36:18,0:18,0:24,12,0,0    0/1:151,7:0.055:158:76,4:71,3:62,89,7,0
chr1    3633298 .       GT      G,GTT   .       contamination;multiallelic;normal_artifact;slippage;weak_evidence    AS_FilterStatus=weak_evidence,contamination|weak_evidence,contamination;AS_SB_TABLE=89,7|7,0|7,0;DP=129;ECNT=1;GERMQ=67;MBQ=20,20,20;MFRL=0,0,0;MMQ=60,60,60;MPOS=17,31;NALOD=-0.2424,0.21;NLOD=6.4,6.36;POPAF=2.49,2.04;ROQ=93;RPA=11,10,12;RU=T;STR;STRQ=1;TLOD=3.04,4.6      GT:AD:AF:DP:F1R2:F2R1:SB        0/0:58,4,4:0.072,0.069:66:28,2,2:28,2,2:54,4,8,0        0/1/2:38,3,3:0.083,0.084:44:21,1,3:15,2,0:35,3,6,0

A quick fix would be to define Number=1 for AS_FilterStatus in the VCF header. Alternatively, using a pipe for filter concatenation and a comma for per-allele concatenation might be more compliant with the VCF specification.

droazen commented 3 years ago

@ldgauthier It seems that we might be violating the VCF spec with the way values are delimited in the AS_FilterStatus annotation. How would you feel about correcting this?

ldgauthier commented 3 years ago

@droazen Are you thinking String with count=1 and then DIY parsing? Despite years of discussions, we never made any progress on lists of lists in the VCF spec.

ldgauthier commented 3 years ago

It looks like that's just what Don did in #6858

TnakaNY commented 3 years ago

I am facing same problem when I tried to merge Mutect2 vcf files (from several patients) to one. I tried to use bcftools merge, but I got a same error. I used Mutect2 of GATK4.1.9.0.

From when does Mutect2 have this error?

I am thinking of downgrade GATK tools such as 4.1.7.0 or older version. Does anyone have information? Which GATK4 Mutect2 version should I try?

asntech commented 3 years ago

Is this issue fixed yet? @TnakaNY I am using GATK v4.1.7.0 and still getting the error wrong number of fields in AS_FilterStatus? while merging Mutect2 VCF files from multiple donors using the bcftools merge.

jkobject commented 2 years ago

I am running v4.2.4.0 and still have the issue. It seems that mutect2 is doing the wrong thing of separating the AS_FilterStatus for different allele using | and listing many AS_FilterStatus per alleles using , but the rule is actually the opposite apparently https://github.com/samtools/bcftools/issues/1570

davetang commented 2 years ago

I came across this problem when using bcftools merge on some Mutect2 VCFs and as suggested, a quick fix is to change AS_FilterStatus in the VCF header. However I changed Number to ., which according to the VCF spec:

If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’.

Sharing my code below in case it is useful for others:

bcftools head my.vcf \
   | perl -nle "if (/ID=AS_FilterStatus,/){ s/Number=A/Number=./ } print" > my.header.txt

bcftools reheader -h my.header.txt my.vcf \
   | bgzip > my.vcf.gz
tabix my.vcf.gz
kasia-te commented 2 years ago

I have the same problem and I use sed to change the header line before merging. But could you consider changing the allele separator in this field to , and keep the Number=A in the header to allow for easy splitting of the field with bcftools norm -m, as in the examples below: Number=A - split

zcat test.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    ANNOTATION=aboutT,aboutC        GT      0/2     0/1

bcftools norm -m -any test.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test.vcf.gz; Date=Wed Jun  8 20:51:27 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT  GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutC  GT      0/1     0/0

Number=. - no split

zcat test1.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/2     0/1

bcftools norm -m -any test1.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test1.vcf.gz; Date=Wed Jun  8 21:14:22 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/1     0/0

Best, Kasia

jkobject commented 2 years ago

Yes we made a workflow on Terra to debug this: https://dockstore.org/workflows/github.com/broadinstitute/depmap_omics/omics_mutect2:dev?tab=files this is the fix_mutect2.wdl file I believe

dining-philosopher commented 1 year ago

Hello!

I wrote a script to circumvent this error:

#!/usr/bin/python3

# makes Mutect2 output vcf compliant with VCF specification in case of AS_FilterStatus delimiters at multiallelic loci
# ("wrong number of fields in AS_FilterStatus?")

# https://github.com/broadinstitute/gatk/issues/6857

# usage: cat sample.vcf | correct_mutect.py | bgzip > sample.right.vcf.gz && tabix -p vcf sample.right.vcf.gz

import sys

if __name__ == "__main__":
    for l in sys.stdin:
        if l[0] != "#":
            cols = l.split("\t")
            # if "," in cols[4]: # not sure if the correction is needed at biallelic sites
            if True:
                info = cols[7].split(";")
                for i in range(len(info)):
                    a = info[i]
                    if a[:16] == "AS_FilterStatus=":
                        b = list(map(lambda c: c.split(","), a[16:].split("|")))
                        info[i] = "AS_FilterStatus=" + ",".join(map(lambda c: "|".join(c), b))
                cols = cols[:7] + [";".join(info)] + cols[8:]
                l = "\t".join(cols)
        print(l, end = "")
antonylebechec commented 1 year ago

Hello!

Same problem with comma in Description field in VCF header.

I wrote a awk script to prevent error in further tools that can not deal with this format.

fix_vcf_header_with_comma_in_description.awk

BEGIN {
    sep=";"    # alternative separator 
}
{
    # header line with comma in Description
    if ( ($0 ~ /^##INFO/ || $0 ~ /^##FORMAT/) && $0 ~ /Description=\".*,.*\"/ ) {
        match($0, /^(.*)(Description=\".*\")(.*)$/, arr);
        gsub(",",sep,arr[2]);
        print arr[1] arr[2] arr[3]
    # other lines
    } else {
        print $0
    }
}

Usage:

awk -f fix_vcf_header_with_comma_in_description.awk example.vcf
cat example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk
bcftools view example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk
dpmccabe commented 2 months ago

A faster awk alternative to the above Python script to swap the characters in the variant rows:

awk '{
    # find the filter status field
    match($0, /AS_FilterStatus=[^;]+;/);

    if (RSTART != 0) {
        # the line matches
        before = substr($0, 1, RSTART - 1);
        match_str = substr($0, RSTART, RLENGTH);
        after = substr($0, RSTART + RLENGTH);

        # temp replace "|" with a non-printing char to swap "|" and "," chars
        gsub(/\|/, "\x1e", match_str);
        gsub(/,/, "|", match_str);
        gsub(/\x1e/, ",", match_str);

        # print modified line
        print before match_str after;
    } else {
        # no match
        print $0;
    }
}' your.vcf > fixed.vcf