Illumina / Nirvana

The nimble & robust variant annotator
https://illumina.github.io/NirvanaDocumentation/
GNU General Public License v3.0
170 stars 44 forks source link

variantFrequencies and alleleDepth are sometimes lost when annotating multi-sample VCF files #80

Closed heseber closed 2 years ago

heseber commented 2 years ago

This issue occurs if:

Example:

chr3    183906045       .       G       T,A     .       PASS    AS_FilterStatus=SITE;AS_SB_TABLE=65,55|62,54;ECNT=1;GERMQ=16;ROQ=93;DP=447;MBQ=20,20,20;MFRL=163,152,166;MMQ=60,60,60;MPOS=39,33;POPAF=7.3,7.3;TLOD=253,344     GT:AD:AF:DP:F1R2:F2R1:SB        0/1:120,116,.:0.479,.:236:45,29,.:73,84,.:65,55,62,54   0/2:64,.,138:.,0.684:202:20,.,55:43,.,83:34,30,72,66 

When annotating single-sample VCF files for these two samples, I get

"samples": [
      {
        "genotype": "0/1",
        "variantFrequencies": [
          0.4915
        ],
        "totalDepth": 236,
        "alleleDepths": [
          120,
          116
        ]
      }
    ],

and

    "samples": [
      {
        "genotype": "0/1",
        "variantFrequencies": [
          0.6832
        ],
        "totalDepth": 202,
        "alleleDepths": [
          64,
          138
        ]
      }
    ],

However, if I annotate a multi-sample VCF file, variantFrequencies and alleleDepths are lost:

  "samples": [
    {
      "genotype": "0/1",
      "totalDepth": 236
    },
    {
      "genotype": "0/2",
      "totalDepth": 202
    },
 ]
rajatshuvro commented 2 years ago

Hi @heseber , Can you also share the single sample VCF lines please?

Thanks Rajat

heseber commented 2 years ago

Hi @rajatshuvro,

Details

As it turns out, Nirvana cannot annotate VCF files with any multiallelic lines correctly. If there is more than one alternative allele for a position, such as A,T, then the AF field is not a single number with the allele frequency, but a list of numbers separated by commas, one number for each alternative allele. Usually samples have only one allele per position, so after merging multiple VCF files into a singe multi-sample VCF file, each AF field for a multiallellic position is a comma-separated list with only one non-empty variant frequency for each sample, whereas all other elements of the comma separated AF list are simple single dots, such as 0.6,., where the first allele has an AF of 0.6, and the second is not present in this sample. It is possible to tell bcftools merge to not create any multiallelic lines but multiple single-allelic lines instead by adding the option -m none. This if fact works with Nirvana, because now there are no multiallelic lines and, therefore, each AF field is a simple number and not a list. So using bcftools merge -m none is a workaround for this issue as long as none of the single-sample input VCF files has multiallelic lines.

Summary

Nirvana does not correctly annotate VCF files with multiallelic lines. Any such VCF file needs to be converted by transforming multiallelic lines into multiple single-allelic lines. Workaround is to use bcftools merge -m none which works as long as none of the input single-sample VCF files has multiallelic lines.

Example

I attach a zip file with example files: issue80_example.zip. Included are:

For each of these files, the output of Nirvana v3.18.1 is included, and the subdirectory samples_merged_no_multiallelic_split contains singe-sample json files obtained by splitting the multi-sample json file for the no-multiallelic VCF.

heseber commented 2 years ago

Just one additional note: when annotating single-sample VCF files that have multiallelic lines, they need to be converted (normalized) by transforming multiallelic lines to multiple single-allelic lines. This can be done with bcftools norm -m- -Oz -o output.vcf.gz input.vcf.gz

rajatshuvro commented 2 years ago

Hello @heseber , thanks for the detailed bug report. Here is the ticket for the issue: https://nirvana-annotator.atlassian.net/browse/NIR-1040.

Best Rajat

rajatshuvro commented 2 years ago

Has been fixed in the current develop branch. Will be formally available in the next release.