brentp / vcfanno

annotate a VCF with other VCFs/BEDs/tabixed files
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0973-5
MIT License

how to handle multi record of one allele #127

Closed liserjrqlxue closed 3 years ago

liserjrqlxue commented 3 years ago

I'm testing the results of vcfanno on databases like HGMD Pro, whose VCF may not be perfect. I found that one variant may have multiple records, like:

##fileformat=VCFv4.2
##note=VCF file is compatible with VCFv4.3 if required by updating fileformat parameter to v4.3
##copyright=HGMD. Not for redistribution.
##source=HGMD_PRO_2019.3
##reference=GRCh37
##comment="REF and ALT sequences are both on forward strand of reference assembly"
##INFO=<ID=CLASS,Number=1,Type=String,Description="Mutation Category, https://portal.biobase-international.com/hgmd/pro/global.php#cats">
##INFO=<ID=MUT,Number=1,Type=String,Description="HGMD mutant allele">
##INFO=<ID=GENE,Number=1,Type=String,Description="Gene symbol">
##INFO=<ID=STRAND,Number=1,Type=String,Description="Gene strand">
##INFO=<ID=DNA,Number=1,Type=String,Description="DNA annotation">
##INFO=<ID=PROT,Number=1,Type=String,Description="Protein annotation.  The '=' (equals sign) in protein HGVS descriptions have been replaced with '%3D'.  Instructions to deal with such characters with special meaning are given in Section 1.2 of the VCFv4.3 specfication.">
##INFO=<ID=DB,Number=1,Type=String,Description="dbSNP identifier, build 146">
##INFO=<ID=PHEN,Number=1,Type=String,Description="HGMD primary phenotype">
##INFO=<ID=RANKSCORE,Number=1,Type=Float,Description="HGMD computed rankscore.  The HGMD computed rankscore is a probability of pathogenicity between 0 and 1, with 1 being most likely disease-causing compared to other HGMD entries. The score is computed using a machine learning approach, and is based upon multiple lines of evidence, including HGMD literature support for pathogenicity, evolutionary conservation (100 way vertebrate alignment), variant allele frequency and in-silico pathogenicity prediction. Scores may be used to prioritize and rank multiple HGMD variants which have been found in the same sample, so in practise refer to the tag first and then use the rankscore to rank variants in the same variant class (e.g. DM or DM? etc). This feature is under ongoing development.">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
1   11853964    CM135284    C   T   .   .   CLASS=DM;MUT=ALT;GENE=MTHFR;STRAND=-;DNA=NM_005957.4:c.1530G>A;PROT=NP_005948.3:p.K510%3D;DB=rs765586205;PHEN="Homocystinuria"
1   11853964    CS1412548   C   T   .   .   CLASS=DM;MUT=ALT;GENE=MTHFR;STRAND=-;DNA=NM_005957.4:c.1530G>A;PROT=NP_005948.3:p.K510%3D;DB=rs765586205;PHEN="Respiratory_failure_&_hypotonia"

If I use self, to cover the case where the input VCF has multiple alleles at one position, vcfanno annotates with only the last record.
If I use concat, vcfanno annotates with all records, but that is not suitable for multiple alleles at one position.

Is there a simple way to solve this issue, or should I modify the hgmd vcf?


brentp commented 3 years ago

if you decompose both the query and annotation vcf, then concat should be fine as it still matches on REF, ALT and POS. Is there a reason that won't work for you?
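To illustrate why matching on POS, REF and ALT makes concat safe once both files are decomposed, here is a minimal Python sketch. It is not vcfanno's implementation: `build_index` and `annotate_concat` are hypothetical helpers, the two records are the HGMD lines from the issue reduced to a couple of INFO fields (quotes around PHEN dropped), and the C>G query allele is invented for the example.

```python
from collections import defaultdict

# The two HGMD records from the issue, reduced to the fields used here.
annotations = [
    ("1", 11853964, "C", "T", {"CLASS": "DM", "PHEN": "Homocystinuria"}),
    ("1", 11853964, "C", "T", {"CLASS": "DM", "PHEN": "Respiratory_failure_&_hypotonia"}),
]


def build_index(records):
    """Group annotation INFO dicts by their exact allele key (CHROM, POS, REF, ALT)."""
    index = defaultdict(list)
    for chrom, pos, ref, alt, info in records:
        index[(chrom, pos, ref, alt)].append(info)
    return index


def annotate_concat(query, index, field):
    """Join `field` from every annotation record that matches the query allele."""
    values = [info[field] for info in index.get(query, []) if field in info]
    return ",".join(values)


index = build_index(annotations)

# Same allele as both HGMD records: both phenotypes are kept.
print(annotate_concat(("1", 11853964, "C", "T"), index, "PHEN"))

# Hypothetical different allele at the same position: nothing is concatenated.
print(repr(annotate_concat(("1", 11853964, "C", "G"), index, "PHEN")))
```

Because the key includes REF and ALT, records for other alleles at the same position never leak into the concatenated value, which is the property that decomposition is meant to guarantee.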

liserjrqlxue commented 3 years ago

Thanks for your suggestion. But the query VCF is created directly by GATK, and multi-allelic variants indicate that the genotype is '1/2', so I think modifying the annotation VCF may be more suitable for me.

brentp commented 3 years ago

if you decompose that GATK vcf with bcftools norm or vt decompose+normalize, then the 1/2 will be turned into two variants, each with genotype ./1 and ./1. most software will then be able to handle that correctly.
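As a rough picture of what decomposition does to a 1/2 call, here is a small Python sketch. It is not how bcftools or vt are implemented, and it does no normalization (no left-alignment or trimming). The `decompose` function is hypothetical, the second ALT allele (G) is invented for the example, and the choice to mark the other ALT as missing (".") is an assumption; real tools may encode that allele differently.

```python
def decompose(chrom, pos, ref, alts, genotype):
    """Split one multi-allelic record into one biallelic record per ALT allele.

    Assumption of this sketch: an allele that points at a different ALT
    becomes missing ("."); real tools may encode it differently.
    """
    sep = "|" if "|" in genotype else "/"
    calls = genotype.split(sep)
    out = []
    for i, alt in enumerate(alts, start=1):
        new_calls = []
        for call in calls:
            if call in ("0", "."):
                new_calls.append(call)   # reference or already missing: unchanged
            elif int(call) == i:
                new_calls.append("1")    # this record's ALT is now allele 1
            else:
                new_calls.append(".")    # some other ALT: marked missing
        out.append((chrom, pos, ref, alt, sep.join(new_calls)))
    return out


# A heterozygous call carrying two different ALT alleles (the G allele is hypothetical).
for record in decompose("1", 11853964, "C", ["T", "G"], "1/2"):
    print(record)
```

Each output record carries exactly one ALT, so it can be matched against a decomposed annotation file on POS, REF and ALT.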