WGLab / doc-ANNOVAR

Documentation for the ANNOVAR software
http://annovar.openbioinformatics.org

Multiallelic SNPs termed invalid - Unknown function for all "exonic" variants #135

Open AlexFryd opened 3 years ago

AlexFryd commented 3 years ago

Hello Prof. Wang

Thanks for developing ANNOVAR! I am experiencing a small issue after successfully performing a gene-based annotation for a number of GWAS variants:

I am using dbSNP151 on the hg38 build to make the gene-based annotations. In total, the ANNOVAR input file contained approximately 1600 SNP associations. The resulting output of gene-based annotations contained 1250 associations, meaning that 350 were considered invalid. Just by looking at the file, I noticed that all invalid SNP entries were multiallelic. Is there something I can do to deal with this?

Plus, all 41 "exonic" variants were assigned an "unknown" impact in the second output file (exonic_function). I noticed in the FAQ that this could be attributed to erroneous gene annotations. However, a recent GWAS paper used ANNOVAR for gene-based annotation and successfully reported a function for the "exonic" variants. Is there a chance that I did something wrong in building the hg38 db? (I had some issues with the file names before successfully running the script.)

Thank you in advance! Best, Alex

kaichop commented 3 years ago

One possibility is that you switched the ref/alt alleles to major/minor alleles (by default, most GWAS software such as plink uses this convention). Another possibility is that you wrote something like "G,T" as the alternative allele, which is invalid; it has to be split into two records. If you give an example line of an "invalid" annotation, I can take a look.


AlexFryd commented 3 years ago

Hello Prof. Wang,

Ok I think in that case there is a good chance it is the latter situation:


| Chr | Start | End | Ref | Alt | Trait | rsID |
| -- | -- | -- | -- | -- | -- | -- |
| 4 | 1.09E+08 | 1.09E+08 | A | C,T | Amyotrophic lateral sclerosis (sporadic) | rs10029851 |
| 8 | 94979792 | 94979792 | C | A,T | Alzheimer's disease | rs10098778 |
| 8 | 83088165 | 83088165 | A | G,T | Age-related cognitive decline (executive function) (slope of z-scores) | rs10107150 |
| 9 | 35269822 | 35269822 | T | A,C | Parkinson's disease | rs10121009 |
| 14 | 72472787 | 72472787 | C | A,T | Amyotrophic lateral sclerosis (sporadic) | rs10131300 |
| 14 | 92074037 | 92074037 | G | A,C | Amyotrophic lateral sclerosis | rs10143310 |
| 2 | 1.27E+08 | 1.27E+08 | C | A,T | Alzheimer's disease or family history of Alzheimer's disease | rs10194375 |
| 7 | 1.43E+08 | 1.43E+08 | C | A,G,T | Alzheimer's disease or family history of Alzheimer's disease | rs10265814 |
| 18 | 59084822 | 59084822 | G | A,C | Alzheimer's disease (age of onset) | rs1037757 |
| 19 | 44844996 | 44844996 | C | A,G,T | Alzheimer's disease | rs10426423 |

Here is the avinput.invalid file

Thanks in advance!

Best, Alex

kaichop commented 3 years ago

Yes, you need to separate each of these into separate entries, one per alternative allele, because they are different mutations yet associated with exactly the same rs identifier.
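
As a rough illustration of that splitting step, here is a minimal Python sketch. It assumes a tab-delimited input in which the alternative allele is the fifth column (chr, start, end, ref, alt, then any extra columns); the function name `split_multiallelic` and the `alt_col` parameter are hypothetical and not part of ANNOVAR.

```python
def split_multiallelic(line, alt_col=4):
    """Return one tab-separated record per alternative allele.

    Assumption: `line` is tab-delimited and column index `alt_col`
    (0-based) holds the alternative allele(s), e.g. "A,T".
    Lines with too few columns are returned unchanged.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) <= alt_col:
        return ["\t".join(fields)]
    records = []
    for alt in fields[alt_col].split(","):
        # Copy every column, replacing only the alt allele column.
        new_fields = fields[:alt_col] + [alt.strip()] + fields[alt_col + 1:]
        records.append("\t".join(new_fields))
    return records


# Example using one of the invalid lines from this thread.
example = "8\t94979792\t94979792\tC\tA,T\tAlzheimer's disease\trs10098778"
for record in split_multiallelic(example):
    print(record)
```

Each output record keeps the same position, reference allele, trait, and rs identifier, so the two (or three) alleles of one SNP can still be traced back to the same rsID after annotation.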


AlexFryd commented 3 years ago

Hello Prof. Wang,

Ah ok that is great! Thank you for your help!