The detail version or url for human gene annotation used in annovar

sssimonyang commented 3 years ago

Hi Dr. Wang

ANNOVAR provides prepared refGene, knownGene, and ensGene annotations for gene-based annotation. However, the website does not give the detailed version, or the URLs of the FASTA and GTF/GFF files used, for any of them except ensGene.

For ensGene, the website says it comes from Gencode v31. When I downloaded the hg19 GTF gencode.v31lift37.basic.annotation.gtf.gz (http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/GRCh37_mapping/gencode.v31lift37.basic.annotation.gtf.gz), some of the genes did not match ANNOVAR. For example, GGNBP2 is ENSG00000278311 in gencode.v31 but ENSG00000005955 in hg19_ensGene.txt.

Could you please provide the URLs of the FASTA and GTF files used in ANNOVAR for refGene, knownGene, and ensGene? They would be very useful for our additional annotation.

Thank you very much!

kaichop commented 3 years ago

The gene annotation is based purely on the UCSC Genome Browser and is not related to ENSEMBL. The two may have different annotations, but I have never used any Gencode files in compiling any of the ANNOVAR files. Using liftOver is a very bad idea for any purpose, unless a quick-and-dirty analysis is all you want. The date stamp of the files specifies the version of gene annotation used in ANNOVAR: it is the date when the gene annotations were downloaded from UCSC and, again, it is not related to Gencode/ENSEMBL. You should use only the mRNA FASTA file and the refGene.txt or ensGene.txt file provided with ANNOVAR to ensure consistency of genome annotation.

sssimonyang commented 3 years ago

I simply wanted to do a perfect ID mapping from ens_gene to gene_symbol for my annotation file, and I didn't plan to do any kind of liftover.

You said that the gene annotation is based solely on the UCSC GTF file, so I downloaded hg19.ensGene.gtf.gz from https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/. After intersecting the two, all Ensembl genes in the annotation file can be found in hg19.ensGene.gtf.gz.

However, a GTF canonically has a gene_symbol field that gives the symbol corresponding to each ens_gene, whereas in hg19.ensGene.gtf the gene_symbol is simply identical to the ens_gene.

Do you have any suggestion for doing a perfect ID mapping, without discarding the IDs that tools such as mygene or bitr leave unmapped?
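
For readers following along, the mapping described above can be sketched roughly as follows. This is a minimal illustration and not part of ANNOVAR: the function names are hypothetical, the toy GTF lines use placeholder coordinates, and it assumes the GTF carries `gene_id` and `gene_name` attributes and that any version suffix follows the first dot. IDs without a usable symbol are reported rather than silently dropped.

```python
import re

# Matches quoted GTF attributes such as: gene_id "ENSG00000005955";
_ATTR = re.compile(r'(\S+) "([^"]*)"')


def gtf_gene_symbols(lines):
    """Collect gene_id -> gene_name from GTF lines, skipping comment lines."""
    table = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = dict(_ATTR.findall(fields[8]))
        raw_id = attrs.get("gene_id")
        if raw_id is None:
            continue
        # Assumption: a version suffix, if present, starts at the first dot.
        gene_id = raw_id.split(".")[0]
        name = attrs.get("gene_name")
        # A name that merely repeats the ID carries no symbol information.
        if name and name not in (raw_id, gene_id) and gene_id not in table:
            table[gene_id] = name
    return table


def map_ids(ensg_ids, table):
    """Return (mapped, unmapped); unmapped IDs are listed, not discarded."""
    mapped = {g: table[g] for g in ensg_ids if g in table}
    unmapped = sorted(g for g in ensg_ids if g not in table)
    return mapped, unmapped


# Toy records (placeholder coordinates) mirroring the two GGNBP2 IDs in this thread.
SAMPLE_GTF = [
    "##description: toy example\n",
    'chr17\tTOY\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG00000278311.1"; gene_name "GGNBP2";\n',
    'chr17\tTOY\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG00000005955"; gene_name "ENSG00000005955";\n',
]

table = gtf_gene_symbols(SAMPLE_GTF)
mapped, unmapped = map_ids(["ENSG00000278311", "ENSG00000005955"], table)
```

Keeping the `unmapped` list makes it possible to try a second source for those IDs later instead of losing them.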

kaichop commented 3 years ago

The ensGene annotation is downloaded directly from UCSC and incorporated into ANNOVAR. It is outdated, but users can rebuild it themselves with ANNOVAR's retrieve_seq_from_fasta.pl program, given a new ensGene.txt file. Mapping a GTF to the genome is complicated, and different people do it in different ways. ANNOVAR does not use any GTF, and I did not say that the ensGene annotation in ANNOVAR is based on a UCSC GTF file. I think you are trying to relate ENSGxxxx IDs to gene symbols. This is feasible, but it cannot be made perfect. What you describe sounds reasonable, and discarding unmapped records would be fine. ensGene is the same as the Gencode gene set nowadays, but both suffer from the same problem of constantly changing gene IDs, and one "gene" in Gencode can also correspond to many different genes at different locations in the human genome. There is no perfect one-to-one matching.
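
The lack of a one-to-one relationship can be checked before committing to a mapping. The sketch below is only an illustration with hypothetical function names and toy rows. It assumes the ensGene.txt file follows the UCSC genePred-extended layout with a leading bin column, so that the chromosome is column index 2 and the gene ID is column index 12; verify this against your own file. Using the chromosome as a stand-in for "location" is a crude simplification.

```python
from collections import defaultdict


def genes_on_multiple_chroms(lines, chrom_col=2, gene_col=12):
    """Return {gene_id: sorted chromosomes} for IDs seen on more than one chromosome."""
    chroms = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(chrom_col, gene_col):
            continue  # skip short or malformed rows
        chroms[fields[gene_col]].add(fields[chrom_col])
    return {g: sorted(c) for g, c in chroms.items() if len(c) > 1}


def symbols_with_multiple_ids(id_to_symbol):
    """Return {symbol: sorted gene IDs} for symbols shared by several gene IDs."""
    ids = defaultdict(set)
    for gene_id, symbol in id_to_symbol.items():
        ids[symbol].add(gene_id)
    return {s: sorted(i) for s, i in ids.items() if len(i) > 1}


def toy_row(transcript, chrom, gene):
    """Build a 16-column toy row in the assumed layout (all values are placeholders)."""
    fields = ["0", transcript, chrom, "+", "0", "100", "0", "100",
              "1", "0,", "100,", "0", gene, "cmpl", "cmpl", "0,"]
    return "\t".join(fields)


rows = [
    toy_row("TX1", "chr1", "GENE_A"),
    toy_row("TX2", "chr5", "GENE_A"),  # same gene ID on a second chromosome
    toy_row("TX3", "chr2", "GENE_B"),
]
split_genes = genes_on_multiple_chroms(rows)
shared = symbols_with_multiple_ids(
    {"ENSG00000278311": "GGNBP2", "ENSG00000005955": "GGNBP2"}
)
```

Any ID or symbol that shows up in either result is a case where a simple dictionary lookup would silently pick one of several candidates.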

sssimonyang commented 3 years ago

OK. Many thanks for your detailed explanation.

WGLab / doc-ANNOVAR

The detail version or url for human gene annotation used in annovar #159