Ensembl / ensembl-vep

The Ensembl Variant Effect Predictor predicts the functional effects of genomic variants
https://www.ensembl.org/vep
Apache License 2.0
437 stars 149 forks source link

Outdated or missing genes when using RefSeq cache #1659

Open gabrielhim opened 2 months ago

gabrielhim commented 2 months ago

When using RefSeq merged cache for GRCh37, VEP returns outdated symbols of some genes, even though the same genes are reported with the updatd symbol in RefSeq GFF.

For instance, genes IKBKAP and DYX1C1 are previous symbols of ELP1 and DNAAF4 (respectively), as reported in RefSeq 105.20220307 GFF:

$ zgrep annotation-source GCF_000001405.25_GRCh37.p13_genomic.gff.gz 
#!annotation-source NCBI Homo sapiens Updated Annotation Release 105.20220307
$ zgrep -w IKBKAP GCF_000001405.25_GRCh37.p13_genomic.gff.gz 
NC_000009.11    BestRefSeq  gene    111629797   111696404   .   -   .   ID=gene-ELP1;Dbxref=GeneID:8518,HGNC:HGNC:5959,MIM:603722;Name=ELP1;description=elongator acetyltransferase complex subunit 1;gbkey=Gene;gene=ELP1;gene_biotype=protein_coding;gene_synonym=DYS,FD,IKAP,IKBKAP,IKI3,TOT1
$ zgrep -w DYX1C1 GCF_000001405.25_GRCh37.p13_genomic.gff.gz 
NC_000015.9 BestRefSeq  gene    55647421    55790782    .   -   .   ID=gene-DNAAF4-CCPG1;Dbxref=GeneID:100533483,HGNC:HGNC:43019;Name=DNAAF4-CCPG1;description=DNAAF4-CCPG1 readthrough (NMD candidate);gbkey=Gene;gene=DNAAF4-CCPG1;gene_biotype=lncRNA;gene_synonym=DYX1C1-CCPG1
NC_000015.9 BestRefSeq  gene    55709953    55800432    .   -   .   ID=gene-DNAAF4;Dbxref=GeneID:161582,HGNC:HGNC:21493,MIM:608706;Name=DNAAF4;description=dynein axonemal assembly factor 4;gbkey=Gene;gene=DNAAF4;gene_biotype=protein_coding;gene_synonym=CILD25,DYX1,DYX1C1,DYXC1,EKN1,RD

But the cache annotation reports IKBKAP and DYX1C1.

Furthermore, mitochondrial genes prefixed with "ND" (e.g. ND1, ND2, ..., ND6) are not reported by the RefSeq cache. Other genes, such as ATP6 and COX3, are reported.

These behaviours were observed both in command line execution and WEB API.

VCF tested:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample
chrM    4769    .   A   G   .   PASS    .   .   .
chrM    13708   .   G   A   .   PASS    .   .   .
chr9    111659439   .   T   C   50  PASS    .   .   .
chr15   55759193    .   T   C   48.93   PASS    .   .   .

As a comparison, we tested annotation with the GFF directly. Results are reported here.

System

Full VEP command line

Using cache

Command:

docker run --rm -v $PWD:/data ensemblorg/ensembl-vep:release_111.0 vep \
    --input_file /data/samples/check_genes.vcf \
    --assembly GRCh37 \
    --dir_cache /data/ \
    --exclude_predicted \
    --fasta /data/fasta/reference.fa \
    --fork 4 \
    --format vcf \
    --merged \
    --no_stats \
    --offline \
    --symbol \
    --force_overwrite \
    --tab --output_file /data/outputs/check_genes_cache.grch37.tsv

Ouput: check_genes_cache.grch37.txt

Using GFF

Command:

docker run --rm -v $PWD:/data ensemblorg/ensembl-vep:release_111.0 vep \
    --input_file /data/samples/check_genes.vcf \
    --assembly GRCh37 \
    --fasta /data/fasta/reference.fa \
    --format vcf \
    --gff /data/data/105.20220307/out.sorted.gff3.gz \
    --no_stats \
    --symbol \
    --force_overwrite \
    --synonyms /data/data/105.20220307/synonyms.txt \
    --tab --output_file /data/outputs/check_genes_gff.grch37.tsv

Output: check_genes_gff.grch37.txt

Could you help us with that? Is there any information we are missing when using the cache?

Thank you!

dglemos commented 2 months ago

Hi @gabrielhim, The way you are using the cache is correct. Ensembl stopped to update the geneset on GRCh37 at release 75 (Sept 2013), this is why you could be getting older gene symbols. To make sure you annotate the variants with a recent gene symbol you should keep using the GFF file. Let me know if you have more questions.

gabrielhim commented 2 months ago

Hello, @dglemos. Thank you for your reply! We'll consider using the GFF then. Here it is described that RefSeq version in release 111 cache is 105.20220307 (GCF_000001405.25_GRCh37.p13_genomic.gff). Shouldn't that information be updated then? Also, could you confirm if the GRCh38 dataset is up-to-date? Best regards.

dglemos commented 2 months ago

That is correct, for release 111 the bam file GCF_000001405.25_GRCh37.p13_genomic.gff was used. However, BAM files are only used to correct RefSeq genomic mismatches. The data, including the gene symbols, is retrieved from Ensembl.

The GRCh38 geneset was last updated around July 2023.

gabrielhim commented 2 months ago

I see. When using GFF, some transcript biotypes are not supported. We identified these ones not included in the feature_type list:

Is it in your roadmap to include these categories? Can I force their annotation somehow?

Furthermore, some transcripts are skipped with the following message:

WARNING: Unable to determine biotype of NR_003011.1
WARNING: Unable to determine biotype of NR_004388.1
WARNING: Unable to determine biotype of NR_003000.1

Is there a reason for this to happen? Their biotype is ncRNA in the GFF, which is included in the feature_type list. Here is one example in the GFF for reference:

NC_000003.12    BestRefSeq  gene    160514907   160515236   .   -   .   ID=gene-SCARNA7;Dbxref=GeneID:677767,HGNC:HGNC:32563,MIM:615644;Name=SCARNA7;description=small Cajal body-specific RNA 7;gbkey=Gene;gene=SCARNA7;gene_biotype=ncRNA;gene_synonym=U90
NC_000003.12    BestRefSeq  ncRNA   160514907   160515236   .   -   .   ID=rna-NR_003001.1;Parent=gene-SCARNA7;Dbxref=GeneID:677767,GenBank:NR_003001.1,HGNC:HGNC:32563,MIM:615644;Name=NR_003001.1;Note=scaRNA;gbkey=ncRNA;gene=SCARNA7;product=small Cajal body-specific RNA 7;transcript_id=NR_003001.1
NC_000003.12    BestRefSeq  exon    160514907   160515236   .   -   .   ID=exon-NR_003001.1-1;Parent=rna-NR_003001.1;Dbxref=GeneID:677767,GenBank:NR_003001.1,HGNC:HGNC:32563,MIM:615644;Note=scaRNA;gbkey=ncRNA;gene=SCARNA7;product=small Cajal body-specific RNA 7;transcript_id=NR_003001.1

Thank you!

ju-mu commented 4 weeks ago

I would like to emphasize that the mitochondrial genes mentioned above "ND" (e.g. ND1, ND2, ..., ND6) are also missing in the hg38 version of the refseq cache.

If you take e.g. rs193302971 and use the merged cache, ensembl will detect the correct missense variant in ND5 whereas refseq only shows up/down stream gene consequences with source = EntrezGene.

Is there any way to get all transcript_consequences for those genes when using the RefSeq cache?