biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

NCBI genes map to ensembl genes with invalid identifiers #94

Open dhimmel opened 3 years ago

dhimmel commented 3 years ago

I've noticed three genes where the value for ensembl.gene does not begin with ENSG:

https://mygene.info/v3/gene/263?fields=ensembl
The correct value of ensembl.gene appears to be ENSG00000237801.
{"_id": "263", "_version": 1, "ensembl": {"gene": "263", "transcript": "263-1", "translation": [], "type_of_gene": "rRNA"}}

https://mygene.info/v3/gene/55872?fields=ensembl
The correct value of ensembl.gene appears to be ENSG00000168078.
{"_id": "55872", "_version": 3, "ensembl": {"gene": "55872", "transcript": "55872-1", "translation": [], "type_of_gene": "tRNA"}}

https://mygene.info/v3/gene/126231?fields=ensembl
The correct value of ensembl.gene appears to be ENSG00000189144.
{"_id": "126231", "_version": 2, "ensembl": {"gene": "126231", "transcript": "126231-1", "translation": [], "type_of_gene": "tRNA"}}

In these cases, it seems the value for ensembl.gene has been set to entrezgene (the ncbigene id). Any ideas what the problem is?
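For anyone who wants to screen their own results for the same problem, here is a minimal sketch of the check. It runs against the three responses pasted above rather than calling the API. The helper name `invalid_ensembl_genes` is hypothetical, and the handling of `ensembl` as either a single object or a list is an assumption about the response shape.

```python
import json
import re

# Human Ensembl gene IDs consist of "ENSG" followed by 11 digits.
ENSG_PATTERN = re.compile(r"^ENSG\d{11}$")


def invalid_ensembl_genes(doc):
    """Return the ensembl.gene values in a gene document that are not ENSG IDs."""
    ensembl = doc.get("ensembl", [])
    # Assumption: the field may be a single object or a list of objects.
    entries = ensembl if isinstance(ensembl, list) else [ensembl]
    return [
        entry["gene"]
        for entry in entries
        if "gene" in entry and not ENSG_PATTERN.match(entry["gene"])
    ]


# The three responses quoted in this issue, copied verbatim.
responses = [
    '{"_id": "263", "_version": 1, "ensembl": {"gene": "263", "transcript": "263-1", "translation": [], "type_of_gene": "rRNA"}}',
    '{"_id": "55872", "_version": 3, "ensembl": {"gene": "55872", "transcript": "55872-1", "translation": [], "type_of_gene": "tRNA"}}',
    '{"_id": "126231", "_version": 2, "ensembl": {"gene": "126231", "transcript": "126231-1", "translation": [], "type_of_gene": "tRNA"}}',
]

# Map each document _id to its offending ensembl.gene values.
flagged = {}
for raw in responses:
    doc = json.loads(raw)
    bad = invalid_ensembl_genes(doc)
    if bad:
        flagged[doc["_id"]] = bad

print(flagged)
```

All three documents are flagged, and in each case the offending value equals the document's own _id.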

kevinxin90 commented 3 years ago

This issue was introduced when we integrated Metazoa species data from Ensembl through BioMart.

File path: ensembl_metazoa/49/gene_ensemblgenemain.txt
Text-based search: awk '$2 == "263" { print $0 }' gene_ensemblgenemain.txt
Returns: 27923 263 rns 3153 3520 Mt 1 rRNA

Since no entrezgene ID can be mapped to this record, we use its Ensembl identifier (263) as the _id. That value accidentally collides with the gene doc with _id:263, which comes from Entrez for the human species.
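To make the collision concrete, here is a rough sketch of the failure mode described above. It is not the actual BioThings merge code: `ensembl_doc_id`, `merge_sources`, and the document fields are hypothetical and only illustrate what happens when a fallback _id shares a namespace with Entrez IDs.

```python
def ensembl_doc_id(record, ensembl_to_entrez):
    """Use the mapped Entrez ID if one exists, otherwise fall back to the Ensembl gene ID."""
    return ensembl_to_entrez.get(record["gene"], record["gene"])


def merge_sources(entrez_docs, ensembl_records, ensembl_to_entrez):
    """Merge Ensembl records into Entrez documents keyed by _id (illustrative only)."""
    merged = {doc_id: dict(doc) for doc_id, doc in entrez_docs.items()}
    for record in ensembl_records:
        doc_id = ensembl_doc_id(record, ensembl_to_entrez)
        # An existing document with the same _id silently absorbs the record.
        merged.setdefault(doc_id, {"_id": doc_id})["ensembl"] = record
    return merged


# Hypothetical human Entrez document for gene 263 (taxid 9606 is human).
entrez_docs = {"263": {"_id": "263", "taxid": 9606}}

# Metazoa record whose identifier is the bare string "263", as in the awk output.
metazoa_record = {"gene": "263", "type_of_gene": "rRNA"}

# No Entrez mapping exists for this record, so the fallback _id is "263".
merged = merge_sources(entrez_docs, [metazoa_record], ensembl_to_entrez={})

print(merged["263"])
```

In this sketch the human document ends up carrying the Metazoa record under `ensembl`, which matches the symptom reported for gene 263.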