Closed sdhutchins closed 2 months ago
@sdhutchins thanks for reporting this. The root cause of the multiple matching Ensembl genes is the mapping we obtained from Ensembl. Ensembl mapped two Ensembl genes (ENSG00000143630 and ENSG00000263324) to the same NCBI gene ID 57657. That's why you see two items under the ensembl field for this gene object (with 57657 as the primary key in the _id field).
I verified this from Ensembl's BioMart service as well with this query. It returns:
Gene stable ID | NCBI gene (formerly Entrezgene) ID |
---|---|
ENSG00000143630 | 57657 |
ENSG00000263324 | 57657 |
This mapping might change in the future; if it does, we will reflect the change in MyGene.info as well.
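For anyone who wants to check their own gene list for the same situation, here is a minimal sketch that groups (Ensembl gene ID, NCBI gene ID) pairs and reports NCBI IDs with more than one Ensembl gene. The function name `find_multi_mapped` is hypothetical, and the input rows are simply the two rows from the BioMart result above; how you export the pairs from BioMart is up to you.

```python
from collections import defaultdict


def find_multi_mapped(pairs):
    """Return {ncbi_id: [ensembl_gene_ids]} for NCBI IDs mapped to 2+ Ensembl genes.

    `pairs` is an iterable of (ensembl_gene_id, ncbi_gene_id) tuples,
    e.g. rows exported from a BioMart query.
    """
    by_ncbi = defaultdict(list)
    for ensembl_id, ncbi_id in pairs:
        # Skip exact duplicate rows so they are not counted as extra mappings.
        if ensembl_id not in by_ncbi[ncbi_id]:
            by_ncbi[ncbi_id].append(ensembl_id)
    return {ncbi: genes for ncbi, genes in by_ncbi.items() if len(genes) > 1}


# Rows copied from the BioMart result table above.
rows = [
    ("ENSG00000143630", "57657"),
    ("ENSG00000263324", "57657"),
]
print(find_multi_mapped(rows))
# {'57657': ['ENSG00000143630', 'ENSG00000263324']}
```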
Having said that, as you suggested, we can look into including some additional fields from Ensembl, which might help us or users flag the particular matching Ensembl gene they need. For example, an is_canonical value of 1 or 0 can be used to differentiate the two matching Ensembl genes:
https://rest.ensembl.org/lookup/id/ENST00000368358?content-type=application/json (is_canonical: 1)
vs.
https://rest.ensembl.org/lookup/id/ENST00000555289?content-type=application/json (is_canonical: 0)
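Until such a field is available in MyGene.info, the same check can be done against the Ensembl REST endpoint used in the two links. The following is only a sketch: the helper names `fetch_lookup` and `canonical_parent` are hypothetical, and it assumes the lookup response for a transcript carries an `is_canonical` field (as shown in the links) and a `Parent` field holding the gene ID, which you should confirm against an actual response.

```python
import json
import urllib.request

# Same endpoint as in the two links above.
ENSEMBL_LOOKUP = "https://rest.ensembl.org/lookup/id/{}?content-type=application/json"


def fetch_lookup(stable_id):
    """Fetch the Ensembl REST lookup record for one stable ID (network call)."""
    with urllib.request.urlopen(ENSEMBL_LOOKUP.format(stable_id), timeout=30) as resp:
        return json.load(resp)


def canonical_parent(records):
    """Return the parent gene ID of the first record flagged is_canonical == 1.

    Assumes each record is a transcript lookup result with `is_canonical`
    and `Parent` keys; returns None if no record is flagged canonical.
    """
    for rec in records:
        if rec.get("is_canonical") == 1:
            return rec.get("Parent")
    return None


# Example usage (requires network access):
# records = [fetch_lookup(t) for t in ("ENST00000368358", "ENST00000555289")]
# print(canonical_parent(records))
```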
As an interim solution, you can filter out the unwanted Ensembl records from the query results using our recently added post-processing feature based on JMESPath. Add these two parameters to your query:
jmespath=ensembl.transcript|[?contains(@,ENST00000368358)]&jmespath_exclude_empty=true
This should filter out the ensembl record that does not contain ENST00000368358 under the ensembl.transcript field, which should effectively serve the purpose of your query.
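If you would rather do the filtering on your side after fetching the unfiltered gene object, a rough client-side equivalent could look like the sketch below. This is not how the server-side JMESPath feature is implemented; `filter_ensembl_by_transcript` is a hypothetical helper, and it assumes the ensembl field is either a single record or a list of records, each with a transcript value that is a string or a list of strings. The gene labels in the sample hit are made-up placeholders, not real mappings.

```python
def filter_ensembl_by_transcript(hit, transcript_id):
    """Keep only the ensembl records of `hit` that list `transcript_id`.

    Assumes `hit["ensembl"]` is a dict or a list of dicts, and that each
    record's "transcript" value is a string or a list of strings.
    """
    ensembl = hit.get("ensembl", [])
    if isinstance(ensembl, dict):
        # A single record: normalize it to a one-element list.
        ensembl = [ensembl]
    kept = []
    for rec in ensembl:
        transcripts = rec.get("transcript", [])
        if isinstance(transcripts, str):
            transcripts = [transcripts]
        if transcript_id in transcripts:
            kept.append(rec)
    return kept


# Made-up sample shaped like a gene object with two ensembl records.
# "GENE_A" and "GENE_B" are placeholders, not real Ensembl gene IDs.
sample_hit = {
    "_id": "57657",
    "ensembl": [
        {"gene": "GENE_A", "transcript": ["ENST00000368358"]},
        {"gene": "GENE_B", "transcript": "ENST00000555289"},
    ],
}
print(filter_ensembl_by_transcript(sample_hit, "ENST00000368358"))
# [{'gene': 'GENE_A', 'transcript': ['ENST00000368358']}]
```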
Hope this helps :-)
Thank you so much for checking into this, @newgene!!!
You are welcome! Closing this issue for now, let us know if you encounter any other issue.
Hey, all!
I've been incorporating mygene into a small tool I'm using to retrieve gene-related information and upload it into another external tool that requires the Ensembl gene ID.
When using this for about 15 transcript ids, I came across 2 instances of multiple ensembl gene ids (1 being for the primary assembly).
Below is an example.
MyGene Query URL
MyGene Query Output
I'm wondering if it's possible to get more information from Ensembl's API.
For example, their latest API returns the parent/canonical Ensembl gene ID.
Query example for a transcript with multiple gene ids: https://rest.ensembl.org/lookup/id/ENST00000555289?content-type=application/json
Thanks for any and all help.
Also pinging issue #61 and #137 which are loosely related but may help in this endeavor.