biothings / mygene.info

MyGene.info: A BioThings API for gene annotations
http://mygene.info

Transcript query returning multiple ensembl gene ids for the same gene #153

Closed sdhutchins closed 2 months ago

sdhutchins commented 2 months ago

Hey, all!

I've been incorporating MyGene.info into a small tool that retrieves gene-related information and uploads it to an external tool that requires the Ensembl gene ID.

When running it on about 15 transcript IDs, I came across 2 cases where the query returned multiple Ensembl gene IDs (one of them being for the primary assembly).

Below is an example.

MyGene Query URL

https://mygene.info/v3/query?q=ENST00000368358&fields=ensembl&size=10&from=0&fetch_all=false&facet_size=10&entrezonly=false&ensemblonly=false&dotfield=false

MyGene Query Output

{
  "took": 2,
  "total": 1,
  "max_score": 3.789436,
  "hits": [
    {
      "_id": "57657",
      "_score": 3.789436,
      "ensembl": [
        {
          "gene": "ENSG00000143630",
          "protein": "ENSP00000357342",
          "transcript": [
            "ENST00000368358",
            "ENST00000467204",
            "ENST00000492035",
            "ENST00000496230"
          ],
          "translation": [
            {
              "protein": "ENSP00000357342",
              "rna": "ENST00000368358"
            }
          ],
          "type_of_gene": "protein_coding"
        },
        {
          "gene": "ENSG00000263324",
          "protein": "ENSP00000458364",
          "transcript": [
            "ENST00000572822",
            "ENST00000573962",
            "ENST00000575670",
            "ENST00000576844"
          ],
          "translation": [
            {
              "protein": "ENSP00000458364",
              "rna": "ENST00000575670"
            }
          ],
          "type_of_gene": "protein_coding"
        }
      ]
    }
  ]
}

I'm wondering if it's possible to get more information from Ensembl's API.

For example, their latest API returns the parent/canonical Ensembl gene ID for a transcript.

Query example for a transcript with multiple gene ids: https://rest.ensembl.org/lookup/id/ENST00000555289?content-type=application/json

{
  "logic_name": "havana_homo_sapiens",
  "seq_region_name": "14",
  "object_type": "Transcript",
  "db_type": "core",
  "id": "ENST00000555289",
  "end": 94390635,
  "assembly_name": "GRCh38",
  "length": 609,
  "biotype": "protein_coding_CDS_not_defined",
  "is_canonical": 0,
  "display_name": "SERPINA1-213",
  "species": "homo_sapiens",
  "version": 5,
  "source": "havana",
  "start": 94383629,
  "strand": -1,
  "Parent": "ENSG00000197249"
}

Thanks for any and all help.

Also pinging issues #61 and #137, which are loosely related but may help in this endeavor.

newgene commented 2 months ago

@sdhutchins thanks for reporting this. The root cause of the multiple Ensembl gene matches is the mapping we obtained from Ensembl. Ensembl mapped two Ensembl genes (ENSG00000143630 and ENSG00000263324) to the same NCBI gene ID, 57657. That's why you see two items under the ensembl field for this gene object (with 57657 as the primary key in the _id field).

I verified this from Ensembl's BioMart service as well with this query. It returns:

Gene stable ID     NCBI gene (formerly Entrezgene) ID
ENSG00000143630    57657
ENSG00000263324    57657

This mapping might change in the future. If it does, we will reflect the change in MyGene.info as well.

Having said that, as you suggested, we can look into including some additional fields from Ensembl, which might help us or users flag the particular matching Ensembl gene they need. For example, an is_canonical value of 1 or 0 could be used to differentiate the two matching Ensembl genes:

https://rest.ensembl.org/lookup/id/ENST00000368358?content-type=application/json (is_canonical: 1) vs. https://rest.ensembl.org/lookup/id/ENST00000555289?content-type=application/json (is_canonical: 0)
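For anyone who wants to check these values programmatically, here is a minimal Python sketch of the lookup. The URL pattern and the Parent and is_canonical fields come from the responses shown in this thread; the helper names (lookup_url, fetch_lookup, parent_and_canonical) are made up for illustration, and the sample record is trimmed from the response quoted above.

```python
import json
import urllib.request

# URL pattern taken from the lookup links in this thread.
ENSEMBL_LOOKUP = "https://rest.ensembl.org/lookup/id/{}?content-type=application/json"


def lookup_url(stable_id):
    """Build the Ensembl REST lookup URL for a stable ID (hypothetical helper)."""
    return ENSEMBL_LOOKUP.format(stable_id)


def fetch_lookup(stable_id, timeout=30):
    """Fetch the lookup record as a dict. This performs a network request."""
    with urllib.request.urlopen(lookup_url(stable_id), timeout=timeout) as resp:
        return json.load(resp)


def parent_and_canonical(record):
    """Return (parent gene ID, is_canonical as bool) from a transcript record."""
    return record.get("Parent"), bool(record.get("is_canonical"))


# Offline example, using fields trimmed from the response quoted in this thread.
sample = {"id": "ENST00000555289", "is_canonical": 0, "Parent": "ENSG00000197249"}
print(parent_and_canonical(sample))
```

With network access, parent_and_canonical(fetch_lookup("ENST00000368358")) would give the parent gene and canonical flag for the transcript in the original query.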

newgene commented 2 months ago

As an intermediate solution, you can filter out the unwanted Ensembl records from the query results using our recently added post-processing feature based on JMESPath. Add these two parameters to your query:

jmespath=ensembl.transcript|[?contains(@,ENST00000368358)]&jmespath_exclude_empty=true

This should filter out any ensembl record that does not contain ENST00000368358 under the ensembl.transcript field, which should effectively serve the purpose of your query.
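If you would rather do the same filtering on the client side, a minimal Python sketch might look like the one below. It is not part of MyGene.info; the function name is hypothetical, the sample hit is trimmed from the query output earlier in this thread, and the handling of a single (non-list) ensembl record or transcript is a defensive assumption about the response shape.

```python
def ensembl_genes_for_transcript(hit, transcript_id):
    """Return the Ensembl gene IDs in one hit whose transcript list
    contains transcript_id (hypothetical helper, not a MyGene.info API)."""
    records = hit.get("ensembl", [])
    # Assumption: the field may be a single object instead of a list.
    if isinstance(records, dict):
        records = [records]
    genes = []
    for rec in records:
        transcripts = rec.get("transcript", [])
        # Assumption: a lone transcript may come back as a plain string.
        if isinstance(transcripts, str):
            transcripts = [transcripts]
        if transcript_id in transcripts:
            genes.append(rec["gene"])
    return genes


# Hit trimmed from the query output shown earlier in this thread.
hit = {
    "_id": "57657",
    "ensembl": [
        {
            "gene": "ENSG00000143630",
            "transcript": ["ENST00000368358", "ENST00000467204",
                           "ENST00000492035", "ENST00000496230"],
        },
        {
            "gene": "ENSG00000263324",
            "transcript": ["ENST00000572822", "ENST00000573962",
                           "ENST00000575670", "ENST00000576844"],
        },
    ],
}
print(ensembl_genes_for_transcript(hit, "ENST00000368358"))
```

For the example hit, only the record listing ENST00000368358 among its transcripts is kept, so the other gene ID mapped to 57657 is dropped.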

Hope this helps :-)

sdhutchins commented 2 months ago

Thank you so much for checking into this, @newgene!!!

newgene commented 2 months ago

You are welcome! Closing this issue for now. Let us know if you encounter any other issues.