biothings / mygeneset.info

Apache License 2.0
5 stars 3 forks source link

CTD Genesets #23

Closed dongbohu closed 3 years ago

dongbohu commented 3 years ago

This PR adds CTD geneset data plugin.

ravila4 commented 3 years ago

I think there's a couple of things to improve before merging:

  1. Avoid using capital letters for field names because API filters are case-sensitive. We can use snake_case for consistency.
  2. "ChemicalID" is probably a better field than "ChemicalName" for the primary "_id".
  3. It seems like InteractionActions can be split to a list? But I'm not sure how useful this field is, given that the interaction is going to be different for each specific gene in the set.
  4. When I ran the parser, I saw that some genesets have a surprisingly high number of genes that could not be found by mygene.info. Need to investigate more, but here's a case that I saw for zebrafish:
INFO:hub:Building zebrafish genesets (taxid = 7955)
INFO:hub:Querying ['entrezgene', 'retired'] in MyGene.info ...
INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.16ms
querying 1-1000...done.
...
INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.60ms
querying 17001-17735...done.
Finished.
11060 input query terms found no hit:
    ['2', '131076', '14', '16', '18', '163859', '20', '22', '23', '131096', '19', '26', '27', '29', '30'

In this case, I am able to find the missing genes when I don't limit the search to "zebrafish". In fact, most of these seem to be human gene IDs. For example: https://www.ncbi.nlm.nih.gov/gene/?term=131076

dongbohu commented 3 years ago

@ravila4: Thank you for your comments.

  1. Avoid using capital letters for field names because API filters are case-sensitive. We can use snake_case for consistency.

The main reason why I used field names such as ChemicalName (intead of chemical_name) is because the former was used in the CTD data file's header. I can certainly change them to snake_case.

  1. "ChemicalID" is probably a better field than "ChemicalName" for the primary "_id".

You mean changing the _id from 10,11-dihydro-10-hydroxycarbamazepine-9606 to C039775-9606? I thought about it, but if we use the chemical name in _id, wouldn't it be easier for users to search the geneset by chemical?

  1. It seems like InteractionActions can be split to a list? But I'm not sure how useful this field is, given that the interaction is going to be different for each specific gene in the set.

I hesitated about whether this field should be included in the geneset. Yes, it can be a list if you think it is better. I am open to any other options.

  1. When I ran the parser, I saw that some genesets have a surprisingly high number of genes that could not be found by mygene.info. Need to investigate more, but here's a case that I saw for zebrafish: INFO:hub:Building zebrafish genesets (taxid = 7955) INFO:hub:Querying ['entrezgene', 'retired'] in MyGene.info ... INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.16ms querying 1-1000...done. ... INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.60ms querying 17001-17735...done. Finished. 11060 input query terms found no hit: ['2', '131076', '14', '16', '18', '163859', '20', '22', '23', '131096', '19', '26', '27', '29', '30'

Good observation. I noticed this too. Here are some data rows that include the gene 131076:

ChemicalName ChemicalID CasRN GeneSymbol GeneID GeneForms Organism OrganismID Interaction InteractionActions PubMedIDs
1-Methyl-3-isobutylxanthine D015056 28822-58-4 MIX23 131076 mRNA Homo sapiens 9606 .. .. ..
2,4-dinitrotoluene C016403 121-14-2 MIX23 131076 mRNA Rattus norvegicus 10116 .. .. ..
sodium arsenite C017947 13768-07-5 MIX23 131076 mRNA Danio rerio 7955 .. .. ..

As you can see, this gene is involved in organisms 9606 (human), 10116 (rat) and 7955 (zebrafish). I have no idea why.

ravila4 commented 3 years ago

You mean changing the _id from 10,11-dihydro-10-hydroxycarbamazepine-9606 to C039775-9606? I thought about it, but if we use the chemical name in _id, wouldn't it be easier for users to search the geneset by chemical?

The issue with chemical names is that they are very long, hard to spell, and they can have synonyms, for example '1-Methyl-3-isobutylxanthine' can also be called '3-Isobutyl-1-methylxanthine'. I think the best behavior would be if the default search looks at both chemical name and id fields.

If we search for for D015056 on mychem.info: http://mychem.info/v1/query?q=D015056 , you can see that the field 'ChemicalID' in CTD actually correstponds to umls.mesh. A MeSH (Medical Subject Headings) ID is a common id type, so let's rename that field 'mesh' for clarity. Additionally, CasRN (CAS Registry Number) is another useful chemical ID, I think we should include it as cas or and search it by default too.

ravila4 commented 3 years ago

As you can see, this gene is involved in organisms 9606 (human), 10116 (rat) and 7955 (zebrafish). I have no idea why.

My guess is that they are actually looking at the genes are homologous to the human gene 131076 in those organisms, but I need to look more for the explanation

ravila4 commented 3 years ago

I hesitated about whether this field should be included in the geneset. Yes, it can be a list if you think it is better. I am open to any other options.

Let's discard this field. I'm hesitant to add gene-level interaction annotations unless someone has a use case for it.

dongbohu commented 3 years ago

@ravila4 I have addressed your comments. If you get any reply from CTD team, please let me know. Thanks.

ravila4 commented 3 years ago

@dongbohu For translating to a homolog gene ID, we can use HomoloGene data. It's easiest to query the data using mygene.info. For example for the human gene 131076 above:

http://mygene.info/v3/query?q=131076&fields=homologene

gives us the response:

{
  "took": 2,
  "total": 1,
  "max_score": 186.11436,
  "hits": [
    {
      "_id": "131076",
      "_score": 186.11436,
      "homologene": {
        "genes": [
          [
            7165,
            1276103
          ],
          [
            7227,
            40159
          ],
          [
            7955,
            393780
          ],
          [
            8364,
            548996
          ],
          [
            9031,
            418269
          ],
          [
            9544,
            714602
          ],
          [
            9598,
            460629
          ],
          [
            9606,
            131076
          ],
          [
            9615,
            478583
          ],
          [
            9913,
            508149
          ],
          [
            10090,
            381045
          ],
          [
            10116,
            288065
          ]
        ],
        "id": 18028
      }
    }
  ]
}

The lists are key value pairs [taxid, ncbigene] So the corresponding gene for taxid 10116 is 288065.

dongbohu commented 3 years ago

@ravila4 I tested the gene search with homologene. For some species, it improves the search a lot, for example, for taxid 10090 (mouse), only 4648 genes (out of 22091 genes) were found without considering homologene. When including homologene, 19951 genes are found. But for taxid 9823 (pig), no any new genes were found even when homologene is included. Only 11 (out 1016) genes match the this taxid.

I would like to get some confirmations from CTD team (or any other authority) to convince me that homologene is (at least part of ) the reason why some genes don't match the correct taxid in data source.

ravila4 commented 3 years ago

I got a reply for CTD:

At CTD, homologous genes are assimilated into a single gene page, with a single primary accession ID. For example, look at the gene page for MIX23: http://ctdbase.org/detail.go?type=gene&acc=131076

Here, gene MIX23 can be used for any of the listed species. The primary accession ID that identifies this page throughout the relational database is the first accession ID listed (here, 131076): it is highlighted in bold and typically (but not always!) corresponds to the human gene ID.

In downloadable chemical-gene files, we provide: --- chemical name --- chemical ID --- gene symbol --- primary accession ID --- organism for that specific interaction

allowing users to identify the gene and subsequent species in which the interaction was reported.

dongbohu commented 3 years ago

Thank you @ravila4. This is great! I noticed that the number of homologous genes in CTD url is not exactly same as that in mygene.info, but that's okay.

I am going to drop taxid 9823 (pig) in CTD genesets, since so many genes in this organism are missing. The other organisms' genesets are good enough after including homologous genes.

ravila4 commented 3 years ago

I did some spot checks comparing homologene data and CTD, and it looks like CTD provides a much longer list of species. For example, this gene: http://ctdbase.org/detail.go?type=gene&acc=290 vs https://www.ncbi.nlm.nih.gov/homologene?term=290%5BGene%20ID%5D. And as suspected, the pig gene (397520) is not in homologene.

ravila4 commented 3 years ago

Can you post some examples of the genes you found that don't match? I asked CTD whether they have a resource to convert from primary gene id to species-specific IDs.

ravila4 commented 3 years ago

I noticed that the number of homologous genes in CTD url is not exactly same as that in mygene.info, but that's okay.

Sorry, I misunderstood what you meant by this. Yes, I noticed the same thing. I was worried that perhaps the homolog gene from homologene might not match the ID that is in CTD is some cases. If there's no direct way to access the homologs list from CTD, we could write a web scraper as a solution for the pig data, and to check the accuracy of the conversions.

dongbohu commented 3 years ago

Let's see whether CTD has any API that allows us to get the homologene list easily. A web scraper works but will be slow. For example, pig data includes 1016 genes, only 11 of them match pig's taxid in mygene.info; so we have to run the scraper 1005 times to get the homologous genes for each of them.

ravila4 commented 3 years ago

@dongbohu Yes, I agree.

ravila4 commented 3 years ago

CTD support said they don't have a tool for converting between homolog species IDs

dongbohu commented 3 years ago

That's not good. I am reluctant to use the web scraper for a few reasons:

  1. It's slow.
  2. The code will rely on the exact CTD webpage. If the html structure changes, the code may crash completely.
  3. The NCBI Gene IDs section (which has homologene info on each CTD html page) only has gene IDs and organism scientific names (see the example of http://ctdbase.org/detail.go?type=gene&acc=131076). So we have to translate the scientific names back to taxid in order to find the gene ID that we really need. This will make the query even slower.

I tend to drop pig species for now and use whatever homologene information available in mygene.info to get the correct gene IDs. We will lose some genes in some genesets, but according to my statistics, it's acceptable.

ravila4 commented 3 years ago

@dongbohu I agree, let's drop pig for now. Can we check whether any of these have a good number of genes that can be converted? They are also included in some of our other genesets:

"dog": {"taxid": "9615"},
"chicken": {"taxid": "9031"},
"horse": {"taxid": "9796"},
"chimpanzee": {"taxid": "9598"},
"frog": {"taxid": "8364"}

Edit: removed "mosquito", "rice" and "thale-cress", since they don't appear to be in the dataset.

dongbohu commented 3 years ago

Sure. I will test these species too.

dongbohu commented 3 years ago

@ravila4: 3702 (thale-cress), 39947 (rice) and 180454 are not found in CTD data file at all; 9598 (chimpanzee) and 9796 (horse) are not good because of gene missing. So I will add frog, chicken and dog organisms in CTD genesets.

dongbohu commented 3 years ago

@ravila4: I added the homologene and three new species. Please let me know if you have any other questions or concerns. Thanks.

ravila4 commented 3 years ago

@dongbohu It looks good to me. I'll go ahead and merge it. The only thing missing now is the new 'name', 'source', and 'description' fields. But we will do that in another PR.