Closed dongbohu closed 3 years ago
I think there's a couple of things to improve before merging:
INFO:hub:Building zebrafish genesets (taxid = 7955)
INFO:hub:Querying ['entrezgene', 'retired'] in MyGene.info ...
INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.16ms
querying 1-1000...done.
...
INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.60ms
querying 17001-17735...done.
Finished.
11060 input query terms found no hit:
['2', '131076', '14', '16', '18', '163859', '20', '22', '23', '131096', '19', '26', '27', '29', '30'
In this case, I am able to find the missing genes when I don't limit the search to "zebrafish". In fact, most of these seem to be human gene IDs. For example: https://www.ncbi.nlm.nih.gov/gene/?term=131076
@ravila4: Thank you for your comments.
- Avoid using capital letters for field names because API filters are case-sensitive. We can use snake_case for consistency.
The main reason why I used field names such as ChemicalName
(intead of chemical_name
) is because the former was used in the CTD data file's header. I can certainly change them to snake_case.
- "ChemicalID" is probably a better field than "ChemicalName" for the primary "_id".
You mean changing the _id
from 10,11-dihydro-10-hydroxycarbamazepine-9606
to C039775-9606
? I thought about it, but if we use the chemical name in _id
, wouldn't it be easier for users to search the geneset by chemical?
- It seems like InteractionActions can be split to a list? But I'm not sure how useful this field is, given that the interaction is going to be different for each specific gene in the set.
I hesitated about whether this field should be included in the geneset. Yes, it can be a list if you think it is better. I am open to any other options.
- When I ran the parser, I saw that some genesets have a surprisingly high number of genes that could not be found by mygene.info. Need to investigate more, but here's a case that I saw for zebrafish: INFO:hub:Building zebrafish genesets (taxid = 7955) INFO:hub:Querying ['entrezgene', 'retired'] in MyGene.info ... INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.16ms querying 1-1000...done. ... INFO:tornado.access:200 GET /job_manager (127.0.0.1) 9.60ms querying 17001-17735...done. Finished. 11060 input query terms found no hit: ['2', '131076', '14', '16', '18', '163859', '20', '22', '23', '131096', '19', '26', '27', '29', '30'
Good observation. I noticed this too. Here are some data rows that include the gene 131076
:
ChemicalName | ChemicalID | CasRN | GeneSymbol | GeneID | GeneForms | Organism | OrganismID | Interaction | InteractionActions | PubMedIDs |
---|---|---|---|---|---|---|---|---|---|---|
1-Methyl-3-isobutylxanthine | D015056 | 28822-58-4 | MIX23 | 131076 | mRNA | Homo sapiens | 9606 | .. | .. | .. |
2,4-dinitrotoluene | C016403 | 121-14-2 | MIX23 | 131076 | mRNA | Rattus norvegicus | 10116 | .. | .. | .. |
sodium arsenite | C017947 | 13768-07-5 | MIX23 | 131076 | mRNA | Danio rerio | 7955 | .. | .. | .. |
As you can see, this gene is involved in organisms 9606 (human), 10116 (rat) and 7955 (zebrafish). I have no idea why.
You mean changing the _id from 10,11-dihydro-10-hydroxycarbamazepine-9606 to C039775-9606? I thought about it, but if we use the chemical name in _id, wouldn't it be easier for users to search the geneset by chemical?
The issue with chemical names is that they are very long, hard to spell, and they can have synonyms, for example '1-Methyl-3-isobutylxanthine' can also be called '3-Isobutyl-1-methylxanthine'. I think the best behavior would be if the default search looks at both chemical name and id fields.
If we search for for D015056 on mychem.info: http://mychem.info/v1/query?q=D015056 , you can see that the field 'ChemicalID' in CTD actually correstponds to umls.mesh
. A MeSH (Medical Subject Headings) ID is a common id type, so let's rename that field 'mesh' for clarity.
Additionally, CasRN (CAS Registry Number) is another useful chemical ID, I think we should include it as cas
or and search it by default too.
As you can see, this gene is involved in organisms 9606 (human), 10116 (rat) and 7955 (zebrafish). I have no idea why.
My guess is that they are actually looking at the genes are homologous to the human gene 131076
in those organisms, but I need to look more for the explanation
I hesitated about whether this field should be included in the geneset. Yes, it can be a list if you think it is better. I am open to any other options.
Let's discard this field. I'm hesitant to add gene-level interaction annotations unless someone has a use case for it.
@ravila4 I have addressed your comments. If you get any reply from CTD team, please let me know. Thanks.
@dongbohu For translating to a homolog gene ID, we can use HomoloGene data. It's easiest to query the data using mygene.info. For example for the human gene 131076 above:
http://mygene.info/v3/query?q=131076&fields=homologene
gives us the response:
{ "took": 2, "total": 1, "max_score": 186.11436, "hits": [ { "_id": "131076", "_score": 186.11436, "homologene": { "genes": [ [ 7165, 1276103 ], [ 7227, 40159 ], [ 7955, 393780 ], [ 8364, 548996 ], [ 9031, 418269 ], [ 9544, 714602 ], [ 9598, 460629 ], [ 9606, 131076 ], [ 9615, 478583 ], [ 9913, 508149 ], [ 10090, 381045 ], [ 10116, 288065 ] ], "id": 18028 } } ] }
The lists are key value pairs [taxid, ncbigene]
So the corresponding gene for taxid 10116
is 288065
.
@ravila4 I tested the gene search with homologene. For some species, it improves the search a lot, for example, for taxid 10090 (mouse), only 4648 genes (out of 22091 genes) were found without considering homologene. When including homologene, 19951 genes are found. But for taxid 9823 (pig), no any new genes were found even when homologene is included. Only 11 (out 1016) genes match the this taxid.
I would like to get some confirmations from CTD team (or any other authority) to convince me that homologene is (at least part of ) the reason why some genes don't match the correct taxid in data source.
I got a reply for CTD:
At CTD, homologous genes are assimilated into a single gene page, with a single primary accession ID. For example, look at the gene page for MIX23: http://ctdbase.org/detail.go?type=gene&acc=131076
Here, gene MIX23 can be used for any of the listed species. The primary accession ID that identifies this page throughout the relational database is the first accession ID listed (here, 131076): it is highlighted in bold and typically (but not always!) corresponds to the human gene ID.
In downloadable chemical-gene files, we provide: --- chemical name --- chemical ID --- gene symbol --- primary accession ID --- organism for that specific interaction
allowing users to identify the gene and subsequent species in which the interaction was reported.
Thank you @ravila4. This is great! I noticed that the number of homologous genes in CTD url is not exactly same as that in mygene.info, but that's okay.
I am going to drop taxid 9823 (pig) in CTD genesets, since so many genes in this organism are missing. The other organisms' genesets are good enough after including homologous genes.
I did some spot checks comparing homologene data and CTD, and it looks like CTD provides a much longer list of species. For example, this gene: http://ctdbase.org/detail.go?type=gene&acc=290 vs https://www.ncbi.nlm.nih.gov/homologene?term=290%5BGene%20ID%5D. And as suspected, the pig gene (397520) is not in homologene.
Can you post some examples of the genes you found that don't match? I asked CTD whether they have a resource to convert from primary gene id to species-specific IDs.
I noticed that the number of homologous genes in CTD url is not exactly same as that in mygene.info, but that's okay.
Sorry, I misunderstood what you meant by this. Yes, I noticed the same thing. I was worried that perhaps the homolog gene from homologene might not match the ID that is in CTD is some cases. If there's no direct way to access the homologs list from CTD, we could write a web scraper as a solution for the pig data, and to check the accuracy of the conversions.
Let's see whether CTD has any API that allows us to get the homologene list easily. A web scraper works but will be slow. For example, pig data includes 1016 genes, only 11 of them match pig's taxid in mygene.info; so we have to run the scraper 1005 times to get the homologous genes for each of them.
@dongbohu Yes, I agree.
CTD support said they don't have a tool for converting between homolog species IDs
That's not good. I am reluctant to use the web scraper for a few reasons:
NCBI Gene IDs
section (which has homologene info on each CTD html page) only has gene IDs and organism scientific names (see the example of http://ctdbase.org/detail.go?type=gene&acc=131076). So we have to translate the scientific names back to taxid in order to find the gene ID that we really need. This will make the query even slower.I tend to drop pig
species for now and use whatever homologene information available in mygene.info to get the correct gene IDs. We will lose some genes in some genesets, but according to my statistics, it's acceptable.
@dongbohu I agree, let's drop pig
for now. Can we check whether any of these have a good number of genes that can be converted? They are also included in some of our other genesets:
"dog": {"taxid": "9615"},
"chicken": {"taxid": "9031"},
"horse": {"taxid": "9796"},
"chimpanzee": {"taxid": "9598"},
"frog": {"taxid": "8364"}
Edit: removed "mosquito", "rice" and "thale-cress", since they don't appear to be in the dataset.
Sure. I will test these species too.
@ravila4: 3702
(thale-cress), 39947
(rice) and 180454
are not found in CTD data file at all; 9598
(chimpanzee) and 9796
(horse) are not good because of gene missing. So I will add frog, chicken and dog organisms in CTD genesets.
@ravila4: I added the homologene and three new species. Please let me know if you have any other questions or concerns. Thanks.
@dongbohu It looks good to me. I'll go ahead and merge it. The only thing missing now is the new 'name', 'source', and 'description' fields. But we will do that in another PR.
This PR adds CTD geneset data plugin.