RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
https://arax.ncats.io/
MIT License

get "description" field for each node #81

Closed saramsey closed 6 years ago

saramsey commented 6 years ago

this work will be done as a feature request for UpdateNodesInfo.py

saramsey commented 6 years ago

@DeqingQu come see me when you have time, and we can discuss implementation of this feature.

saramsey commented 6 years ago

For some node types, the description field information is already available as a sub-field of the JSON object in "extended_info_json" (protein, microRNA). For the four ontology types that we have (anatomic_feature, biological_process, phenotypic_feature, and disease) we may be able to get a one-paragraph description for each node, from the EBI Ontology Lookup Service (OLS) REST API: https://www.ebi.ac.uk/ols/docs/api

Also please see the following for a code example of RESTfully querying the EBI OLS: https://github.com/RTXteam/RTX/blob/master/code/reasoningtool/kg-construction/QueryEBIOLS.py
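
A minimal sketch of how such a lookup could be assembled, offered as an illustration rather than as the code in QueryEBIOLS.py. The OLS term endpoint expects the term IRI to be URL-encoded twice, which matches the URLs that appear in the logs later in this thread. The helper names (`ols_term_url`, `description_from_term`) are hypothetical. The sketch also assumes that the OLS ontology id is the lowercased CURIE prefix and that the term JSON carries a `description` list; both assumptions should be checked against the OLS documentation.

```python
from urllib.parse import quote

OLS_API_BASE = "https://www.ebi.ac.uk/ols/api/ontologies"
OBO_PURL_PREFIX = "http://purl.obolibrary.org/obo/"


def ols_term_url(curie: str) -> str:
    """Build an OLS term URL for a CURIE such as 'HP:0006963'.

    Assumption: the OLS ontology id is the lowercased CURIE prefix
    (e.g. 'hp' for 'HP').
    """
    prefix, local_id = curie.split(":", 1)
    iri = f"{OBO_PURL_PREFIX}{prefix}_{local_id}"
    # The IRI is percent-encoded twice, so ':' becomes '%253A'.
    double_encoded = quote(quote(iri, safe=""), safe="")
    return f"{OLS_API_BASE}/{prefix.lower()}/terms/{double_encoded}"


def description_from_term(term: dict) -> str:
    """Pull a one-paragraph description out of a term JSON object.

    Assumption: 'description' is a list of strings; a missing or empty
    list is reported as 'UNKNOWN'.
    """
    descriptions = term.get("description") or []
    return " ".join(descriptions) if descriptions else "UNKNOWN"


if __name__ == "__main__":
    print(ols_term_url("HP:0006963"))
```

Keeping the URL construction and the JSON parsing as separate pure functions makes both easy to test without touching the network.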

saramsey commented 6 years ago

for microRNAs, it may be better to use the "comments" field from miRBase: http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000681

DeqingQu commented 6 years ago

Anatomy (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 22/916 = 2.4%

Phenotype (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 2221/10713 = 20.7%

MicroRNA (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 21/1695 = 1.2%

Pathway (done) Source: Reactome https://reactome.org/ContentService Unknown rate: 2/705 = 0.28%

Protein (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 7008/19318 = 36.2%

Disease (done) Source (DOID:xxxx): EBIOLS https://www.ebi.ac.uk/ols/index Source (OMIM:xxxx): OMIM https://www.omim.org/ Unknown rate: 6472/12472 = 51.89%

Biological Process (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 20/21139 = 0.09%

Chemical Substance (done) Source: MyChem http://mychem.info Unknown rate: 1108/2227 = 49.75%
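
The unknown rates above are simple ratios, so they can be rechecked from the counts. The following is a small sketch that recomputes each rate as a percentage. The counts are copied from the list above; the function name `unknown_rate` and the dictionary layout are illustrative only.

```python
def unknown_rate(unknown: int, total: int) -> float:
    """Return the share of nodes without a description, in percent."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * unknown / total


# (unknown, total) counts as reported in the comment above.
COUNTS = {
    "anatomical_entity": (22, 916),
    "phenotypic_feature": (2221, 10713),
    "microRNA": (21, 1695),
    "pathway": (2, 705),
    "protein": (7008, 19318),
    "disease": (6472, 12472),
    "biological_process": (20, 21139),
    "chemical_substance": (1108, 2227),
}

if __name__ == "__main__":
    for node_type, (unknown, total) in COUNTS.items():
        rate = unknown_rate(unknown, total)
        print(f"{node_type}: {unknown}/{total} = {rate:.2f}%")
```

Multiplying by 100 inside one helper avoids mixing fractions and percentages in the same report.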

saramsey commented 6 years ago

@DeqingQu can you please look into this? Can we catch the 502 error in QueryOMIM.py? See line 67 here for example: https://github.com/RTXteam/RTX/blob/master/code/reasoningtool/kg-construction/QueryReactome.py

https://www.ebi.ac.uk/ols/api/ontologies/hp/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FHP_0006963
Status code 404 for url: https://www.ebi.ac.uk/ols/api/ontologies/hp/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FHP_0006963
Status code 502 for URL: https://api.omim.org/api/entry?mimNumber=614747&include=text:description&format=json
Traceback (most recent call last):
  File "UpdateNodesInfo.py", line 376, in <module>
    UpdateNodesInfo.update_disease_nodes_desc()
  File "UpdateNodesInfo.py", line 304, in update_disease_nodes_desc
    node['desc'] = QueryOMIM().disease_mim_to_description(node_id)
  File "/mnt/data/sramsey/RTX/code/reasoningtool/kg-construction/QueryOMIMExtended.py", line 92, in disease_mim_to_description
    result_dict = r.json()
AttributeError: 'NoneType' object has no attribute 'json'
rt@ip-172-31-40-65:~/kg-construction$
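
The traceback shows `.json()` being called on `None` after the request helper reported a 502. One way to guard against that, sketched here under assumptions and not taken from QueryOMIM.py or QueryReactome.py, is to treat a missing response, a non-200 status, or an unparsable body as "UNKNOWN". The names `description_or_unknown`, `fetch`, and `extract` are hypothetical. `fetch` stands in for whatever helper performs the HTTP GET and returns either `None` or a response-like object with `status_code` and `json()`, as a `requests` response would.

```python
from typing import Any, Callable, Optional

UNKNOWN = "UNKNOWN"


def description_or_unknown(
    url: str,
    fetch: Callable[[str], Optional[Any]],
    extract: Callable[[Any], Optional[str]],
) -> str:
    """Return a description for `url`, or UNKNOWN on any failure.

    `fetch` and `extract` are hypothetical hooks: `fetch` performs the
    request, `extract` pulls the description text out of the parsed JSON.
    """
    response = fetch(url)
    # Guard against the failure seen above: the helper returned None.
    if response is None or response.status_code != 200:
        return UNKNOWN
    try:
        payload = response.json()
    except ValueError:
        # The body was not valid JSON (e.g. an HTML error page).
        return UNKNOWN
    try:
        text = extract(payload)
    except (KeyError, IndexError, TypeError):
        # The JSON did not have the expected shape.
        return UNKNOWN
    return text if text else UNKNOWN
```

With this shape, a transient 502 produces an "UNKNOWN" description for one node and the update loop continues with the next node.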

DeqingQu commented 6 years ago

502 Error is fixed.

saramsey commented 6 years ago

Hi Steve,

I fixed the 400 bug in QueryOMIMExtended.py. I don’t want to mess up the original QueryOMIM.py, so I copied it to a new file called QueryOMIMExtended.py. There are only two differences between QueryOMIMExtended.py and QueryOMIM.py:

1. The 400 bug is fixed in QueryOMIMExtended.py.
2. The requests cache is used in QueryOMIMExtended.py, whereas lru_cache is used in QueryOMIM.py.

The description field updating for chemical substance is done, but about 50% of the results are “UNKNOWN”.

I think it is fine to run UpdateNodesInfo.py now.

Best Regards, Deqing Qu
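
For readers unfamiliar with the second difference mentioned in the message above, here is a minimal sketch of in-process memoization with `functools.lru_cache` from the Python standard library. It is an illustration only, not the code in QueryOMIM.py; `make_cached_lookup` and `fake_lookup` are hypothetical names.

```python
from functools import lru_cache
from typing import Callable


def make_cached_lookup(lookup: Callable[[str], str], maxsize: int = 1024):
    """Wrap `lookup(node_id)` in an in-memory LRU cache.

    Repeated calls with the same node id reuse the stored result and
    do not invoke `lookup` again.
    """
    return lru_cache(maxsize=maxsize)(lookup)


if __name__ == "__main__":
    calls = []

    def fake_lookup(node_id: str) -> str:
        # Stand-in for a remote query; records each real invocation.
        calls.append(node_id)
        return f"description for {node_id}"

    cached = make_cached_lookup(fake_lookup)
    cached("OMIM:614747")
    cached("OMIM:614747")
    print(len(calls), cached.cache_info())
```

One design point to keep in mind: `lru_cache` stores whatever the function returns and lives only for the duration of the process, so a result produced after a transient server error would be reused until the process exits. A cache at the HTTP layer, such as the requests cache mentioned above, operates on responses and not on function results.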

saramsey commented 6 years ago

From latest KG (dated Friday 4/27), here are the statistics on completeness of the description fields for different node types (100% means every node has a description field that is not "unknown"):

| node type | percent with description |
|---|---|
| anatomical_entity | 97.6% |
| biological_process | 99.9% |
| chemical_substance | 50.2% |
| disease (OMIM:) | 42.3% |
| disease (DOID) | 60.5% |
| microRNA | 98.8% |
| protein | 63.7% |
| phenotypic_feature | 79.3% |
| pathway | TBD |

saramsey commented 6 years ago

dump of latest version of the KG (with the above descriptions) has been pushed to rtxkgdump.saramsey.org:

[Screenshot dated 2018-04-30, 10:08:31 AM]

edeutsch commented 6 years ago

great! Why did it get a lot smaller?