RTXteam / RTX

Software repo for Team Expander Agent (Oregon State U., Institute for Systems Biology, and Penn State U.)
MIT License

get "description" field for each node #81

Closed saramsey closed 6 years ago

saramsey commented 6 years ago

this work will be done as a feature request for UpdateNodesInfo.py

saramsey commented 6 years ago

@DeqingQu come see me when you have time, and we can discuss implementation of this feature.

saramsey commented 6 years ago

For some node types, the description field information is already available as a sub-field of the JSON object in "extended_info_json" (protein, microRNA). For the four ontology types that we have (anatomic_feature, biological_process, phenotypic_feature, and disease) we may be able to get a one-paragraph description for each node, from the EBI Ontology Lookup Service (OLS) REST API: https://www.ebi.ac.uk/ols/docs/api

Also please see the following for a code example of RESTfully querying the EBI OLS: https://github.com/RTXteam/RTX/blob/master/code/reasoningtool/kg-construction/QueryEBIOLS.py
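A minimal sketch of such an OLS lookup follows, for illustration only; it is not the code in QueryEBIOLS.py. The function names are hypothetical. The URL layout (term IRI URL-encoded twice, under `/ontologies/<ontology>/terms/`) follows the OLS request URL quoted later in this thread, and the assumption that the term JSON carries a `description` list of strings should be checked against the OLS API documentation.

```python
import json
import urllib.parse
import urllib.request

OLS_API_BASE = "https://www.ebi.ac.uk/ols/api/ontologies"
OBO_PURL_BASE = "http://purl.obolibrary.org/obo/"


def build_ols_term_url(ontology, curie):
    """Build an OLS term URL for a CURIE such as 'HP:0006963' (hypothetical helper)."""
    iri = OBO_PURL_BASE + curie.replace(":", "_")
    # The term IRI is URL-encoded twice, matching the request URL seen in this thread.
    once = urllib.parse.quote(iri, safe="")
    twice = urllib.parse.quote(once, safe="")
    return f"{OLS_API_BASE}/{ontology}/terms/{twice}"


def extract_description(term_json):
    """Pull a description out of a term record, or return 'UNKNOWN'."""
    # Assumption: 'description' is a list of strings (or occasionally a string).
    desc = term_json.get("description")
    if isinstance(desc, list) and desc:
        return " ".join(desc)
    if isinstance(desc, str) and desc:
        return desc
    return "UNKNOWN"


def fetch_description(ontology, curie, timeout=10):
    """Query OLS over HTTP; any network, HTTP, or JSON error yields 'UNKNOWN'."""
    url = build_ols_term_url(ontology, curie)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return extract_description(json.load(resp))
    except (OSError, ValueError):
        # HTTPError/URLError and timeouts are OSError; bad JSON is ValueError.
        return "UNKNOWN"
```

Keeping URL construction and JSON parsing separate from the network call makes both testable offline; only `fetch_description` touches the network.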

saramsey commented 6 years ago

For microRNAs, it may be better to use the "comments" field from miRBase: http://www.mirbase.org/cgi-bin/mirna_entry.pl?acc=MI0000681

DeqingQu commented 6 years ago

Anatomy (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 22/916 = 2.4%

Phenotype (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 2221/10713 = 20.7%

MicroRNA (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 21/1695 = 1.2%

Pathway (done) Source: Reactome https://reactome.org/ContentService Unknown rate: 2/705 = 0.28%

Protein (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 7008/19318 = 36.3%

Disease (done) Source (DOID:xxxx): EBIOLS https://www.ebi.ac.uk/ols/index Source (OMIM:xxxx): OMIM https://www.omim.org/ Unknown rate: 6472/12472 = 51.89%

Biological Process (done) Source: EBIOLS https://www.ebi.ac.uk/ols/index Unknown rate: 20/21139 = 0.09%

Chemical Substance (done) Source: MyChem http://mychem.info Unknown rate: 1108/2227 = 49.75%
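As a quick arithmetic check, the unknown rates can be recomputed from the counts quoted in the comment above. This is a sketch only; the dictionary keys are illustrative labels, not identifiers from the RTX code.

```python
# (unknown, total) node counts as quoted in the comment above.
counts = {
    "anatomy": (22, 916),
    "phenotype": (2221, 10713),
    "microRNA": (21, 1695),
    "pathway": (2, 705),
    "protein": (7008, 19318),
    "disease": (6472, 12472),
    "biological_process": (20, 21139),
    "chemical_substance": (1108, 2227),
}


def unknown_rate_percent(unknown, total):
    """Share of nodes without a description, as a percentage."""
    return 100.0 * unknown / total


rates = {name: unknown_rate_percent(u, t) for name, (u, t) in counts.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f}% unknown")
```

Multiplying by 100 before formatting avoids reporting a raw fraction as if it were a percentage.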

saramsey commented 6 years ago

@DeqingQu can you please look into this? Can we catch the 502 error in QueryOMIM.py? See line 67 here for example: https://github.com/RTXteam/RTX/blob/master/code/reasoningtool/kg-construction/QueryReactome.py

https://www.ebi.ac.uk/ols/api/ontologies/hp/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FHP_0006963
Status code 404 for url: https://www.ebi.ac.uk/ols/api/ontologies/hp/terms/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FHP_0006963
Status code 502 for URL: https://api.omim.org/api/entry?mimNumber=614747&include=text:description&format=json
Traceback (most recent call last):
  File "UpdateNodesInfo.py", line 376, in <module>
    UpdateNodesInfo.update_disease_nodes_desc()
  File "UpdateNodesInfo.py", line 304, in update_disease_nodes_desc
    node['desc'] = QueryOMIM().disease_mim_to_description(node_id)
  File "/mnt/data/sramsey/RTX/code/reasoningtool/kg-construction/QueryOMIMExtended.py", line 92, in disease_mim_to_description
    result_dict = r.json()
AttributeError: 'NoneType' object has no attribute 'json'
rt@ip-172-31-40-65:~/kg-construction$
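The traceback shows `r.json()` being called on `None` after the request failed. One way to guard against that is sketched below; this is not the actual fix applied in QueryOMIMExtended.py. The helper names are hypothetical, and the response object is assumed to expose `status_code` and `json()` in the style of the requests library.

```python
def safe_json(response):
    """Return the decoded JSON body, or None when the response is unusable."""
    if response is None:
        # The request helper returned nothing, e.g. after a 502.
        return None
    if getattr(response, "status_code", None) != 200:
        return None
    try:
        return response.json()
    except ValueError:
        # The body was not valid JSON.
        return None


def description_or_unknown(response, extract, default="UNKNOWN"):
    """Apply `extract` to the JSON payload, falling back to `default`.

    `extract` is a caller-supplied function that digs the description out of
    the payload; its shape depends on the remote API and is not assumed here.
    """
    payload = safe_json(response)
    if payload is None:
        return default
    try:
        text = extract(payload)
    except (KeyError, IndexError, TypeError):
        return default
    return text if text else default
```

With a guard like this, a failed request marks the node's description as "UNKNOWN" instead of aborting the whole update run.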

DeqingQu commented 6 years ago

502 Error is fixed.

saramsey commented 6 years ago

Hi Steve,

I fixed the 400 bug in QueryOMIMExtended.py. I didn't want to disturb the original QueryOMIM.py, so I copied it to a new file called QueryOMIMExtended.py. There are only two differences between QueryOMIMExtended.py and QueryOMIM.py:
1. The 400 bug is fixed in QueryOMIMExtended.py.
2. QueryOMIMExtended.py uses the requests cache, whereas QueryOMIM.py uses lru_cache.

The description field updating for chemical substance is done, but about 50% of the results are “UNKNOWN”.

I think it is fine to run UpdateNodesInfo.py now.

Best Regards, Deqing Qu
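For readers unfamiliar with the second difference mentioned in the email: `functools.lru_cache` memoizes a function's return values in memory for the lifetime of the process, whereas an HTTP-level requests cache stores responses so they can be reused across runs. A minimal sketch of the `lru_cache` side follows; the function and its return value are placeholders, not the real QueryOMIM.py code.

```python
from functools import lru_cache

# Records every call that actually reaches the function body (cache misses).
call_log = []


@lru_cache(maxsize=1024)
def lookup_description(mim_number):
    """Placeholder for a remote query; a real version would call the OMIM API."""
    call_log.append(mim_number)
    return f"description for OMIM:{mim_number}"
```

Calling `lookup_description(614747)` twice executes the body once; `lookup_description.cache_info()` reports the hit. The cache is lost when the process exits, which is one reason to prefer a persistent requests-level cache for long update jobs.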

saramsey commented 6 years ago

From latest KG (dated Friday 4/27), here are the statistics on completeness of the description fields for different node types (100% means every node has a description field that is not "unknown"):

| node type | percent with description |
|---|---|
| anatomical_entity | 97.6% |
| biological_process | 99.9% |
| chemical_substance | 50.2% |
| disease (OMIM:) | 42.3% |
| disease (DOID) | 60.5% |
| microRNA | 98.8% |
| protein | 63.7% |
| phenotypic_feature | 79.3% |
| pathway | TBD |

saramsey commented 6 years ago

dump of latest version of the KG (with the above descriptions) has been pushed to rtxkgdump.saramsey.org:

[Screenshot, 2018-04-30 at 10:08:31 AM]

edeutsch commented 6 years ago

Great! Why did it get a lot smaller?