Closed ecwood closed 4 years ago
Thanks, @ericawood for the issue report. I am assuming that Michael meant GO evidence codes, is that right? For GO gene annotations, I believe this is doable.
From what I understood, Michael thought the GO could be helpful in linking the leaf nodes of different ontologies. @TriageDr , would you like to clarify?
Currently, GO supports cross mappings to numerous other pathway ontologies -- the download files can be found here:
http://geneontology.org/docs/download-mappings/ (Cross-references of external classification systems to GO). Many Gene Ontology terms are cross-referenced to corresponding concepts from a number of external vocabularies, including Enzyme Commission numbers, KEGG, Reactome Pathways, and Wikipedia. GO as an organization maintains a set of these cross mappings.
Other organizations maintain cross mappings to more specific data sources (scroll down on the linked page to see the externally managed GO cross-references).
Key Point of Improvement: Because it looks like RTX downloaded the GO, REACTOME, KEGG, NCIT, and RHEA pathway ontologies from pathwaycommons, there are currently no concept cross-mappings between them. Without these cross-mappings, no inferences can be made across ontologies (GO biological function term --> REACTOME term). This is no bueno, as we cannot connect concepts/edges across the differing symmetries of the ontologies to gain confidence in any assertion. We need the cross mappings.
Ideas/Next Steps:
1) Download a complete version of GO, with all edge provenance info (PubMed IDs, ECO codes, etc.). This will be the central pathway data source (i.e., the one with the provenance) and ontology for gene pathway level queries.
2) Download all cross mappings of GO terms (minimally the GO-maintained cross-mappings above -- Rhea, Reactome, MetaCyc, KEGG, EC, EWAG-BBD).
3) With a complete GO download + the cross mappings to other pathway ontologies, we can then use the edges from GO + any cross mapped edges as increased support for a given inference!
4) Final comment/question on pathway commons: I know REACTOME has edge provenance (PubMed IDs, ECO codes), so it may be worth your time to either A) pull in edge provenance data from pathwaycommons if it exists, or B) if no provenance exists in pathwaycommons, skip it and go straight to the source data for REACTOME (and others) to ensure we are getting provenance-rich ontology downloads.
If steps 1-4 are completed, I think we'll have some really awesome cross-mapped, provenance-rich pathway data sources/ontologies that we can do some great reasoning with!
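As a rough illustration of step 2, here is a minimal sketch of turning a GO cross-mapping file into (external ID, GO ID) pairs. It assumes the mapping files use a one-mapping-per-line layout of the form `external-id [name] > GO:term name ; GO:id`, with `!` marking comment lines; that layout, the sample lines, and the function name `parse_mappings` are assumptions for illustration, not something confirmed in this thread.

```python
import re
from typing import Iterator, Tuple

# Hypothetical sample in the assumed "external ID > GO:name ; GO:id" layout.
SAMPLE = """\
! comment lines start with an exclamation mark
EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022
Reactome:R-HSA-0000000 Hypothetical pathway name > GO:hypothetical process ; GO:0000000
"""

# First token is the external CURIE; the GO ID is the last field after ';'.
LINE_RE = re.compile(r"^(?P<ext>\S+).*?\s>\s.*;\s*(?P<go>GO:\d{7})\s*$")


def parse_mappings(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (external_curie, go_curie) pairs, skipping comments and blanks."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        match = LINE_RE.match(line)
        if match:
            yield match.group("ext"), match.group("go")


pairs = list(parse_mappings(SAMPLE))
for ext_id, go_id in pairs:
    print(ext_id, "->", go_id)
```

Each pair could then become a cross-reference edge between the external node and the GO node, which is what would let support from one ontology carry over to another.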
Any questions? Please reach out
Sincerely,
Michael Patton
In reply to Erica Wood's comment of Thursday, June 18, 2020 on RTXteam/RTX issue #838, "ETL GO into KG2".
Since it seems like the main goal of bringing in GO is the edge provenance (and the cross references, which are not present in the go-plus.owl file), I think there might be an easier way than ETLing the go-plus.owl file. Through multi_owl_to_json_kg.py, 352059 edges and 50298 nodes are already being brought in from go-plus.owl, so we don't have to start from scratch. The PMIDs in go-plus.owl were assigned as node publications by multi_owl_to_json_kg.py. GO does have edge provenance, and it is easily accessible through their annotations download: https://www.ebi.ac.uk/QuickGO/annotations?downloadLimit=100&reference=PMID . If we can find a way to download a larger quantity of them (currently the max is 50,000 out of the 5,651,287 PMID annotations), this could be a fast way to get GO edge provenance.
Was hoping that there might be PubMed ID type provenance information for gene GO annotations in our `ensembl_genes_homo_sapiens.json` input file, but it doesn't seem to be there. So guess we'll go to the Gene Ontology to try to get a data dump that we can ETL.
For the bulk download, I used the `goa_human.gpa.gz` download from ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/HUMAN/ . It seems to work well. Below is an image of some sample edges I extracted from the data set. Please let me know if you have any thoughts about this @saramsey @TriageDr .
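For readers without the image, here is a hedged sketch of how edges with provenance might be extracted from a GPAD-style file such as `goa_human.gpa.gz`. It assumes the 12-column, tab-separated GPAD 1.1 layout; the sample rows, the field names in `GPAD_COLUMNS`, and the function `gpad_to_edges` are hypothetical and are not taken from the actual KG2 ETL script.

```python
import csv
import io

# Assumed GPAD 1.1 column order (names are illustrative).
GPAD_COLUMNS = [
    "db", "db_object_id", "qualifier", "go_id", "references",
    "evidence_code", "with_from", "interacting_taxon", "date",
    "assigned_by", "annotation_extension", "annotation_properties",
]

# Hypothetical rows; identifiers are placeholders, not real annotations.
SAMPLE = (
    "!gpa-version: 1.1\n"
    "UniProtKB\tP00000\tinvolved_in\tGO:0000000\tPMID:0000000\t"
    "ECO:0000000\t\t\t20200101\tUniProt\t\tgo_evidence=IDA\n"
    "UniProtKB\tP00001\tenables\tGO:0000001\tGO_REF:0000000\t"
    "ECO:0000001\t\t\t20200101\tUniProt\t\tgo_evidence=IEA\n"
)


def gpad_to_edges(handle):
    """Yield one edge dict per annotation row, skipping '!' header lines."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!"):
            continue
        rec = dict(zip(GPAD_COLUMNS, row))
        refs = rec["references"].split("|")  # references are pipe-separated
        yield {
            "subject": f'{rec["db"]}:{rec["db_object_id"]}',
            "predicate": f'GO:{rec["qualifier"]}',
            "object": rec["go_id"],
            "publications": [r for r in refs if r.startswith("PMID:")],
            "evidence": rec["evidence_code"],
        }


edges = list(gpad_to_edges(io.StringIO(SAMPLE)))
for edge in edges:
    print(edge)
```

For the real file, the handle could be opened with `gzip.open(path, "rt")` instead of `io.StringIO`. Negated qualifiers (such as a `NOT|` prefix) are not handled in this sketch.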
This is what the report file looks like:
Hi @ericawood
Thank you for the excellent summary!
A few minor points to consider:
1. `GO_REF` will need to be defined in `curies-to-urls-map.yaml`.
2. `GO:` predicates will need to be defined (and mapped to biolink predicates) in the `predicate-remap.yaml` file.
3. `GO:` relation CURIEs seem to be mapped to an odd base URL. For example, it is not clear why the relation CURIE `GO:enables` is being mapped to `https://identifiers.org/enables`. While it is true that we are not generally fussy about relation URIs being resolvable, we at least want the relation URI to be interpretable as to what type of knowledge source the relation comes from. And in this case, if one sees `https://identifiers.org/enables`, one has no idea what ontology `enables` is from. I wonder if we should re-map GO relation CURIEs to have the `GOREL` CURIE prefix.

Thank you for driving this issue forward!
Cheers, Steve
@saramsey Sorry for the late reply, I have done some digging, particularly into points 1 and 3:
1. `GO_REF: "https://identifiers.org/GO_REF:"` will be added to `curies-to-urls-map.yaml`.
3. > For example, it is not clear why the relation CURIE GO:enables is being mapped to https://identifiers.org/enables

   In `kg2_util.py`, `urllib.parse.urljoin(relation_iri_prefix, predicate_label_to_use)` strips away the `go:` in `https://identifiers.org/go:`. (I modeled my code after the NCBIGene and Ensembl scripts, but they use biolink relation CURIEs, so this is not an issue there.) It is actually very helpful that you pointed this out, because it appears to be an issue with HMDB and DRUGBANK as well. I do not know how to fix it yet. Do you have any suggestions?
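The `urljoin` behavior described here can be reproduced in isolation. The snippet below is a minimal sketch using the prefix and label from this thread. The variable names are illustrative, and the plain concatenation shown for comparison is only one possible alternative, not necessarily the fix that was adopted in the codebase.

```python
from urllib.parse import urljoin

prefix = "https://identifiers.org/go:"
label = "enables"

# urljoin resolves the label as a relative reference, so it replaces the
# last path segment of the base ("go:") instead of appending to it.
joined = urljoin(prefix, label)

# Plain string concatenation keeps the prefix intact.
concatenated = prefix + label

print(joined)        # https://identifiers.org/enables
print(concatenated)  # https://identifiers.org/go:enables
```

This follows from standard relative-URL resolution: everything after the final `/` in the base is treated as a file-like segment and dropped before the relative part is merged in.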
I think that's a bug. I'd like to fix it. But I want to be sure I'm fixing it in the correct branch. When the dust settles from the great branch merge, please let me know which branch I should fix this in.
For sure, using `urllib.parse.urljoin` is wrong where I was using it. I am testing out a possible fix now (see #1003).
Commit 1e614a3 fixes a missing curly brace
Thank you for that! (I had a general feeling that something was missing but couldn't pin down what!)
This seems to have worked (in KG2.3.0):
Cypher:
match (n {id: 'UniProtKB:B5MD39'})-[r]->(m {id: 'GO:0006508'}) return r.edge_label, r.publications
And in `kg2-go-annotations.json`:
Closing out this issue, but see follow-up issue #1034
UAB requested that we ETL GO into KG2 to increase relationships between equivalent nodes, making their logic easier.