Devise strategy for ingesting ontologies into covid-19 KG

cmungall commented 4 years ago

Some of these guidelines are general and apply to any bio-KG but I have tried to tailor for the covid-19 KG

Guidelines

We should always use the official PURLs for download (see the ontology's page on http://obofoundry.org and look at the list of 'products')
- there is typically an obo and owl version, there is increasingly an obojson version
- for more on obojson and why it is an appropriate format for projects like this, see this post
  - the obo version may be semantically equivalent to the owl, or it may contain less. See below.
for ML/query purposes the most important aspect of an ontology is its relational graph. We also want at least class label and class category (for human readability and biolink compliance)
- note the relational graph is simply edges such as is-a, part-of, has-input, etc. This is a little complicated to parse if using the rdf export of an ontology, see this post on a proposed alternative rdf layering
- other parts of ontologies may be useful for other purposes. Extracting synonyms will be useful if the KG is to be used for text mining. Other properties will be useful if the KG is to be used as a general purpose KG
note that the default version of many ontologies has a more limited relational graph, in particular, it may be limited to exclude edges that connect to other ontologies or IDs, because historically applications assumed this. Some ontologies publish different products that include or exclude things to various degrees. This is not done consistently. Sometimes the obo version will be more limited; sometimes the obo is equivalent to the owl or json
- for GO, use go-plus (owl or json). This connects to CHEBI, UBERON, CL, ...
- for HP, the obo version is basic. The OWL version connects to GO, CHEBI, CL, etc
- for CHEBI, there are ongoing projects that connect CHEBI nodes to GO, HP, MONDO, etc (for roles)
- for MONDO, the obo version is basic. The OWL version has connections from diseases to mechanisms (e.g. GO MF,CC,BP), chemical exposures (indirectly via ECTO)
- in general it is best to use the base version of an ontology, as this has the inter-ontology edges, but does not include subsets of the external ontology as imports

Parsing

For python there are various choices of python parser

rdflib will parse the .owl for any ontology in OBO. However for reasons stated above extracting the relational graph is awkward. Fortunately kgx will do this for you
kgx - will already produce the format used in this project
ontobio - this consumes obojson as native. However, obojson is already optimized for uses like graph traversal
obonet is a handy lightweight obo format parser. However, obo format is limiting for the purposes stated above. the obo may be incomplete, and may lack a RG that connects across ontologies

Ontology pre-processing

If there is a need to pre-process an ontology ahead of time, then we recommend http://robot.obolibrary.org/ - note that OBO-compliant ontologies should have a certain amount of pre-processing done in advance. E.g. the ontology will be reasoned, and will be validated. However, we may still want to do things like extract a slice of an ontology. This is where robot can be used in a pipeline. It can also be remote-controlled via python.

A key ROBOT command is http://robot.obolibrary.org/extract

This allows us to bring in only the relevant portions of ontologies. We would create a seed file, that is a list of all ontology IDs used in the graph. Then we would extract a merged ontology using this seed (using BOT method).

Which ontologies?

Which ontologies should be ingested, and which portions? This depends ultimately on the application, and how it will be used. For a 'traditional' RDF KG we might simply try to ingest everything. Queries can always exclude stuff. However, downstream applications may make particular assumptions. For example, some applications may make assumptions that number of hops over a RG is meaningful. A node embedding algorithm may make certain assumptions about graph properties and hence graph random walks. the assumptions of application developed and in particular ML applications may not be aligned with assumptions of ontology developers. (see this post on KG design patterns ). Care should be taken when ingesting an ontology that the subset brought in is meaningful. We may want certain post-processing steps e.g. to implement shortcut relations that bypass non-informative (in the information theory sense) intermediate nodes.

I am not familiar enough with the N2V algorithm to give more specific recommendations. But I note that Rob H's group has had success with node embedding over graphs of OBO ontologies + annotations, it's worth looking at this (caveat: when I say annotations I mean in the traditional sense of associations to genes etc. I am not sure that bringing in owl annotations as opa2vec does is always useful)

Some specific ontology recommendations. Generally an ontology is not useful unless it connects existing data elements, so we should avoid ingesting the kitchen sink here. Though some ontologies like CHEBI are more 'database-like'. We should have separate tickets for each of these, I am just making general comments here

HPO will be necessary for covid symptoms. Note we have an obonet obo format ingest (#55) but this lacks linkages from HPO to other ontologies as noted above
- we lack an ingest for HPO annotations. This may be less useful for COVID-19 since these are for genetic diseases. However, we should really manually curate annotations for COVID-19 and diseases caused by related viruses, e.g. SARS-CoV, MERS
- the RG used in the owl/json HP is a bit abstruse due to the subq pattern, we need shortcut edges here
CHEBI will be useful for information on small molecules, whether they are drugs, or pathway products or metabolites.
- there will be some redundancy with drugcentral (#5)
- CHEBI contains useful information on chemical properties but it's not always clear how to extract this
- there are separate projects for connecting CHEBI to other ontologies and entities, see https://github.com/obophenotype/chiro/issues/19
GO contains descriptors for the roles and locations of both viral genes and genes in the host
- there is already a ticket #19 for GO annotations for viral genes
- note there will soon be GO-CAMs for SARS-CoV-2
CL may be useful for datasets such as https://www.covid19cellatlas.org/
Uberon may not be so useful for COVID-19. To be clear, it's certainly important to describe things such as site of infection (respiratory tract, upper vs lower, etc) and to do this using Uberon; however, the anatomical linkages in Uberon may not be so useful for ML traversal
Ditto for NCBITaxon (but we should use NCBITaxon CURIEs for things such as the nodes for human, SARS-CoV-2)
Ditto for Mondo

PaulNSchofield commented 4 years ago

The Lungmap ontology might be useful. Its not perfect but has a very granular set of terms for anatomical and cellular adult lung components.

matentzn commented 4 years ago

Happy to help with any ontology pre-processing necessary. I think we should not only maintain a list of imports, but also a list of object (and annotation) properties, and then materialise those between names only (A sub R some B-axioms). This is in particular important for phenotypes (note that upheno2 already has the materialised has affected entity and has phenotypic orthologue relations, if cross-species integration is relevant). There is also the question of whether vague axioms like xrefs should be normalised to IRI format first (I have sparql update queries ready for that). But even if they are normalised, its unclear whether they would confuse or support ML approaches. Maybe you want something like a COVID focused Monarch.owl?

cmungall commented 4 years ago

There is also the question of whether vague axioms like xrefs should be normalised to IRI format first (I have sparql update queries ready for that). But even if they are normalised, its unclear whether they would confuse or support ML approaches

Good point. In theory methods like OPA2Vec should work with vague non-logical axioms. However, I think it makes most sense to use these upstream in graph normalization.

I don't know which strategy kg-hub intends to use (cc @justaddcoffee @deepakunni3)

It is the responsibility of each ingestor to produce edges and nodes using the preferred ID space of this project OR
each ingestor uses the ID space that is most convenient, and EITHER
- there is a clique merge step that happens when making the final KG - leveraging xrefs and/or equivalence axioms (see https://douroucouli.wordpress.com/2019/05/27/never-mind-the-logix-taming-the-semantic-anarchy-of-mappings-in-ontologie/) OR
- the N2V graph walk step in aware of equivalence axioms and can walk them as if they are length 0, or just let these fall out naturally

cmungall commented 4 years ago

@PaulNSchofield - thanks for the suggestion. Do you know is there much data (sc or otherwise) annotated to the lungmap ontology?

justaddcoffee commented 4 years ago

It is the responsibility of each ingestor to produce edges and nodes using the preferred ID space of this project OR

My thinking was that we'd do this, at least at first - ingestors must use the preferred ID space. I've gone to a bit of trouble for example to find Uniprot IDs for genes we ingest in a few kg_covid_19 ingests.

It'd simplify the transform a lot to not have to worry about this, of course. A clique-merge step would be really handy, but I don't think we have any functionality like this now. Does KGX provide anything like this?

cmungall commented 4 years ago

kgx does have clique merge. it does add a bit of pipeline complexity though. E.g. if you don't know in advance what species of genes you will ingest, then the clique merge part of the pipeline will need a graph with equivalence edges between all genes/proteins in all species, which will be large. It also opens us up to global multiclique errors. We should probably open a new ticket for this

deepakunni3 commented 4 years ago

Yes, KGX does support clique merge, but I agree with @cmungall . It adds complexity to the graph merge operation. It would be much easier to handle this at the time of transformation, even if that involves a few extra parsing steps. We can iterate over each data source so that it doesn't have to happen on the first iteration. Also, this ensures that each transformed data source, by itself, is usable without relying on additional transformations downstream.

PaulNSchofield commented 4 years ago

@PaulNSchofield - thanks for the suggestion. Do you know is there much data (sc or otherwise) annotated to the lungmap ontology?

@cmungall - There is a resource on https://lungmap.net/ . Quite an interesting set of 'omics data. They mainly focus on lung development so lots of infant and juvenile data but also some adult. Also see https://research.cchmc.org/pbge/LGEAOntologyBrowser/#/help but not sure how much these overlap. Im not aware of anything outside this project. There's mouse stuff too for which they use their mouse lung ontology. Summary in https://www.atsjournals.org/doi/abs/10.1164/ajrccm-conference.2018.197.1_MeetingAbstracts.A6126. Might be good to get Susan Wert from Cincinnati on board?

justaddcoffee commented 3 years ago

I'm wondering if we can close this ticket - @deepakunni3's ontology ingest I think ticks most of @cmungall's boxes up top. (Rereading this thread, it's a very useful reference - possibly could go on a wiki page.)

justaddcoffee commented 3 years ago

I'm going to close this ticket, although we maybe could at some point capture Chris's advice above in a wiki page somewhere

deepakunni3 commented 3 years ago

Yes, let's capture this in documentation. Especially for KG-Hub. We already follow most of the recommendations above. But its good to formalize the process so that it's used throughout our KG efforts :)

Knowledge-Graph-Hub / kg-covid-19