Knowledge-Graph-Hub / kg-covid-19

An instance of KG Hub to produce a knowledge graph for COVID-19 response.
https://github.com/Knowledge-Graph-Hub/kg-covid-19/wiki
BSD 3-Clause "New" or "Revised" License

Devise strategy for ingesting ontologies into covid-19 KG #71

Closed: cmungall closed this issue 3 years ago

cmungall commented 4 years ago

Some of these guidelines are general and apply to any bio-KG, but I have tried to tailor them for the covid-19 KG.

Guidelines

Parsing

For Python there are various choices of ontology parser.
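For illustration, a minimal sketch using obonet, one such parser (pronto is another option; the GO URL here is just an example):

```python
import obonet  # one of several Python ontology parsers; pronto is another option

# Load an OBO file into a networkx MultiDiGraph (the GO URL is illustrative).
graph = obonet.read_obo("http://purl.obolibrary.org/obo/go.obo")

print(len(graph))                          # number of terms
print(graph.nodes["GO:0008150"]["name"])   # "biological_process"
# Edges run child -> parent and are keyed by the relation (is_a, part_of, ...).
```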

Ontology pre-processing

If there is a need to pre-process an ontology ahead of time, then we recommend http://robot.obolibrary.org/ - note that OBO-compliant ontologies should have a certain amount of pre-processing done in advance: e.g. the ontology will already be reasoned and validated. However, we may still want to do things like extract a slice of an ontology; this is where ROBOT can be used in a pipeline. It can also be remote-controlled via Python.

A key ROBOT command is http://robot.obolibrary.org/extract

This allows us to bring in only the relevant portions of ontologies. We would create a seed file, i.e. a list of all ontology term IDs used in the graph, then extract a merged ontology module using this seed (using the BOT method), as in the sketch below.
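A minimal sketch of driving this step from Python (a hedged example: the file names are hypothetical, and seed.txt would hold one term ID per line):

```python
import subprocess

# Extract a BOT module around a seed set by shelling out to ROBOT.
# File names are hypothetical; seed.txt holds one term ID per line.
subprocess.run(
    [
        "robot", "extract",
        "--method", "BOT",
        "--input", "hp.owl",
        "--term-file", "seed.txt",
        "--output", "hp_slice.owl",
    ],
    check=True,
)
```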

Which ontologies?

Which ontologies should be ingested, and which portions? This ultimately depends on the application and how it will be used. For a 'traditional' RDF KG we might simply try to ingest everything; queries can always exclude what they don't need. However, downstream applications may make particular assumptions. For example, some applications may assume that the number of hops over a KG is meaningful. A node embedding algorithm may make certain assumptions about graph properties and hence about graph random walks. The assumptions of application developers, and in particular of ML applications, may not be aligned with the assumptions of ontology developers (see this post on KG design patterns).

Care should be taken when ingesting an ontology that the subset brought in is meaningful. We may want certain post-processing steps, e.g. to implement shortcut relations that bypass non-informative (in the information-theoretic sense) intermediate nodes, as in the sketch below.
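A toy illustration of the shortcut idea (the add_shortcuts function and the choice of 'uninformative' nodes are assumptions for illustration, not project code):

```python
import networkx as nx

def add_shortcuts(graph: nx.DiGraph, uninformative: set) -> nx.DiGraph:
    """Bypass nodes judged non-informative: wire each predecessor directly
    to each successor, then drop the intermediate node. A real pipeline
    would also decide how to compose the edge labels along the way."""
    g = graph.copy()
    for node in uninformative:
        if node not in g:
            continue
        for pred in list(g.predecessors(node)):
            for succ in list(g.successors(node)):
                if pred != succ:
                    g.add_edge(pred, succ, shortcut_via=node)
        g.remove_node(node)
    return g
```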

I am not familiar enough with the N2V (node2vec) algorithm to give more specific recommendations. But I note that Rob H's group has had success with node embedding over graphs of OBO ontologies + annotations; it's worth looking at this (caveat: when I say annotations I mean it in the traditional sense of associations to genes etc.; I am not sure that bringing in OWL annotations, as opa2vec does, is always useful).

Some specific ontology recommendations: generally, an ontology is not useful unless it connects existing data elements, so we should avoid ingesting the kitchen sink here (though some ontologies like CHEBI are more 'database-like'). We should have separate tickets for each of these; I am just making general comments here.

PaulNSchofield commented 4 years ago

The Lungmap ontology might be useful. It's not perfect, but it has a very granular set of terms for anatomical and cellular adult lung components.

matentzn commented 4 years ago

Happy to help with any ontology pre-processing necessary. I think we should not only maintain a list of imports, but also a list of object (and annotation) properties, and then materialise those between named classes only (A SubClassOf R some B axioms). This is particularly important for phenotypes (note that upheno2 already has the materialised 'has affected entity' and 'has phenotypic orthologue' relations, if cross-species integration is relevant). There is also the question of whether vague axioms like xrefs should be normalised to IRI format first (I have SPARQL UPDATE queries ready for that; a sketch of the idea follows below). But even if they are normalised, it's unclear whether they would confuse or support ML approaches. Maybe you want something like a COVID-focused Monarch.owl?
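For illustration, a minimal sketch of the kind of normalisation meant here, using rdflib (the CURIE-to-IRI rewrite rule is an assumption that only holds for OBO-style prefixes):

```python
import rdflib

g = rdflib.Graph()
g.parse("ont.owl", format="xml")  # hypothetical input file

# Turn string xrefs like "MESH:D012345" into skos:exactMatch triples with
# OBO-style IRIs. The CURIE-to-IRI rule below (swap ':' for '_' under the
# OBO PURL base) is an assumption that only holds for OBO-registered prefixes.
g.update("""
    PREFIX oboInOwl: <http://www.geneontology.org/formats/oboInOwl#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    INSERT { ?cls skos:exactMatch ?iri }
    WHERE {
        ?cls oboInOwl:hasDbXref ?xref .
        FILTER(isLiteral(?xref) && REGEX(STR(?xref), "^[A-Za-z]+:[A-Za-z0-9]+$"))
        BIND(IRI(CONCAT("http://purl.obolibrary.org/obo/",
                        REPLACE(STR(?xref), ":", "_"))) AS ?iri)
    }
""")
g.serialize("ont_normalised.owl", format="xml")
```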

cmungall commented 4 years ago

> There is also the question of whether vague axioms like xrefs should be normalised to IRI format first (I have SPARQL UPDATE queries ready for that). But even if they are normalised, it's unclear whether they would confuse or support ML approaches.

Good point. In theory methods like OPA2Vec should work with vague non-logical axioms. However, I think it makes most sense to use these upstream in graph normalization.

I don't know which strategy kg-hub intends to use (cc @justaddcoffee @deepakunni3)

  1. It is the responsibility of each ingestor to produce edges and nodes using the preferred ID space of this project, OR
  2. each ingestor uses the ID space that is most convenient, and ID normalization to the preferred space (e.g. via a clique merge) happens downstream.
cmungall commented 4 years ago

@PaulNSchofield - thanks for the suggestion. Do you know whether there is much data (single-cell or otherwise) annotated to the lungmap ontology?

justaddcoffee commented 4 years ago

> It is the responsibility of each ingestor to produce edges and nodes using the preferred ID space of this project

My thinking was that we'd do this, at least at first - ingestors must use the preferred ID space. I've gone to a bit of trouble, for example, to find UniProt IDs for genes we ingest in a few kg_covid_19 ingests.

It'd simplify the transform a lot to not have to worry about this, of course. A clique-merge step would be really handy, but I don't think we have any functionality like this now. Does KGX provide anything like this?

cmungall commented 4 years ago

KGX does have clique merge. It does add a bit of pipeline complexity though: e.g. if you don't know in advance which species of genes you will ingest, then the clique-merge part of the pipeline will need a graph with equivalence edges between all genes/proteins in all species, which will be large. It also opens us up to global multiclique errors. We should probably open a new ticket for this.
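For illustration, a toy sketch of the clique-merge idea (the prefix priority order and example CURIEs are assumptions, not KGX's implementation):

```python
import networkx as nx

# Preferred ID prefixes, highest priority first (an assumed ordering).
PREFIX_PRIORITY = ["UniProtKB", "NCBIGene", "ENSEMBL"]

def _rank(curie: str) -> int:
    """Lower rank = more preferred prefix; unknown prefixes sort last."""
    prefix = curie.split(":")[0]
    return (PREFIX_PRIORITY.index(prefix)
            if prefix in PREFIX_PRIORITY else len(PREFIX_PRIORITY))

def merge_cliques(equivalence_edges):
    """Treat equivalence edges as an undirected graph, take each connected
    component as a clique, and elect the highest-priority member as the
    canonical ID. Note that one bad equivalence edge fuses two cliques
    into one: the multiclique-error risk mentioned above."""
    g = nx.Graph()
    g.add_edges_from(equivalence_edges)
    mapping = {}
    for clique in nx.connected_components(g):
        leader = min(clique, key=_rank)
        mapping.update({node: leader for node in clique})
    return mapping

# e.g. merge_cliques([("NCBIGene:43740568", "UniProtKB:P0DTC2")])
# maps both IDs to "UniProtKB:P0DTC2"
```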

deepakunni3 commented 4 years ago

Yes, KGX does support clique merge, but I agree with @cmungall . It adds complexity to the graph merge operation. It would be much easier to handle this at the time of transformation, even if that involves a few extra parsing steps. We can iterate over each data source so that it doesn't have to happen on the first iteration. Also, this ensures that each transformed data source, by itself, is usable without relying on additional transformations downstream.
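By way of contrast with merge-time clique merging, a minimal sketch of what transform-time normalization can look like (the PREFERRED_ID mapping and CURIEs are illustrative assumptions, not kg-covid-19 code):

```python
# A toy transform-time normalizer: each ingest ships with a precomputed
# mapping into the project's preferred ID space and applies it as rows
# are transformed. The mapping pair below is illustrative only.
PREFERRED_ID = {"NCBIGene:43740568": "UniProtKB:P0DTC2"}

def normalize(curie: str) -> str:
    """Return the preferred CURIE, falling back to the source CURIE."""
    return PREFERRED_ID.get(curie, curie)

assert normalize("NCBIGene:43740568") == "UniProtKB:P0DTC2"
assert normalize("CHEBI:15377") == "CHEBI:15377"  # no mapping -> unchanged
```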

PaulNSchofield commented 4 years ago

> @PaulNSchofield - thanks for the suggestion. Do you know whether there is much data (single-cell or otherwise) annotated to the lungmap ontology?

@cmungall - There is a resource on https://lungmap.net/. Quite an interesting set of 'omics data. They mainly focus on lung development, so lots of infant and juvenile data, but also some adult. Also see https://research.cchmc.org/pbge/LGEAOntologyBrowser/#/help, but I'm not sure how much these overlap. I'm not aware of anything outside this project. There's mouse stuff too, for which they use their mouse lung ontology. Summary in https://www.atsjournals.org/doi/abs/10.1164/ajrccm-conference.2018.197.1_MeetingAbstracts.A6126. Might be good to get Susan Wert from Cincinnati on board?

justaddcoffee commented 3 years ago

I'm wondering if we can close this ticket - @deepakunni3's ontology ingest I think ticks most of @cmungall's boxes up top. (Rereading this thread, it's a very useful reference - possibly could go on a wiki page.)

justaddcoffee commented 3 years ago

I'm going to close this ticket, although we could at some point capture Chris's advice above in a wiki page somewhere.

deepakunni3 commented 3 years ago

Yes, let's capture this in documentation, especially for KG-Hub. We already follow most of the recommendations above, but it's good to formalize the process so that it's used throughout our KG efforts :)