Closed cmungall closed 3 years ago
The Lungmap ontology might be useful. It's not perfect, but it has a very granular set of terms for anatomical and cellular adult lung components.
Happy to help with any ontology pre-processing necessary. I think we should not only maintain a list of imports, but also a list of object (and annotation) properties, and then materialise those between names only (A SubClassOf R some B axioms). This is particularly important for phenotypes (note that upheno2 already has the materialised "has affected entity" and "has phenotypic orthologue" relations, if cross-species integration is relevant). There is also the question of whether vague axioms like xrefs should be normalised to IRI format first (I have SPARQL update queries ready for that). But even if they are normalised, it's unclear whether they would confuse or support ML approaches. Maybe you want something like a COVID-focused Monarch.owl?
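Materialising the existential axioms between named classes can be sketched as below. This is purely illustrative: the axiom representation (dicts with `sub`/`prop`/`filler` keys) is not any real parser's output, and the UBERON/BFO IDs are just example CURIEs.

```python
# Hypothetical sketch: turn axioms of the form "A SubClassOf R some B"
# into plain (A, R, B) edges between named classes, skipping any axiom
# whose filler is an anonymous class expression.

def materialise(axioms):
    """axioms: dicts with 'sub', 'prop', 'filler'; named classes are
    plain CURIE strings, anonymous expressions are non-strings."""
    return [(ax["sub"], ax["prop"], ax["filler"])
            for ax in axioms
            if isinstance(ax["sub"], str) and isinstance(ax["filler"], str)]

axioms = [
    # lung SubClassOf part_of some lower respiratory tract (illustrative IDs)
    {"sub": "UBERON:0002048", "prop": "BFO:0000050", "filler": "UBERON:0001558"},
    # a complex (anonymous) filler is dropped rather than materialised
    {"sub": "UBERON:0002048", "prop": "BFO:0000050",
     "filler": {"and": ["X:1", "Y:1"]}},
]
print(materialise(axioms))  # -> [('UBERON:0002048', 'BFO:0000050', 'UBERON:0001558')]
```

Dropping the anonymous filler, rather than trying to translate it, is what keeps the output a clean named-class edge list for downstream graph tooling.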
> There is also the question of whether vague axioms like xrefs should be normalised to IRI format first (I have SPARQL update queries ready for that). But even if they are normalised, it's unclear whether they would confuse or support ML approaches.
Good point. In theory, methods like OPA2Vec should work with vague non-logical axioms. However, I think it makes the most sense to use these upstream in graph normalization.
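To make the xref-normalisation idea concrete, here is a minimal sketch of expanding CURIE-style xref strings to IRIs upstream of graph normalization. The prefix map is a tiny illustrative subset, not an authoritative registry, and this stands in for (not reproduces) the SPARQL update queries mentioned above.

```python
# Hypothetical normalisation of xref strings (e.g. "MESH:D011024") to IRIs.
# The prefix-to-base mapping here is illustrative only.

PREFIX_MAP = {
    "MESH": "http://id.nlm.nih.gov/mesh/",
    "NCIT": "http://purl.obolibrary.org/obo/NCIT_",
}

def xref_to_iri(xref):
    """Expand a CURIE-style xref to an IRI; return None if the prefix
    is unknown or the string is not CURIE-shaped."""
    prefix, sep, local_id = xref.partition(":")
    if not sep or prefix not in PREFIX_MAP:
        return None
    return PREFIX_MAP[prefix] + local_id

print(xref_to_iri("NCIT:C2926"))  # -> http://purl.obolibrary.org/obo/NCIT_C2926
print(xref_to_iri("no prefix"))   # -> None
```

Returning None for unknown prefixes (rather than guessing a base IRI) leaves the vague xrefs untouched, which matters if it is still unclear whether the normalised form helps or confuses the ML side.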
I don't know which strategy kg-hub intends to use (cc @justaddcoffee @deepakunni3)
@PaulNSchofield - thanks for the suggestion. Do you know if there is much data (single-cell or otherwise) annotated to the Lungmap ontology?
> It is the responsibility of each ingestor to produce edges and nodes using the preferred ID space of this project OR […]
My thinking was that we'd do this, at least at first: ingestors must use the preferred ID space. I've gone to a bit of trouble, for example, to find UniProt IDs for genes we ingest in a few kg_covid_19 ingests.
It'd simplify the transform a lot to not have to worry about this, of course. A clique-merge step would be really handy, but I don't think we have any functionality like this now. Does KGX provide anything like this?
kgx does have clique merge. It does add a bit of pipeline complexity, though. E.g. if you don't know in advance what species of genes you will ingest, then the clique-merge part of the pipeline will need a graph with equivalence edges between all genes/proteins in all species, which will be large. It also opens us up to global multiclique errors. We should probably open a new ticket for this.
Yes, KGX does support clique merge, but I agree with @cmungall . It adds complexity to the graph merge operation. It would be much easier to handle this at the time of transformation, even if that involves a few extra parsing steps. We can iterate over each data source so that it doesn't have to happen on the first iteration. Also, this ensures that each transformed data source, by itself, is usable without relying on additional transformations downstream.
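As a sketch of what a clique-merge step does (this is not KGX's actual API): union-find over equivalence edges, then rewrite every node to the clique's preferred ID chosen by prefix priority. The IDs and the priority list are illustrative; note that a clique spanning gene and protein namespaces, as here, is also exactly where the multiclique errors mentioned above can creep in.

```python
# Illustrative clique merge: union-find over same_as edges, then map
# each node to its clique's preferred ID by prefix priority.

PREFIX_PRIORITY = ["UniProtKB", "NCBIGene", "ENSEMBL"]  # most preferred first

def clique_merge(nodes, same_as_edges):
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in same_as_edges:
        parent[find(a)] = find(b)

    # group nodes into cliques by their union-find root
    cliques = {}
    for n in nodes:
        cliques.setdefault(find(n), []).append(n)

    def rank(n):
        prefix = n.split(":")[0]
        return (PREFIX_PRIORITY.index(prefix) if prefix in PREFIX_PRIORITY
                else len(PREFIX_PRIORITY), n)

    # map every member to the clique's preferred ID
    return {n: min(members, key=rank)
            for members in cliques.values() for n in members}

nodes = ["NCBIGene:43740568", "UniProtKB:P0DTC2", "ENSEMBL:ENSG0001"]
mapping = clique_merge(nodes, [("NCBIGene:43740568", "UniProtKB:P0DTC2"),
                               ("UniProtKB:P0DTC2", "ENSEMBL:ENSG0001")])
print(mapping["ENSEMBL:ENSG0001"])  # -> UniProtKB:P0DTC2
```

Doing this per-source at transform time, as suggested above, means the union-find only ever sees one ingest's equivalence edges instead of a global all-species equivalence graph.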
> @PaulNSchofield - thanks for the suggestion. Do you know if there is much data (single-cell or otherwise) annotated to the Lungmap ontology?
@cmungall - There is a resource at https://lungmap.net/ . Quite an interesting set of 'omics data. They mainly focus on lung development, so lots of infant and juvenile data, but also some adult. Also see https://research.cchmc.org/pbge/LGEAOntologyBrowser/#/help but I'm not sure how much these overlap. I'm not aware of anything outside this project. There's mouse stuff too, for which they use their mouse lung ontology. Summary in https://www.atsjournals.org/doi/abs/10.1164/ajrccm-conference.2018.197.1_MeetingAbstracts.A6126. Might be good to get Susan Wert from Cincinnati on board?
I'm wondering if we can close this ticket - @deepakunni3's ontology ingest I think ticks most of @cmungall's boxes up top. (Rereading this thread, it's a very useful reference - possibly could go on a wiki page.)
I'm going to close this ticket, although at some point we could capture Chris's advice above on a wiki page somewhere.
Yes, let's capture this in documentation, especially for KG-Hub. We already follow most of the recommendations above, but it's good to formalize the process so that it's used throughout our KG efforts :)
Some of these guidelines are general and apply to any bio-KG, but I have tried to tailor them for the COVID-19 KG.
Guidelines
Use the base version of an ontology, as this has the inter-ontology edges but does not include subsets of the external ontology as imports.

Parsing
For Python there are various choices of parser that can read the .owl for any ontology in OBO. However, for reasons stated above, extracting the relational graph is awkward. Fortunately, kgx will do this for you.

Ontology pre-processing
If there is a need to pre-process an ontology ahead of time, then we recommend http://robot.obolibrary.org/ - note that OBO-compliant ontologies should already have a certain amount of pre-processing done in advance, e.g. the ontology will be reasoned and validated. However, we may still want to do things like extract a slice of an ontology; this is where ROBOT can be used in a pipeline. It can also be remote-controlled via Python.
A key ROBOT command is http://robot.obolibrary.org/extract
This allows us to bring in only the relevant portions of ontologies. We would create a seed file, that is, a list of all ontology IDs used in the graph. Then we would extract a merged ontology using this seed (using the BOT method).
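A sketch of driving this from Python (per the "remote-controlled via Python" note above): write the seed file, then assemble a `robot extract` call with the BOT method. The file names and seed IDs are illustrative, and actually running the command requires ROBOT on the PATH, so the invocation is left commented out.

```python
# Illustrative ROBOT extract pipeline step: seed file + BOT-module extraction.
import subprocess

# seed: ontology IDs actually used in the graph (illustrative CURIEs)
seed_terms = ["UBERON:0002048", "CHEBI:15377", "HP:0002090"]

with open("seed.txt", "w") as f:
    f.write("\n".join(seed_terms) + "\n")

cmd = [
    "robot", "extract",
    "--method", "BOT",          # BOT: seed terms plus their superclasses
    "--input", "merged.owl",
    "--term-file", "seed.txt",
    "--output", "subset.owl",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment where ROBOT is installed
```

Keeping the seed file as a build artifact makes the extraction reproducible: regenerating the graph with new data sources just means regenerating seed.txt and re-running the same command.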
Which ontologies?
Which ontologies should be ingested, and which portions? This depends ultimately on the application and how it will be used. For a 'traditional' RDF KG we might simply try to ingest everything; queries can always exclude stuff. However, downstream applications may make particular assumptions. For example, some applications may assume that the number of hops over a relational graph is meaningful. A node embedding algorithm may make certain assumptions about graph properties and hence graph random walks. The assumptions of application developers, and in particular of ML applications, may not be aligned with the assumptions of ontology developers (see this post on KG design patterns). Care should be taken when ingesting an ontology that the subset brought in is meaningful. We may want certain post-processing steps, e.g. to implement shortcut relations that bypass non-informative (in the information-theory sense) intermediate nodes.
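A structural sketch of the shortcut idea, under the simplifying assumption that "non-informative" means a node with exactly one incoming and one outgoing edge (a real pipeline might use term information content instead):

```python
# Illustrative post-processing: bypass intermediate nodes that carry no
# branching information, replacing A -> X -> B with a shortcut A -> B.

def add_shortcuts(edges):
    edges = set(edges)
    while True:
        out_n, in_n = {}, {}
        for s, o in edges:
            out_n.setdefault(s, []).append(o)
            in_n.setdefault(o, []).append(s)
        # a node is bypassable if it has exactly one parent and one child
        bypass = [n for n in out_n
                  if len(out_n[n]) == 1 and len(in_n.get(n, [])) == 1]
        if not bypass:
            return edges
        x = bypass[0]
        a, b = in_n[x][0], out_n[x][0]
        edges -= {(a, x), (x, b)}
        edges.add((a, b))

# "mid" sits on a linear chain, so it gets bypassed; "leaf2" keeps its edge
print(sorted(add_shortcuts([("leaf", "mid"), ("mid", "root"),
                            ("leaf2", "root")])))
# -> [('leaf', 'root'), ('leaf2', 'root')]
```

Note this purely structural criterion is deliberately crude: it will happily collapse a hop whose intermediate class is biologically meaningful, so the bypass set would normally be reviewed or restricted to specific relations.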
I am not familiar enough with the N2V algorithm to give more specific recommendations. But I note that Rob H's group has had success with node embedding over graphs of OBO ontologies + annotations, so it's worth looking at this (caveat: when I say annotations I mean it in the traditional sense of associations to genes etc.; I am not sure that bringing in OWL annotations, as OPA2Vec does, is always useful).
Some specific ontology recommendations: generally an ontology is not useful unless it connects existing data elements, so we should avoid ingesting the kitchen sink here (though some ontologies like CHEBI are more 'database-like'). We should have separate tickets for each of these; I am just making general comments here.