Open fbastian opened 4 years ago
Actually, my idea mentioned above seems incorrect: our pipeline code does exactly the check I thought it was not doing (class InsertUberon, lines 734 to 781). I need to log the edges produced from uterus to investigate further.
The problem comes from the insertion of Uberon into our database. Some direct relations can be seen as indirect, because we sometimes have classes with relations such as:
1: (UBERON_0010011 and part_of NCBITaxon_9443) part_of UBERON_0010011
2: UBERON_0010011 SubClassOf UBERON_0010009
(No idea why relation 1 is needed, but it creates an indirect relation going from UBERON_0010011 to UBERON_0010009 in Primates.)
Or:
1: UBERON_0001532 SubClassOf UBERON_0003496
2: UBERON_0003496 part_of UBERON_0011362
3: (UBERON_0011362 and part_of NCBITaxon_7954) SubClassOf UBERON_0003496
(The latter says that, in Danio, all UBERON_0011362 are UBERON_0003496. It is the cause of cycles (see https://github.com/obophenotype/uberon/issues/651), but that is managed in another piece of code. It creates an indirect relation going from UBERON_0001532 to UBERON_0003496 in Danio.)
In these cases, we would get the correct direct relation as well (e.g., UBERON_0001532 SubClassOf UBERON_0003496), with a corresponding direct outgoing edge, but indirect relations have priority in the method org.bgee.pipeline.uberon.InsertUberon.generateRelationTOsSecondPass(Map, Set, Map, Set, Collection). (This is because, if we have in Uberon, e.g., A part_of B, A part_of C, B part_of C, we want A part_of C to be stored as an indirect relation going through B, and not as a direct relation that would be redundant.)
Because of the cycle, this creates an equivalent indirect outgoing edge that has priority over the direct outgoing edge.
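To make the priority rule concrete, here is a minimal Python sketch (not the pipeline's Java code). It assumes relations are plain (source, target) pairs with relation types ignored, and the function name classify_relations is hypothetical. A pair counts as indirect as soon as the target is reachable through a walk of two or more direct edges, which is one way to reproduce how a cycle can turn a genuine direct edge into an "indirect" one.

```python
from collections import defaultdict


def classify_relations(direct_edges):
    """Split (source, target) pairs into direct and indirect relations.

    Indirect relations have priority: a pair is indirect as soon as the
    target can be reached by walking two or more direct edges, even when
    a direct edge between the two terms also exists.
    """
    graph = defaultdict(set)
    for source, target in direct_edges:
        graph[source].add(target)

    direct, indirect = set(), set()
    for source in list(graph):
        # Each stack entry is (node, number of edges walked so far).
        stack = [(target, 1) for target in graph[source]]
        seen = set()
        while stack:
            node, depth = stack.pop()
            state = (node, min(depth, 2))  # only "1 edge" vs "2+ edges" matters
            if state in seen:
                continue
            seen.add(state)
            if depth >= 2 and node != source:
                indirect.add((source, node))
            for nxt in graph.get(node, ()):
                stack.append((nxt, depth + 1))
        direct |= {(source, t) for t in graph[source]
                   if (source, t) not in indirect}
    return direct, indirect


# The A/B/C case: A -> C is stored as indirect (going through B).
print(classify_relations([("A", "B"), ("A", "C"), ("B", "C")]))

# The Danio case, with the taxon-restricted axiom treated as a plain edge:
# the cycle makes UBERON_0001532 -> UBERON_0003496 indirect as well.
danio_edges = [
    ("UBERON_0001532", "UBERON_0003496"),
    ("UBERON_0003496", "UBERON_0011362"),
    ("UBERON_0011362", "UBERON_0003496"),  # only holds in Danio (GCI)
]
danio_direct, danio_indirect = classify_relations(danio_edges)
print(("UBERON_0001532", "UBERON_0003496") in danio_indirect)
```

In the second example, the walk UBERON_0001532 -> UBERON_0003496 -> UBERON_0011362 -> UBERON_0003496 has three edges, so the pair is classified as indirect and the direct edge is dropped.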
Now that I think about it:
=> for consistent call propagation between Bgee and topGO, we need to be sure that all indirect relations can be retrieved through a chain of direct relations.
Maybe, instead of using OWLGraphWrapper to retrieve all outgoing edges (including indirect relations), we should do the object property composition ourselves, generating the indirect relations by walking the chain of direct relations. This way, we would be certain that our graph and the graph produced by topGO are identical.
The problem is that we might "miss" indirect relations that OWLGraphWrapper would have retrieved using the complex axioms in Uberon. But what's the point if it leads to inconsistencies with topGO? Besides, Uberon is pre-reasoned, so any complex relation should already have been materialized.
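A rough Python sketch of what "doing the composition ourselves" could look like (again, not pipeline code). The composition table is an assumption limited to is_a and part_of, and the names COMPOSE and closure_from_direct are hypothetical. Every relation it returns is, by construction, backed by a chain of direct relations.

```python
from collections import deque

# Assumed composition table, limited to the two relation types discussed here.
COMPOSE = {
    ("is_a", "is_a"): "is_a",
    ("is_a", "part_of"): "part_of",
    ("part_of", "is_a"): "part_of",
    ("part_of", "part_of"): "part_of",
}


def closure_from_direct(direct_edges):
    """Return all (source, relation, target) triples obtained by chaining
    direct edges, composing relation types along the way.

    The result includes the direct edges themselves.
    """
    outgoing = {}
    for source, rel, target in direct_edges:
        outgoing.setdefault(source, []).append((rel, target))

    closure = set()
    for source in outgoing:
        queue = deque(outgoing[source])
        seen = set()
        while queue:
            rel, node = queue.popleft()
            if (rel, node) in seen:
                continue
            seen.add((rel, node))
            closure.add((source, rel, node))
            for next_rel, next_node in outgoing.get(node, []):
                composed = COMPOSE.get((rel, next_rel))
                if composed is not None:  # skip chains we cannot compose
                    queue.append((composed, next_node))
    return closure


direct = [("A", "is_a", "B"), ("B", "part_of", "C")]
indirect = closure_from_direct(direct) - set(direct)
print(indirect)  # the only inferred relation is A part_of C
```

The indirect relations are simply the closure minus the direct edges, so anything topGO can reach from the direct edges would be in this set and vice versa.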
We noticed the following issue: from the term UBERON:0001295 endometrium, in platypus we can reach the following ancestors through indirect relations:
But we do not manage to reach the following structures by following the chain of direct relations stored in the database for platypus:
They are all ancestors of uterus, which does not exist in platypus. Hmm... See the chain of direct relations in the database for platypus:
The direct/indirect relations are retrieved in our pipeline from our custom version of Uberon, in generated_files/uberon/custom_composite.obo, in the method org.bgee.pipeline.uberon.InsertUberon.generateRelationTOsFirstPass(Map, Map, Uberon, Set, Collection) of bgee_pipeline of BgeeDB/bgee_apps. This code uses OWLGraphWrapper. First, it retrieves all relations, with chains of object properties packed if possible, with the method OWLGraphWrapper.getOutgoingEdgesNamedClosureOverSupPropsWithGCI(OWLClass). Then it also retrieves direct relations with the method OWLGraphWrapper.getOutgoingEdgesWithGCI(OWLClass); that's how the distinction between direct and indirect relations is made.
=> it means that OWLGraphWrapper has inferred, by relation reduction, a set of indirect relations that we do not retrieve just by following the direct relations we have stored in the Bgee database. Need to investigate how the relations returned by getOutgoingEdgesNamedClosureOverSupPropsWithGCI are produced. Is it a bug, or is it all good?
IDEA: maybe the GCI relations do not consider taxon constraints on OWLClasses. We take them into account for insertion into the database. Maybe getOutgoingEdgesNamedClosureOverSupPropsWithGCI and getOutgoingEdgesWithGCI have indeed retrieved relations between endometrium and uterus in platypus, but we removed them at insertion time because uterus does not exist in platypus. But then, we have still kept the relations that had been inferred thanks to the relations incoming to or outgoing from uterus. The fix would be to discard the relations returned by getOutgoingEdgesNamedClosureOverSupPropsWithGCI if they go through an OWLClass that does not exist in the requested species. And if it is really the source of the problem, after the fix we can add a check after insertion in the database, walking the path of direct relations to check whether we retrieve exactly the same terms as those reached by indirect relations.
But, if it is not a bug and these ancestors are really reachable through relation reduction, we need to think about how to provide relations to topGO: we provide it with only the direct relations, so it is not capable of reaching the terms we reach in Bgee through inferred indirect relations.
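The post-insertion check could look roughly like the Python sketch below (not pipeline code). It assumes the direct and indirect relations for one species have been loaded as (source, target) pairs; the function name and the term labels in the example are placeholders that only mirror the platypus situation, not actual database content.

```python
def unreachable_indirect_relations(direct_edges, indirect_edges):
    """Return the stored indirect (source, target) pairs that cannot be
    explained by any chain of stored direct relations."""
    outgoing = {}
    for source, target in direct_edges:
        outgoing.setdefault(source, set()).add(target)

    def reachable(start):
        # Plain depth-first walk over direct edges only.
        found, stack = set(), list(outgoing.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(outgoing.get(node, ()))
        return found

    cache = {}
    missing = set()
    for source, target in indirect_edges:
        if source not in cache:
            cache[source] = reachable(source)
        if target not in cache[source]:
            missing.add((source, target))
    return missing


# Placeholder data: the full graph goes endometrium -> uterus -> some ancestor,
# but "uterus" does not exist in the species, so its edges are not inserted.
full_direct = [("endometrium", "uterus"), ("uterus", "ancestor_of_uterus")]
stored_direct = [edge for edge in full_direct if "uterus" not in edge]
stored_indirect = [("endometrium", "ancestor_of_uterus")]

print(unreachable_indirect_relations(stored_direct, stored_indirect))
```

An empty result would mean that Bgee and topGO see the same set of ancestors for every term. A non-empty result lists exactly the relations topGO could not reproduce from the direct edges alone.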