BgeeDB / bgee_apps

Source code of the Java Bgee applications
https://www.bgee.org/
Creative Commons Zero v1.0 Universal

Uberon relation reductions in our pipeline #227

Open fbastian opened 4 years ago

fbastian commented 4 years ago

We noticed the following issue: from the term UBERON:0001295 endometrium, in platypus we can reach the following ancestors through indirect relations:

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.anatEntitySourceId = 'UBERON:0001295' and t1.relationType = 'is_a part_of' and t1.relationStatus = 'indirect';
+--------------------+----------------------------+
| anatEntityTargetId | anatEntityName             |
+--------------------+----------------------------+
| UBERON:0000025     | tube                       |
| UBERON:0000060     | anatomical wall            |
| UBERON:0000061     | anatomical structure       |
| UBERON:0000062     | organ                      |
| UBERON:0000064     | organ part                 |
| UBERON:0000344     | mucosa                     |
| UBERON:0000465     | material anatomical entity |
| UBERON:0000467     | anatomical system          |
| UBERON:0000468     | multi-cellular organism    |
| UBERON:0000474     | female reproductive system |
| UBERON:0000480     | anatomical group           |
| UBERON:0000990     | reproductive system        |
| UBERON:0000993     | oviduct                    |
| UBERON:0003100     | female organism            |
| UBERON:0003133     | reproductive organ         |
| UBERON:0004111     | anatomical conduit         |
| UBERON:0004120     | mesoderm-derived structure |
| UBERON:0004175     | internal genitalia         |
| UBERON:0004923     | organ component layer      |
| UBERON:0005156     | reproductive structure     |
| UBERON:0013515     | subdivision of oviduct     |
| UBERON:0013522     | subdivision of tube        |
+--------------------+----------------------------+

But we do not manage to reach the following structures by following the chain of direct relations stored in the database for platypus:

UBERON:0000025  tube
UBERON:0000993  oviduct
UBERON:0004111  anatomical conduit
UBERON:0004175  internal genitalia
UBERON:0013515  subdivision of oviduct
UBERON:0013522  subdivision of tube

They are all ancestors of uterus, a structure that does not exist in platypus. Here is the chain of direct relations stored in the database for platypus, followed one level at a time:

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId = 'UBERON:0001295';
+--------------------+----------------------------+
| anatEntityTargetId | anatEntityName             |
+--------------------+----------------------------+
| UBERON:0019042     | reproductive system mucosa |
+--------------------+----------------------------+
1 row in set (0.09 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId = 'UBERON:0019042';
+--------------------+------------------------+
| anatEntityTargetId | anatEntityName         |
+--------------------+------------------------+
| UBERON:0000344     | mucosa                 |
| UBERON:0005156     | reproductive structure |
+--------------------+------------------------+
2 rows in set (0.12 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000344', 'UBERON:0005156');
+--------------------+-----------------------+
| anatEntityTargetId | anatEntityName        |
+--------------------+-----------------------+
| UBERON:0004923     | organ component layer |
| UBERON:0000990     | reproductive system   |
+--------------------+-----------------------+
2 rows in set (0.11 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0004923', 'UBERON:0000990');
+--------------------+----------------------------+
| anatEntityTargetId | anatEntityName             |
+--------------------+----------------------------+
| UBERON:0004120     | mesoderm-derived structure |
| UBERON:0000060     | anatomical wall            |
+--------------------+----------------------------+
2 rows in set (0.10 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0004120', 'UBERON:0000060');
+--------------------+----------------------+
| anatEntityTargetId | anatEntityName       |
+--------------------+----------------------+
| UBERON:0000064     | organ part           |
| UBERON:0000061     | anatomical structure |
+--------------------+----------------------+
2 rows in set (0.07 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000064', 'UBERON:0000061');
+--------------------+----------------------------+
| anatEntityTargetId | anatEntityName             |
+--------------------+----------------------------+
| UBERON:0000465     | material anatomical entity |
| UBERON:0000062     | organ                      |
+--------------------+----------------------------+
2 rows in set (0.08 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000465', 'UBERON:0000062');
+--------------------+-------------------+
| anatEntityTargetId | anatEntityName    |
+--------------------+-------------------+
| UBERON:0000467     | anatomical system |
+--------------------+-------------------+
1 row in set (0.10 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000467');
+--------------------+------------------+
| anatEntityTargetId | anatEntityName   |
+--------------------+------------------+
| UBERON:0000480     | anatomical group |
+--------------------+------------------+
1 row in set (0.09 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000480');
+--------------------+-------------------------+
| anatEntityTargetId | anatEntityName          |
+--------------------+-------------------------+
| UBERON:0000468     | multi-cellular organism |
+--------------------+-------------------------+
1 row in set (0.14 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000468');
+--------------------+----------------------+
| anatEntityTargetId | anatEntityName       |
+--------------------+----------------------+
| UBERON:0000061     | anatomical structure |
+--------------------+----------------------+
1 row in set (0.10 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000061');
+--------------------+----------------------------+
| anatEntityTargetId | anatEntityName             |
+--------------------+----------------------------+
| UBERON:0000465     | material anatomical entity |
+--------------------+----------------------------+
1 row in set (0.16 sec)

mysql> select t1.anatEntityTargetId, t3.anatEntityName from anatEntityRelation as t1 inner join anatEntityRelationTaxonConstraint as t2 on t1.anatEntityRelationId = t2.anatEntityRelationId inner join anatEntity as t3 on t1.anatEntityTargetId = t3.anatEntityId where (t2.speciesId is null or t2.speciesId = 9258) and t1.relationType = 'is_a part_of' and t1.relationStatus = 'direct' and t1.anatEntitySourceId in ('UBERON:0000465');
Empty set (0.08 sec)
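
The manual walk above can be automated. Below is a minimal sketch, assuming the direct relations for one species have already been loaded into an in-memory map (source term to direct targets). The class name, the method names and the toy term IDs T1, T2, T3 and T9 are hypothetical and are not taken from the Bgee code base or database.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class DirectChainCheck {

    /** Collects every term reachable from start by following direct edges. */
    static Set<String> reachable(Map<String, Set<String>> directEdges, String start) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String target : directEdges.getOrDefault(current, Set.of())) {
                // seen.add returns false for already visited terms, so cycles terminate
                if (seen.add(target)) {
                    queue.add(target);
                }
            }
        }
        return seen;
    }

    /** Stored indirect targets that no chain of direct edges leads to. */
    static Set<String> unreachableIndirect(Map<String, Set<String>> directEdges,
                                           String start, Set<String> storedIndirect) {
        Set<String> missing = new TreeSet<>(storedIndirect);
        missing.removeAll(reachable(directEdges, start));
        return missing;
    }

    public static void main(String[] args) {
        // Toy data, NOT database contents: T1 -> T2 -> T3 are direct relations,
        // while T9 is stored as an indirect target with no supporting chain.
        Map<String, Set<String>> direct = Map.of(
                "T1", Set.of("T2"),
                "T2", Set.of("T3"));
        Set<String> storedIndirect = Set.of("T3", "T9");
        System.out.println(unreachableIndirect(direct, "T1", storedIndirect)); // [T9]
    }
}
```

A non-empty result corresponds to the situation reported here: an indirect relation that the chain of direct relations does not support.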

The direct and indirect relations are retrieved in our pipeline from our custom version of Uberon, generated_files/uberon/custom_composite.obo, in the method org.bgee.pipeline.uberon.InsertUberon.generateRelationTOsFirstPass(Map, Map, Uberon, Set, Collection) of bgee_pipeline in BgeeDB/bgee_apps. This code uses OWLGraphWrapper. First, it retrieves all relations, with chains of object properties combined where possible, using the method OWLGraphWrapper.getOutgoingEdgesNamedClosureOverSupPropsWithGCI(OWLClass). Then it retrieves the direct relations using the method OWLGraphWrapper.getOutgoingEdgesWithGCI(OWLClass). That is how the distinction between direct and indirect relations is made.
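
As a rough illustration of that first pass, here is a simplified sketch. It assumes the two OWLGraphWrapper results have already been reduced to sets of target term IDs; the real method works on edge objects, and the class and method names below are hypothetical. It is also an assumption that the closure contains the direct target UBERON:0019042.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class RelationStatusSketch {

    enum Status { DIRECT, INDIRECT }

    /**
     * closureTargets stands in for the targets of the full closure,
     * directTargets for the targets of the direct outgoing edges.
     */
    static Map<String, Status> classify(Set<String> closureTargets, Set<String> directTargets) {
        Map<String, Status> statusByTarget = new TreeMap<>();
        for (String target : closureTargets) {
            // A target counts as direct only if a direct edge also leads to it
            statusByTarget.put(target,
                    directTargets.contains(target) ? Status.DIRECT : Status.INDIRECT);
        }
        return statusByTarget;
    }

    public static void main(String[] args) {
        // Three of the endometrium targets listed earlier in this issue
        Set<String> closure = Set.of("UBERON:0019042", "UBERON:0000344", "UBERON:0000993");
        Set<String> direct = Set.of("UBERON:0019042");
        System.out.println(classify(closure, direct));
    }
}
```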

=> This means that OWLGraphWrapper has inferred, through relation reduction, a set of indirect relations that we cannot retrieve just by following the direct relations stored in the Bgee database. We need to investigate how the relations returned by getOutgoingEdgesNamedClosureOverSupPropsWithGCI are produced. Is it a bug, or is it all good?

IDEA: maybe the GCI relations do not consider taxon constraints on OWLClasses, whereas we take them into account for insertion into the database. Maybe getOutgoingEdgesNamedClosureOverSupPropsWithGCI and getOutgoingEdgesWithGCI have indeed retrieved relations between endometrium and uterus in platypus, but we removed them at insertion time because uterus does not exist in platypus, while still keeping the relations that had been inferred through the relations incoming to or outgoing from uterus. The fix would be to discard the relations returned by getOutgoingEdgesNamedClosureOverSupPropsWithGCI if they go through an OWLClass that does not exist in the requested species. If this is really the source of the problem, after the fix we can add a check after insertion into the database, walking the path of direct relations to check whether we retrieve exactly the terms reached by indirect relations.
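
The proposed filter could look roughly like the sketch below. It assumes each inferred relation comes with the list of classes on its path, which is not something the OWLGraphWrapper methods named above are known to return in this form. The class name, the method name, the example paths and the set of structures assumed to exist in platypus are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class TaxonPathFilter {

    /** Keeps only the paths whose every class exists in the requested species. */
    static List<List<String>> keepValidPaths(List<List<String>> paths,
                                             Set<String> existingInSpecies) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> path : paths) {
            // One absent class anywhere on the path invalidates the inferred relation
            if (existingInSpecies.containsAll(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative paths: the second one goes through uterus
        List<List<String>> paths = List.of(
                List.of("endometrium", "reproductive system mucosa", "mucosa"),
                List.of("endometrium", "uterus", "oviduct"));
        // Assumed set of structures existing in platypus (uterus is absent)
        Set<String> inPlatypus = Set.of(
                "endometrium", "reproductive system mucosa", "mucosa", "oviduct");
        System.out.println(keepValidPaths(paths, inPlatypus));
        // [[endometrium, reproductive system mucosa, mucosa]]
    }
}
```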

But if it is not a bug and these ancestors are really reachable through relation reduction, we need to think about how to provide relations to topGO: we provide it with only the direct relations, so it cannot reach the terms that we reach in Bgee through inferred indirect relations.

fbastian commented 4 years ago

Actually, my idea mentioned above seems incorrect: our pipeline code does exactly the check I thought it was not doing, in class InsertUberon, lines 734 to 781. I need to log the edges produced from uterus to investigate further.

fbastian commented 4 years ago

The problem comes from the insertion of Uberon into our database. Some direct relations can be seen as indirect, because we sometimes have classes with relations such as:

1: (UBERON_0010011 and part_of NCBITaxon_9443) part_of UBERON_0010011
2: UBERON_0010011 SubClassOf UBERON_0010009

(no idea why relation 1 is needed, but it creates an indirect relation going from UBERON_0010011 to UBERON_0010009 in Primates)

Or:

1: UBERON_0001532 SubClassOf UBERON_0003496
2: UBERON_0003496 part_of UBERON_0011362
3: (UBERON_0011362 and part_of NCBITaxon_7954) SubClassOf UBERON_0003496

(relation 3 says that, in Danio, all UBERON_0011362 are UBERON_0003496; it is a cause of cycles (see https://github.com/obophenotype/uberon/issues/651), but that is handled in another piece of code. It creates an indirect relation going from UBERON_0001532 to UBERON_0003496 in Danio.)

In these cases, we also get the correct direct relation (e.g., UBERON_0001532 SubClassOf UBERON_0003496), with a corresponding direct outgoing edge. But indirect relations have priority in the method org.bgee.pipeline.uberon.InsertUberon.generateRelationTOsSecondPass(Map, Set, Map, Set, Collection): if we have in Uberon, e.g., A part_of B, A part_of C, and B part_of C, we want A part_of C to be stored as an indirect relation going through B, and not as a redundant direct relation. Because of the cycle, an equivalent indirect outgoing edge is created, and it takes priority over the direct outgoing edge.
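
To make the effect of the cycle concrete, here is a minimal sketch of the priority rule, not the actual generateRelationTOsSecondPass code. It assumes all relations have been flattened to untyped edges, with the Danio GCI (relation 3 above) simplified to a plain edge from UBERON_0011362 to UBERON_0003496. The class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class RelationPrioritySketch {

    /** True if target is reachable from source through a path of at least two edges. */
    static boolean hasLongerPath(Map<String, Set<String>> edges, String source, String target) {
        // Start from the terms one edge away from source
        Deque<String> toVisit = new ArrayDeque<>(edges.getOrDefault(source, Set.of()));
        Set<String> visited = new HashSet<>();
        while (!toVisit.isEmpty()) {
            String current = toVisit.pop();
            if (!visited.add(current)) {
                continue; // already expanded, avoids looping on cycles
            }
            for (String next : edges.getOrDefault(current, Set.of())) {
                if (next.equals(target)) {
                    return true; // reached target in two or more edges
                }
                toVisit.push(next);
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A part_of B, A part_of C, B part_of C: A -> C is redundant, A -> B is not
        Map<String, Set<String>> acyclic = Map.of(
                "A", Set.of("B", "C"),
                "B", Set.of("C"));
        System.out.println(hasLongerPath(acyclic, "A", "C")); // true
        System.out.println(hasLongerPath(acyclic, "A", "B")); // false

        // The Danio case above, with the GCI simplified to a plain edge
        Map<String, Set<String>> danio = Map.of(
                "UBERON_0001532", Set.of("UBERON_0003496"),
                "UBERON_0003496", Set.of("UBERON_0011362"),
                "UBERON_0011362", Set.of("UBERON_0003496"));
        // The cycle makes the genuine direct edge look redundant
        System.out.println(hasLongerPath(danio, "UBERON_0001532", "UBERON_0003496")); // true
    }
}
```

Under this rule, the edge from UBERON_0001532 to UBERON_0003496 would be stored as indirect even though no other direct edge from UBERON_0001532 leads to it, which is the inconsistency described above.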

fbastian commented 4 years ago

Now that I think about it:

=> for consistent call propagation between Bgee and topGO, we need to be sure that all indirect relations can be retrieved through a chain of direct relations.

Maybe, instead of using OWLGraphWrapper to retrieve all outgoing edges (including indirect relations), we should do the object property composition ourselves, and generate the indirect relations by walking the chain of direct relations. This way, we would be certain that our graph and the graph produced by topGO are identical. The problem is that we may "miss" indirect relations that OWLGraphWrapper would have retrieved using the complex axioms in Uberon. But what's the point of retrieving them if it leads to inconsistencies with topGO? Besides, Uberon is pre-reasoned, so any complex relation should already have been materialized.
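
A minimal sketch of such a composition follows, assuming only two relation types and the usual OBO composition rules (is_a followed by is_a gives is_a; any chain involving part_of gives part_of). Uberon has more object properties than these two, and the database merges both under the relationType 'is_a part_of', so this only illustrates the walking logic. All names and the toy terms A to D are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ComposeSketch {

    enum Rel { IS_A, PART_OF }

    /** is_a . is_a = is_a; any chain involving part_of yields part_of. */
    static Rel compose(Rel first, Rel second) {
        return (first == Rel.IS_A && second == Rel.IS_A) ? Rel.IS_A : Rel.PART_OF;
    }

    /** Relations (target, type) inferred from source over two or more direct edges. */
    static Set<Map.Entry<String, Rel>> inferred(
            Map<String, Set<Map.Entry<String, Rel>>> direct, String source) {
        Set<Map.Entry<String, Rel>> directEdges = direct.getOrDefault(source, Set.of());
        Set<Map.Entry<String, Rel>> seen = new HashSet<>(directEdges);
        Deque<Map.Entry<String, Rel>> queue = new ArrayDeque<>(directEdges);
        Set<Map.Entry<String, Rel>> result = new HashSet<>();
        while (!queue.isEmpty()) {
            Map.Entry<String, Rel> reached = queue.poll();
            for (Map.Entry<String, Rel> next : direct.getOrDefault(reached.getKey(), Set.of())) {
                Map.Entry<String, Rel> composed =
                        Map.entry(next.getKey(), compose(reached.getValue(), next.getValue()));
                result.add(composed);
                // Each (target, type) pair is expanded once, so cycles terminate
                if (seen.add(composed)) {
                    queue.add(composed);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy graph: A part_of B, A part_of C, B part_of C, C is_a D
        Map<String, Set<Map.Entry<String, Rel>>> direct = Map.of(
                "A", Set.of(Map.entry("B", Rel.PART_OF), Map.entry("C", Rel.PART_OF)),
                "B", Set.of(Map.entry("C", Rel.PART_OF)),
                "C", Set.of(Map.entry("D", Rel.IS_A)));
        // Contains C=PART_OF (through B) and D=PART_OF (through C)
        System.out.println(inferred(direct, "A"));
    }
}
```

Because every inferred relation is built from a chain of direct edges, a consumer that only receives the direct edges, as topGO does, can reach the same targets by construction.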