INCATools / boomer

Bayesian OWL ontology merging
https://incatools.github.io/boomer/
BSD 3-Clause "New" or "Revised" License

Merging 14 Ontologies (huge merge) #403

Open OliverHex opened 9 months ago

OliverHex commented 9 months ago

Hello,

I am trying to merge 14 ontologies at once with Boomer: DERMO, DO, HUGO, ICDO, IDO, IEDB, MESH, MFOMD, MPATH, NCIT, OBI, OGMS, ORPHANET, and SCDO.

This is how I proceed:

I have run various tests and it seems that when the ptable is too large, the problem becomes intractable.

By removing MESH and NCIT (i.e. merging only 12 ontologies), the resulting union ontology has only 81K classes (242 MB) and the ptable contains only 7K entries. In this case, Boomer finishes with a result in 30 min (on an i7 at 1.90 GHz with 32 GB RAM).

But I also need MESH and NCIT to be included in the merge result.

Overall, I am wondering whether this is the correct way to proceed.

Here are some questions:

  1. Should I continue with this strategy, i.e. keep trying to merge everything at once, so that Boomer has complete decision power over selecting the best mappings (without introducing any bias)?

  2. Or should I change my merging strategy? I could split the problem into smaller sub-problems, organize them in some order (according to some criteria, which could introduce some bias), and then launch Boomer following this order.

    For example, I could try this:

    • I convert the 91 alignments into 91 ptables (instead of converting and merging them into 1 single ptable).
    • For each of the 91 ptables, I launch Boomer with this ptable and the union OWL file, then add all the equivalence axioms generated by Boomer for this ptable to the union OWL file.

    So far, this seems to work much faster. But the arbitrary order of the for-loop introduces a bias: each equivalence axiom added at one step influences Boomer's results in the following steps. (A rough orchestration sketch of this loop is shown below.)
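To make that concrete, here is a minimal orchestration sketch of the loop. The boomer flag names, the accepted-axioms file name, and the paths are assumptions (check `boomer --help` and the output directory layout of your boomer version); ROBOT's `merge` command is used to fold the newly accepted axioms back into the union ontology.

```python
import subprocess
from pathlib import Path

# Assumptions (hypothetical paths and flag names, adapt as needed):
#  - union.owl is the merged input ontology
#  - ptables/ contains the 91 per-alignment ptables
#  - boomer writes the accepted axioms as an OWL file inside its output directory
#  - ROBOT is on the PATH and used to fold those axioms back into the union
union = Path("union.owl")
ptables = sorted(Path("ptables").glob("*.tsv"))  # this ordering is exactly the bias in question

for i, ptable in enumerate(ptables):
    outdir = Path(f"boomer-out-{i:02d}")
    # Flag names below are assumptions; verify them against `boomer --help`.
    subprocess.run(
        ["boomer",
         "--ptable", str(ptable),
         "--ontology", str(union),
         "--output", str(outdir),
         "--runs", "10",
         "--window-count", "10"],
        check=True,
    )
    accepted = outdir / "accepted-axioms.owl"  # hypothetical output file name
    merged = Path(f"union-{i:02d}.owl")
    # Fold the newly accepted equivalence axioms into the union before the next iteration.
    subprocess.run(
        ["robot", "merge",
         "--input", str(union),
         "--input", str(accepted),
         "--output", str(merged)],
        check=True,
    )
    union = merged
```

Shuffling the order of `ptables` over a few repetitions and comparing the final merges would at least show how sensitive the result is to this ordering.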

Any suggestions?

Oliver

PS: I couldn't attach the Boomer input union ontology (compressed, ~140 MB) since the maximum attachment size is 25 MB. However, the input ptable is attached: ptable-91-mappings.zip.

balhoff commented 8 months ago

Hi @OliverHex, sorry — I was busy last week then was out sick. Doing a really large mapping like this will take some experimentation. I suspect you may have to do it incrementally as you suggest. I think you're pushing the boundaries of what we've applied boomer to so far! @matentzn might have some insights, but I think he has hit some of the same issues (and may have worked with some of the same ontologies). Sorry I haven't been more helpful so far; recently I haven't had too much time to work deeply on boomer.

matentzn commented 8 months ago

I hit many of the same limits, @OliverHex. Unfortunately I had to shelve my work on this for the time being, despite it being such a high priority. I think the best workflow is actually to re-imagine boomer as a curation tool rather than a mass alignment tool:

[workflow diagram image]

So basically, you align, use the low-probability cliques to find issues, fix the input alignment, and iterate.
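As a rough illustration of the "fix the input alignment and iterate" step, something like the sketch below can drop rejected mappings from a ptable before the next boomer run. It only assumes that the ptable is a TSV whose first two columns are the CURIEs of the mapped terms; `rejected_pairs.tsv` is a hypothetical list you would build by hand while reviewing the low-probability cliques in the boomer report.

```python
import csv

# Hypothetical input: pairs of CURIEs rejected while reviewing low-probability
# cliques in the boomer report, one tab-separated pair per line.
with open("rejected_pairs.tsv") as f:
    rejected = {tuple(sorted(row[:2])) for row in csv.reader(f, delimiter="\t") if row}

# Keep only the ptable rows whose (subject, object) pair was not rejected.
with open("ptable.tsv") as src, open("ptable-curated.tsv", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="\t")
    for row in csv.reader(src, delimiter="\t"):
        if row and tuple(sorted(row[:2])) not in rejected:
            writer.writerow(row)
```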

But the problem of aligning so many conflicting ontologies remains. In my view, even if we figure out the scaling issue, this problem cannot be solved properly right now unless we can first encode the subClassOf edges in the input as probabilistic statements (there are so many conflicts among the disease ontologies).
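One way this could be prototyped with the existing ptable format is to extract the asserted subClassOf edges and emit them as additional ptable rows with most of the probability mass on the subclass hypothesis, so boomer is allowed to reject a few of them to resolve conflicts. This is only a sketch under several assumptions: the prefix map is a placeholder, the union file is assumed to be RDF/XML, and the column order of the four probabilities must be checked against the boomer ptable documentation.

```python
import csv
from rdflib import Graph, RDFS, URIRef

# Placeholder prefix map; extend it to cover the ontologies in the merge.
PREFIXES = {
    "http://purl.obolibrary.org/obo/NCIT_": "NCIT:",
    "http://purl.obolibrary.org/obo/DOID_": "DOID:",
}

def curie(iri: str):
    for base, prefix in PREFIXES.items():
        if iri.startswith(base):
            return prefix + iri[len(base):]
    return None  # skip terms without a known prefix

g = Graph()
g.parse("union.owl")  # assumes RDF/XML; rdflib guesses the format from the extension

with open("subclass-ptable.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for sub, obj in g.subject_objects(RDFS.subClassOf):
        if not (isinstance(sub, URIRef) and isinstance(obj, URIRef)):
            continue  # skip blank nodes coming from OWL class expressions
        s, o = curie(str(sub)), curie(str(obj))
        if s and o:
            # Four hypothesis probabilities; the column order shown here
            # (equivalent, subClassOf, superClassOf, no relation) is an
            # assumption -- check the boomer ptable documentation.
            writer.writerow([s, o, 0.02, 0.90, 0.02, 0.06])
```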

Please feel free to keep us posted - I unfortunately do not have a good solution for you right now.

OliverHex commented 8 months ago

Thank you very much for your answers; I am starting to understand the strengths and limits of Boomer.

Here is an update:

I have launched Boomer with these parameters:

In your answer, you say that Boomer can be used as a curation tool by focusing on the low-probability cliques to curate the input mappings.

This is very interesting.

Is there documentation or a wiki that explains the methodology for using Boomer as a curation tool? How can I find the contents of the cliques (the entity IRIs in each clique)? (I only get the clique sizes in the console output.)

Oliver

Log file: log.txt

matentzn commented 8 months ago

Hey @OliverHex, what you are doing is of great importance to me as well. If you like, add me on LinkedIn or send me an email here: https://github.com/monarch-initiative/pheval/blob/a685b171344cedf0f6ab37962fd8e6da36faa575/pyproject.toml#L7 (just a random place I found where my email is published; GitHub hides these), and we can set up a call to see if we can join forces.

matentzn commented 7 months ago

@OliverHex just following up - are you still working on this? Interested in pushing the envelope a bit together?

OliverHex commented 6 months ago

Hello,

Sorry for replying so late... Yes, sure, I am interested in further exploring ontology alignment and Bayesian merging! But at the moment, I am working on something else. I might switch back to ontology alignment and Bayesian merging in a few weeks. I will keep you updated, thanks for asking!

Oliver