OliverHex opened 9 months ago
Hi @OliverHex, sorry — I was busy last week then was out sick. Doing a really large mapping like this will take some experimentation. I suspect you may have to do it incrementally as you suggest. I think you're pushing the boundaries of what we've applied boomer to so far! @matentzn might have some insights, but I think he has hit some of the same issues (and may have worked with some of the same ontologies). Sorry I haven't been more helpful so far; recently I haven't had too much time to work deeply on boomer.
I hit much the same limits, @OliverHex. Unfortunately I had to shelve my work on this for the time being, despite it being such a super high priority. I think the best workflow is actually to re-imagine boomer as a curation tool rather than a mass alignment tool:
So basically, you align, use the low-probability cliques to find issues, fix the input alignment, and iterate.
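One iteration of that loop, at the ptable level, could look like the sketch below (a minimal sketch only: it assumes the usual TSV ptable layout with subject and object in the first two columns, and the rejected pairs are hypothetical examples of mappings flagged while reviewing a low-probability clique):

```python
import csv

# One curation iteration: drop the mappings a curator rejected while
# reviewing low-probability cliques, then re-run boomer on the result.
# The pairs below are hypothetical placeholders.
rejected = {("MESH:D003924", "DOID:9352")}

with open("ptable.tsv") as src, open("ptable.curated.tsv", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    for row in reader:
        subj, obj = row[0], row[1]
        if (subj, obj) in rejected or (obj, subj) in rejected:
            continue  # curator judged this mapping wrong; remove it
        writer.writerow(row)
```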
But the problem of aligning so many conflicting ontologies remains. In my view, even if we figure out the scaling issue, this problem cannot be solved properly right now unless we can first encode the subClassOf edges in the input as probabilistic statements (there are so many conflicts around disease ontologies).
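To make that concrete: instead of asserting each subClassOf edge as a hard axiom, one could emit a ptable row that puts most of the probability mass on the subClassOf interpretation. A minimal sketch follows; the column order (subject, object, P(subClassOf), P(superClassOf), P(equivalent), P(sibling)) is from my memory of the ptable format, so check it against the boomer documentation, and the IDs and probabilities are purely illustrative.

```python
import csv

# Sketch: turn asserted subClassOf edges into soft ptable rows instead of
# hard axioms. Assumed column order (verify against the boomer docs):
#   subject, object, P(subClassOf), P(superClassOf), P(equivalent), P(sibling)
edges = [
    ("ORPHANET:558", "DOID:4"),      # hypothetical subClassOf edges
    ("MESH:D003924", "NCIT:C3262"),
]

with open("subclass-priors.tsv", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    for child, parent in edges:
        # 0.90 on subClassOf, with small residual mass on the alternatives
        writer.writerow([child, parent, 0.90, 0.02, 0.05, 0.03])
```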
Please feel free to keep us posted - I unfortunately do not have a good solution for you right now.
Thank you very much for your answers; I am starting to understand the strengths and limits of Boomer.
Here is an update:
What do these error messages mean?
I have launched Boomer with these parameters:
What do these parameters mean? Is there a way to get Boomer to produce some output by changing them?
In your answer, you say that Boomer is used as a curation tool by focusing on the low-probability cliques to curate the input mappings.
This is very interesting.
Is there documentation or a wiki that explains the methodology for using Boomer as a curation tool? How can I find the contents of the cliques (the entity IRIs in each clique)? (I only get the clique sizes in the console output.)
Oliver

Log file: log.txt
Hey @OliverHex What you are doing is of great importance to me as well. If you like, add me on LinkedIn or send me an email here: https://github.com/monarch-initiative/pheval/blob/a685b171344cedf0f6ab37962fd8e6da36faa575/pyproject.toml#L7 (just a random place I found where my email was published - GitHub hides these), and we can set up a call to see if we can join forces.
@OliverHex just following up - are you still working on this? Interested to push the envelope a bit together?
Hello,
Sorry for replying so late... Yes, sure, I am interested in further exploring ontology alignment and Bayesian merging! But at the moment I am working on something else. I might switch back to ontology alignment and Bayesian merging in a few weeks. I will keep you updated, thanks for asking!
Oliver
Hello,
I am trying to merge 14 ontologies at once with Boomer: DERMO, DO, HUGO, ICDO, IDO, IEDB, MESH, MFOMD, MPATH, NCIT, OBI, OGMS, ORPHANET, and SCDO.
This is how I proceed:
I have run various tests and it seems that when the ptable is too large, the problem becomes intractable.
By removing MESH and NCIT (i.e., now trying to merge 12 ontologies), the resulting union ontology has only 81K classes (242 MB) and the ptable contains only 7K entries. In this case, Boomer produces a result in 30 min (on an i7 @ 1.90 GHz with 32 GB RAM).
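To see which ontology pairs dominate the ptable size, I count rows per prefix pair with the quick sketch below (it only assumes that the first two ptable columns are CURIEs):

```python
import csv
from collections import Counter

# Count ptable rows per ontology-prefix pair to see which sources
# (e.g. MESH, NCIT) contribute most to the table size.
pairs = Counter()
with open("ptable.tsv") as f:
    for row in csv.reader(f, delimiter="\t"):
        p1 = row[0].split(":")[0]  # CURIE prefix of the subject
        p2 = row[1].split(":")[0]  # CURIE prefix of the object
        pairs[tuple(sorted((p1, p2)))] += 1

for (a, b), n in pairs.most_common(10):
    print(f"{a} - {b}: {n} rows")
```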
But I also need the MESH and NCIT ontologies to be included in my merge result.
Overall, I am wondering whether this is the correct way to proceed.
Here are some questions:
Should I continue with this strategy?
-> Keep trying to merge everything at once, in order to give Boomer complete decision power over selecting the best mappings (without introducing any bias).

Or should I change my merging strategy?
-> Split the problem into smaller sub-problems.
-> Organize the sub-problems in some order (according to some criteria): this could introduce some bias.
-> Launch Boomer following this order.
For example, I could try this:
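Here is a sketch of what I have in mind (only the per-step ptable partitioning is shown; the boomer run at each step happens outside the script, the accepted axioms are carried into the next step's input ontology, and the order in `order` is exactly the arbitrary choice I discuss below):

```python
import csv

# Incremental strategy: fix an (arbitrary!) order and, at each step, keep only
# the ptable rows that link the already-merged prefixes to the next ontology.
order = ["DO", "DERMO", "MPATH", "MFOMD", "SCDO", "IDO", "IEDB",
         "OGMS", "OBI", "HUGO", "ICDO", "ORPHANET", "MESH", "NCIT"]

with open("ptable.tsv") as f:
    rows = list(csv.reader(f, delimiter="\t"))

merged_prefixes = {order[0]}
for step, onto in enumerate(order[1:], start=1):
    allowed = merged_prefixes | {onto}
    step_rows = [
        r for r in rows
        if {r[0].split(":")[0], r[1].split(":")[0]} <= allowed
        and onto in {r[0].split(":")[0], r[1].split(":")[0]}
    ]
    with open(f"ptable-step{step}.tsv", "w", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(step_rows)
    merged_prefixes.add(onto)
    # Run boomer here on ptable-step{step}.tsv plus the union of the merged
    # ontologies, then feed the accepted equivalence axioms into the next step.
```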
So far, this seems to work much faster. But the problem is the arbitrary order of the for-loop, which introduces a bias: each equivalence axiom added at one step will influence Boomer's results in the subsequent steps.
Any suggestions ?
Oliver
PS: I couldn't attach the Boomer input union ontology (~140 MB compressed), since the maximum attachment size is 25 MB. However, the input ptable is here: ptable-91-mappings.zip