cmungall / kboom-paper


Section 4, Reviewer 3 #6

Closed · cmungall closed this issue 8 years ago

cmungall commented 8 years ago

"Our java implementation”: java -> Java

"We are experimenting with different setting, typically T = 16”: where is T coming from and what does it denote. Also, ‘setting' -> ‘settings’. What is a setting.

"Over 65k hypothetical axioms were generated from mappings, of which 11,800 were interpreted as equivalence axioms, allowing the safe merging of multiple duplicate classes across ontologies.” Obviously this is a result, but without a sense of false positives and negatives or some other measure, test, or review of accuracy it’s not clear what is means.

Figure 2: I don’t see what, other than confusing clutter, is gained by giving prior probabilities to more than 2 significant digits, let alone more than 10 decimal places. Also, the figure, per the text, is meant to show an example of “how a module has been resolved”, but it takes a long stare to follow this even partially, and even then much guesswork. I think this is a really important figure, and it deserves more creative attention to better convey what exactly about the method is key to understand. For example, as a reader I come to this with a notion of before and after (the “weaving together”), but how to see that in the figure as it stands is not clear at all.

"When we assign prior probabilities, we assume a low error rate in source-provided mappings, and thus for larger modules these are rarely rejected, even if it leads to an overall more probable structure (due to the greedy selection procedure).” I’m not following the logic of this. The first stumbling point is that apparently larger size biases a module towards accepting source-provided mappings, whereas smaller size biases a module towards rejecting them. Perhaps this is true, but if so, why? Is it because of the discretized contribution of each edge to the joint probability that gives an edge in a small module a disproportionately greater weight than in a large module? Explain. Also, consider supplementing with a correlation or regression plot? And then what is undesirable about rarely rejecting something if it leads to a more probable model?

"to determine if the module is broken into semantically distinct submodules”: ‘is broken’ or ‘can be broken’? If the first, I’m not sure I understand.

"which can sometimes detect incorrect mappings”: why? I don’t think this is obvious at all.

"There are a number of probabilistic approaches to ontology mapping, but most are aimed at generating rather than interpreting mapping.” Is there an overview, or a review, or at the very least some representative example(s) that could be cited? Also, does this approach not relate to ontology alignment, which is conspicuously missing here from the contrasting approaches and citations?

"In order to fully evaluate our method, we plan to compare the disease merging results with other combined disease resources such as MedGen and EFO[4].” Isn’t this somewhat in contradiction to the earlier posited motivating issue that a combined disease ontology is lacking? If not, i.e., if these aren’t comparable combined disease ontologies, what are the authors hoping to get out of such a comparison along the lines of a 'full evaluation of the method'?

"Additionally, we apply on other domains such as anatomy and compare to gold standards such as Uberon.” As is, this statement comes across more as gratuitous than as something the authors truly intend to undertake in the near future. Rather than “additionally”, Uberon would seem to provide one of if not the most suited "gold standard” to which to compare a machine-generated amalgamation of constituent ontologies.

-- Review comments reposted with permission from @hlapp

cmungall commented 8 years ago

All are addressed