INCATools / ontology-development-kit

Bootstrap an OBO Library ontology
http://incatools.github.io/ontology-development-kit/
BSD 3-Clause "New" or "Revised" License

Remove dependency to OWLTools in standard workflows? #622

Open gouttegd opened 2 years ago

gouttegd commented 2 years ago

While most of the standard, ODK-generated workflows use ROBOT, there are still rules where OWLTOOLS is used:

$(SUBSETDIR)/%.owl: $(ONT).owl | $(SUBSETDIR)
        $(OWLTOOLS) $< --extract-ontology-subset --fill-gaps --subset $* -o $@.tmp.owl && mv $@.tmp.owl $@ &&\
        $(ROBOT) annotate --input $@ --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) -o $@.tmp.owl && mv $@.tmp.owl $@

normalize_obo_src: $(SRC)
        $(OWLTOOLS) $< --merge-axiom-annotations -o -f obo $(TMPDIR)/NORM.obo &&\
        $(ROBOT) convert -i $(TMPDIR)/NORM.obo -o $(TMPDIR)/NORM.tmp.obo &&\
        mv $(TMPDIR)/NORM.tmp.obo $(SRC)

Since we want to make it clear that ROBOT is the modern tool and that OWLTOOLS is no longer maintained, we should probably investigate how to replace the use of OWLTOOLS in those rules.

gouttegd commented 1 year ago

In the ODK’s standard workflow, owltools is still used in two places.

Subset generation

Subset ontologies are produced with the owltools --extract-ontology-subset command:

$(SUBSETDIR)/%.owl: $(ONT).owl | $(SUBSETDIR)
        $(OWLTOOLS) $< --extract-ontology-subset --fill-gaps --subset $* -o $@.tmp.owl && mv $@.tmp.owl $@ &&\
        $(ROBOT) annotate --input $@ --ontology-iri $(ONTBASE)/$@ $(ANNOTATE_ONTOLOGY_VERSION) -o $@.tmp.owl && mv $@.tmp.owl $@

The logic of the --extract-ontology-subset command mostly resides in the makeMinimalSubsetOntology method of owltools’ owltools.mooncat.Mooncat class. In short, it extracts a coherent, minimal subset ontology around the terms carrying a given oboInOwl:inSubset annotation.

It does not seem that this command can easily be replaced by other existing tools. I tried two approaches:

Normalizing the source file

Owltools’ --merge-axiom-annotations command is used when normalizing an OBO-format source file:

normalize_obo_src: $(SRC)
    $(OWLTOOLS) $< --merge-axiom-annotations -o -f obo $(TMPDIR)/NORM.obo &&\
    $(ROBOT) convert -i $(TMPDIR)/NORM.obo -o $(TMPDIR)/NORM.tmp.obo &&\
    mv $(TMPDIR)/NORM.tmp.obo $(SRC)

The --merge-axiom-annotations command merges axioms that are logically equivalent while making sure that all the annotations on the original axioms are kept (see https://github.com/owlcollab/owltools/blob/master/OWLTools-Runner/src/main/java/owltools/cli/CommandRunner.java#L1349).
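As a rough illustration of the idea described above, here is a minimal Python sketch of merging axioms while keeping all their annotations. It is not owltools’ implementation: it assumes that "logically equivalent" can be approximated by "same logical part once annotations are ignored", and the term IDs, property name, and annotation values are made up for the example.

```python
from collections import OrderedDict


def merge_axiom_annotations(axioms):
    """Merge axioms sharing the same logical part, keeping every annotation.

    Each axiom is a (logical_part, annotations) pair, where logical_part is a
    hashable value and annotations is an iterable of (property, value) pairs.
    Assumption: two axioms are treated as equivalent when their logical parts
    are identical; real OWL equivalence is broader than that.
    """
    merged = OrderedDict()
    for logical_part, annotations in axioms:
        bucket = merged.setdefault(logical_part, [])
        for annotation in annotations:
            if annotation not in bucket:  # drop exact duplicate annotations
                bucket.append(annotation)
    return [(logical, tuple(annots)) for logical, annots in merged.items()]


# Hypothetical input: the same SubClassOf axiom asserted twice with different
# annotations, plus one unrelated axiom without annotations.
axioms = [
    (("SubClassOf", "EX:A", "EX:B"), [("ex:source", "ref:1")]),
    (("SubClassOf", "EX:A", "EX:B"), [("ex:source", "ref:2")]),
    (("SubClassOf", "EX:C", "EX:B"), []),
]
result = merge_axiom_annotations(axioms)
```

After the merge, the duplicated axiom appears once and carries both annotations, which is the property a replacement for --merge-axiom-annotations would have to preserve.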

Again, it does not look like something that can easily be replicated by other tools, AFAIK.

The bottom line is that while owltools is now only used for 2 things (at least in the standard workflow; in Uberon’s very much non-standard workflow, it is also used to generate the composite-* products), for those things it appears irreplaceable, at least without a significant effort.

The question becomes, do we want to get rid of owltools enough to make that significant effort?

gouttegd commented 1 year ago

@matentzn An opinion as to whether removing owltools from the standard workflow would be worth the effort that it would require?

balhoff commented 1 year ago

This robot PR is relevant to the subset issue: https://github.com/ontodev/robot/pull/1000 Not sure if it will be an exact replacement of the owltools functionality.

gouttegd commented 1 year ago

I didn’t know some efforts were already under way. Thanks, that’s good to know!

gouttegd commented 1 year ago

I’ve tested ROBOT’s new extraction method (extract --method subset --term-file ...). It yields results that are similar to owltools --extract-ontology-subset, but only when the owltools command is called without the --fill-gaps option.

I have found no way of reproducing with ROBOT the same kind of subsets produced by owltools --extract-ontology-subset --fill-gaps.

matentzn commented 1 year ago

The --fill-gaps option is very crucial for this command - this was what the whole business with the ROBOT subset command was all about; can you characterise how the two approaches appear to differ?

gouttegd commented 1 year ago

Sorry, no. The approach used by OWLTools is implemented here: https://github.com/owlcollab/owltools/blob/9faa4f42b761839a26e8c8096cd24044e2bdcfc7/OWLTools-Core/src/main/java/owltools/mooncat/Mooncat.java#L832

If you believe I can tell how it differs from the approach used in ROBOT (https://github.com/ontodev/robot/blob/2345420d04ab29b1d7087f22e3a666295ece6002/robot-core/src/main/java/org/obolibrary/robot/ExtractOperation.java#L235), or whether it boils down to the same algorithm as proposed by Chris Mungall (https://github.com/ontodev/robot/issues/497#issuecomment-975873714), well, I appreciate your confidence in my understanding of graph theory, but I’m afraid that confidence is severely misplaced.

gouttegd commented 1 year ago

A few observations, though.

Here is how I use the “extract subset” method, based on how I understood it was supposed to be used (I did my tests with the BDS_subset of CL):

$ robot filter --input cl.owl --prefix 'cl: http://purl.obolibrary.org/obo/cl#' --select 'oboInOwl:inSubset=cl:BDS_subset' export --header ID --export bds_subset_terms.txt
$ robot extract --input cl.owl --method subset --term-file bds_subset_terms.txt --output bds_subset.owl

The first command merely extracts a list of all the terms annotated as being in the subset, while the second does the actual subset extraction.

But this command produces almost exactly the same subset as the following simple filter command:

$ robot filter --input cl.owl --prefix 'cl: http://purl.obolibrary.org/obo/cl#' --select annotations --select 'oboInOwl:inSubset=cl:BDS_subset' -o bds_subset.owl

So either

Of note, according to Chris Mungall the subset command was supposed to be merely a shorthand for

filter --preserve-structure true --use-all-relations true --select annotations --select "oboInOwl:inSubset=subset:$SUBSET"

except that, unless again I am missing something, there is no such option as --use-all-relations. And I note that the “subsets” generated by the two sets of commands above have in common that they contain absolutely no relations at all (no object properties), which I suspect might be an important clue (to me it looks like the subset extraction method is actually ignoring relations when it does its trick).

gouttegd commented 1 year ago

Even if I forcefully include the relationships I want in the subset (ROBOT’s documentation for extract seems to suggest this is needed, i.e. ROBOT will not automatically include the relations), the extracted subset will then contain the definitions of the object properties but they will not be used at all (none of the classes in the extracted subset will have any relations).

matentzn commented 1 year ago

> I appreciate your confidence in my understanding of graph theory, but I’m afraid that confidence is severely misplaced.

Haha, sorry, I should have been clearer. While I am confident that you could, with a bit of time, characterise the algorithms, what I really meant to say is "describe the difference in the output at a high level", e.g. one has 1000 fewer is_a relations than the other and 10K more part_of, or some such, which is what you proceeded to do afterwards! Thank you!

I am not concerned: I think the subset stuff we have in ROBOT now is superior to OWLTools and we should just retire it, and see who screams.

gouttegd commented 1 year ago

> what I really meant to say is "describe the difference in the output at a high level", e.g. one has 1000 fewer is_a relations than the other and 10K more part_of, or some such

Well, you can have a look for yourself.

Here is the subset generated by owltools --extract-ontology-subset: bds_subset_owltools_nofillgaps.owl.txt

It contains precisely the 65 classes defined in the subset. It also contains almost all the object properties from the original ontology, but they are not used (none of the 65 classes in the subset has any relation to anything).

Here is the subset generated by owltools --extract-ontology-subset --fill-gaps: bds_subset_owltools_fillgaps.owl.txt

It contains 669 classes (including the 65 from the subset itself). It contains the same object properties as the previous one, but here they are used. All 669 classes have their full set of relations.

Here is the subset generated by robot extract --method subset --term-file subset.txt (where subset.txt contains the list of terms defined in the subset, obtained by a previous filter --select 'oboInOwl:inSubset=cl:BDS_subset' command): bds_subset_robot_extract_subset.owl.txt

It contains only the 65 classes of the subset itself. They have no relations (the object properties themselves are absent from the subset).

If I explicitly add the relations to the term-file argument (which I believe is necessary because of the change discussed here), this is the generated subset: bds_subset_robot_extract_subset_with_relations.owl.txt

It still contains only the 65 classes of the subset itself. The object properties are present in the output, but they are not used.
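For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch that counts named classes, declared object properties, and object properties actually used in restrictions. It assumes the files are serialized as RDF/XML and only looks at owl:onProperty references, so it is a rough check rather than a full OWL analysis. The sample document and its example.org IRIs are made up for the illustration.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def summarize(rdfxml_text):
    """Count named classes and declared/used object properties in RDF/XML."""
    root = ET.fromstring(rdfxml_text.strip())
    about = "{%s}about" % RDF
    resource = "{%s}resource" % RDF
    # Anonymous class expressions have no rdf:about and are discarded.
    classes = {el.get(about) for el in root.iter("{%s}Class" % OWL)} - {None}
    declared = {el.get(about) for el in root.iter("{%s}ObjectProperty" % OWL)} - {None}
    # A property counts as "used" when some restriction points at it.
    used = {el.get(resource) for el in root.iter("{%s}onProperty" % OWL)} - {None}
    return {
        "classes": len(classes),
        "declared_properties": len(declared),
        "used_properties": len(used),
    }


# Hypothetical two-class ontology with one relation that is actually used.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:ObjectProperty rdf:about="http://example.org/part_of"/>
  <owl:Class rdf:about="http://example.org/A">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="http://example.org/part_of"/>
        <owl:someValuesFrom rdf:resource="http://example.org/B"/>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/B"/>
</rdf:RDF>
"""
summary = summarize(SAMPLE)
```

A subset where the properties are declared but unused would show a non-zero declared_properties with used_properties at zero, which matches the pattern described for the outputs without gap filling.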

gouttegd commented 1 year ago

Incidentally, the ROBOT version is horrendously slower than the OWLTOOLS version: on CL, robot extract --method subset takes more than 3 minutes (~195 seconds) while owltools --extract-ontology-subset takes ~15 seconds.

gouttegd commented 1 year ago

> I think the subset stuff we have in ROBOT now is superior to OWLTools and we should just retire it

What happened to “the --fill-gaps option is very crucial for this command”? I found no way of doing any kind of “gap filling” with ROBOT. As you can see in the examples above, the subset extracted by ROBOT always only contains the very terms marked with the inSubset annotation, and nothing more.

If there is a way to do with ROBOT what is done with owltools --extract-ontology-subset --fill-gaps, I would very much like to know it – that’s kind of what this entire ticket is about!

gouttegd commented 1 year ago

I wonder if there has been a misunderstanding between “gap filling” and “gap spanning”. All the discussion in the ROBOT ticket about the requested new subset command seems to be about “gap spanning” (ensuring relations are preserved in the subset, even if they are “indirect” relations that involve some intermediate classes that are not in the subset).

The “gap filling” done by owltools --extract-ontology-subset --fill-gaps is about including intermediate classes in the subset, something that seemingly has never been proposed as a goal of ROBOT’s new subset command.
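To make the distinction concrete, here is a toy Python sketch of the two behaviours on a small parent graph. It is only an illustration under simplifying assumptions: it follows a single kind of parent edge, it defines "intermediate" as a class lying on a path between two subset terms, and it does not reproduce the actual owltools or ROBOT algorithms. The class names are placeholders.

```python
def ancestors(term, parents):
    """Return every class reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def fill_gaps(subset, parents):
    """Gap filling: also keep intermediates between two subset terms."""
    subset = set(subset)
    kept = set(subset)
    for term in subset:
        for anc in ancestors(term, parents):
            # Keep an ancestor if it is, or leads up to, a subset term.
            if anc in subset or ancestors(anc, parents) & subset:
                kept.add(anc)
    edges = {(c, p) for c in kept for p in parents.get(c, ()) if p in kept}
    return kept, edges


def span_gaps(subset, parents):
    """Gap spanning: keep only subset terms, linked to nearest subset ancestors."""
    subset = set(subset)
    edges = set()
    for term in subset:
        stack = list(parents.get(term, ()))
        seen = set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in subset:
                edges.add((term, current))  # stop at the nearest subset term
            else:
                stack.extend(parents.get(current, ()))
    return subset, edges


# Hypothetical hierarchy A -> B -> C -> D -> E, plus X -> C; subset is {A, D}.
parents = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"], "X": ["C"]}
filled = fill_gaps({"A", "D"}, parents)
spanned = span_gaps({"A", "D"}, parents)
```

With gap filling, B and C are pulled into the result so that A reaches D through the original edges. With gap spanning, only A and D remain and a direct A to D edge stands in for the missing intermediates.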

balhoff commented 1 year ago

> Incidentally, the ROBOT version is horrendously slower than the OWLTOOLS version: on CL, robot extract --method subset takes more than 3 minutes (~195 seconds) while owltools --extract-ontology-subset takes ~15 seconds.

It's a completely different algorithm; it's running relation-graph internally which is pretty intensive (but logically complete).

cmungall commented 1 year ago

I made a separate issue for the gap-filling (include intermediates) option:

If this is implemented AND we are satisfied with efficiency THEN I believe we can remove owltools

Regarding efficiency, it wasn't clear to me whether the comments about robot subset referred to running RG (relation-graph) with a property subset or with all properties. Note that even if this is addressed, there is still room for a more efficient operation that uses HOP over ENTAILMENT. See OAK docs for an explanation of this: https://incatools.github.io/ontology-access-kit/guide/relationships-and-graphs.html#graph-traversal-strategies