OBOFoundry / OBOFoundry.github.io

Metadata and website for the Open Bio Ontologies Foundry Ontology Registry
http://obofoundry.org

Provide a means of diffing between ontology releases and PRs in github #500

Closed cmungall closed 6 years ago

cmungall commented 6 years ago

Apologies for the somewhat all-encompassing ticket, not sure where the best place is.

We have two similar problems

  1. We would like to see visually informative diffs between ontology releases, either dynamically or as part of release notes (or both). This includes high-level summaries, as well as the ability to drill down to per-axiom changes, grouped by OWL class (i.e. frame-oriented).
  2. It would be useful to have a web service or command line tool to see a meaningful diff for any git commit or GitHub PR. Even when functional or Manchester syntax is used for the source, interpreting diffs is very hard (see this post). Many groups still keep their source in .obo for this reason.

We have a variety of approaches at the moment. An incomplete summary:

Uberon uses the deprecated owljs differ to generate markdown for each release, example (css not working). The markdown is then hand-edited and mixed with github commit logs.

@pbuttigieg wins the award for best ontology release notes for ENVO, example. Some automation used here?

OLS maintains copies of previous releases, and provides extremely nice visuals summarizing diffs at a high level, e.g. https://www.ebi.ac.uk/ols/ontologies/uberon

Doesn't seem possible to drill down but I bet that's on the cards.

The QuickGO change log feature by @TonySawfordEBI (example - scroll to bottom) is popular with GO editors. This is per-class.

Many of us use git-owl-tools to get nice colorful meaningful labelified diffs on the command line. This combines an owl diffing tool plus some logic for integrating with git. Unfortunately no web version.

For plain diffing of two OWL files, there are a variety of solutions: robot diff (seems slow?), owljs-diff (deprecated), bubastis. These produce a variety of outputs. E.g. owljs makes markdown, robot makes plaintext, and bubastis makes XML that can be transformed with XSLT.

Protege has...?

Approach

I'm imagining loose coordination at different levels.

A core JVM library for diffs that we can all use. Getting axiom diffs is fairly trivial. Additional logic for summarizing operations (e.g. recognizing "moves", "obsoletions") could be useful, as would core logic for aggregate statistics, e.g. new classes. This could all potentially live in the owlapi.
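To make the layering concrete, here is a minimal sketch of the three pieces described above: a plain axiom set difference, a frame-oriented grouping, and a summary that recognizes obsoletions and "moves". It is written in Python rather than on the JVM purely for brevity, and the `Axiom` tuple, the `"deprecated"` marker and the definition of a move (one class both loses and gains a SubClassOf axiom) are simplifying assumptions, not anything the OWL API provides.

```python
from collections import defaultdict
from typing import NamedTuple


class Axiom(NamedTuple):
    """Hypothetical, deliberately simplified axiom: (subject, kind, value)."""
    subject: str
    kind: str
    value: str


def diff_axioms(left, right):
    """Return (removed, added) axioms going from the left set to the right set."""
    return left - right, right - left


def group_by_subject(axioms):
    """Frame-oriented view: bucket axioms by the class they describe."""
    frames = defaultdict(list)
    for ax in sorted(axioms):
        frames[ax.subject].append(ax)
    return dict(frames)


def summarize(left, right):
    """Aggregate statistics plus two composite operations."""
    removed, added = diff_axioms(left, right)
    left_classes = {ax.subject for ax in left}
    right_classes = {ax.subject for ax in right}
    # Assumption: an obsoletion shows up as a newly added deprecated=true axiom.
    obsoleted = {ax.subject for ax in added
                 if ax.kind == "deprecated" and ax.value == "true"}
    # Assumption: a "move" is a class that loses one parent and gains another.
    lost_parent = {ax.subject for ax in removed if ax.kind == "SubClassOf"}
    gained_parent = {ax.subject for ax in added if ax.kind == "SubClassOf"}
    return {
        "new_classes": sorted(right_classes - left_classes),
        "obsoleted": sorted(obsoleted),
        "moved": sorted(lost_parent & gained_parent),
    }
```

A real implementation would work on OWL API axiom objects and would need group-specific rules for what counts as a merge or obsoletion.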

Agreement on some kind of standard output of diffing tools - markdown, javascript?

A lightweight web portal where you can paste in a git commit id or PR URL and have reasonably attractive HTML rendered (might there be some way to integrate this into github as a kind of hook?).

A command line tool that will generate markdown for the diff between a pre-release and the last release, so ontology maintainers can create attractive release notes, all following a similar style (see uberon and envo release notes above). Ideally this would be interwoven with something that mines the git commit logs, so we can see that in a commit that closes #123, an axiom was moved from class A to class B...
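As a rough illustration of the markdown step, the sketch below renders an already-computed diff as per-class release notes. The function name, the input shape (class ID mapped to a list of axiom strings) and the heading layout are all assumptions made for the example; it does not reproduce the Uberon or ENVO release note style, and it does not attempt the git log mining.

```python
def render_release_notes(version, added, removed, labels=None):
    """Render a diff as markdown, one section per class.

    added / removed: dict mapping class ID -> list of axiom strings.
    labels: optional dict mapping class ID -> human-readable label.
    """
    labels = labels or {}
    n_added = sum(len(v) for v in added.values())
    n_removed = sum(len(v) for v in removed.values())
    lines = [
        f"## {version}",
        "",
        f"Axioms added: {n_added}; axioms removed: {n_removed}.",
        "",
    ]
    for cls in sorted(set(added) | set(removed)):
        lines.append(f"### {labels.get(cls, cls)} ({cls})")
        lines.extend(f"- Added: `{ax}`" for ax in added.get(cls, []))
        lines.extend(f"- Removed: `{ax}`" for ax in removed.get(cls, []))
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```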

Also: web portal where you can provide any two versionIRIs for an ontology and see the HTML.

@simonjupp Does it make sense to subsume some of this into OLS? Can parts of the OLS code be used, or does that require the full neo4j infrastructure? This would seem overweight for some of the use cases like making release notes for an ontology.

simonjupp commented 6 years ago

Our OLS diff tool isn't that portable at the moment and is kind of separate from our core OLS backend. It came out of an EU project we were involved in and uses SPARQL to identify changes between two RDF graphs. It is limited to fairly simple changes that we can detect at the triple level (which, to be fair, covers most of the important ones people care about). Logical diffs would be harder.

Bubastis is OWL API based and provides axiom-level diffs. It's getting a bit old now but has the basics for a lot of what we need. The problem for OLS was that it didn't scale very well for big ontologies.

I was planning to look into how diff works in robot. I wonder if it is slow for the same reasons as bubastis?

jamesaoverton commented 6 years ago

Yes, robot diff uses OWLAPI to get sets of axioms, then prints the differing axioms as strings, replacing IRIs with labels for human-readability. The string rendering could probably be fast (even if the current implementation is not) and support whatever output format, but OWLAPI is still a constraint. Our larger ontologies require a lot of time and space to load using OWLAPI, and loading two versions will take twice the resources.
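The approach described here (set difference, then string rendering with labels substituted for IRIs) can be sketched in a few lines. This is only an illustration of the idea, not ROBOT's code or output format: axioms are assumed to already be functional-syntax strings, and `labelify` and `render_diff` are hypothetical helper names.

```python
import re

# Matches an IRI written in angle brackets, as in functional syntax.
IRI_PATTERN = re.compile(r"<([^<>\s]+)>")


def labelify(axiom, labels):
    """Replace each <IRI> that has a known label with the quoted label."""
    def swap(match):
        label = labels.get(match.group(1))
        return f"'{label}'" if label is not None else match.group(0)
    return IRI_PATTERN.sub(swap, axiom)


def render_diff(left, right, labels):
    """Return '-' lines for axioms only on the left, '+' lines for the right."""
    lines = [f"- {labelify(ax, labels)}" for ax in sorted(left - right)]
    lines += [f"+ {labelify(ax, labels)}" for ax in sorted(right - left)]
    return lines
```

The rendering itself is linear in the number of differing axioms, which is consistent with the point above that loading both ontologies, not string rendering, is the real cost.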

It sounds like Chris is assuming OWLAPI and asking for various improvements.

If you don't want to use OWLAPI, I can see a few approaches.

SPARQL is fine until you want to compare blank nodes, and OWL uses lots of blank nodes. Fortunately the axiom representations using blank nodes form trees, which are easier to compare than an arbitrary graph, and the tree shape is visible in the XML.
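A small sketch of why the tree shape helps: a naive triple-level comparison treats two identical restrictions as different when their blank node IDs differ, but expanding each blank node into a nested structure of its outgoing edges makes them comparable. Triples are assumed to be plain string tuples with blank nodes prefixed `_:`, and the blank-node structures are assumed to be acyclic trees, as described above; the function names are illustrative.

```python
def is_bnode(term):
    """Blank nodes are assumed to be written as '_:id'."""
    return term.startswith("_:")


def expand(node, triples):
    """Replace a blank node by an order-independent tree of its outgoing edges.

    Assumes the blank-node structure is a tree (no cycles).
    """
    if not is_bnode(node):
        return node
    return frozenset(
        (p, expand(o, triples)) for s, p, o in triples if s == node
    )


def canonical(triples):
    """Triples with named subjects, with blank-node objects expanded in place."""
    return {
        (s, p, expand(o, triples)) for s, p, o in triples if not is_bnode(s)
    }


def diff_graphs(left, right):
    """Return (removed, added) after blank-node expansion."""
    a, b = canonical(left), canonical(right)
    return a - b, b - a
```

This quadratic lookup is fine for a demonstration; anything ontology-sized would want an index from subject to triples.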

If we have two OWLXML files ('left' and 'right') that were both serialized by a recent OWLAPI, then the structure is pretty predictable. It should be possible to do a more semantic diff (without OWLAPI) by aligning by Class/ObjectProperty/etc. and comparing the top-level XML element on the left with the one on the right. For the ones that differ, you'd then render the XML to whatever output format. I think this could be fast, but the implementation might be tricky, and it won't work for other serialization formats.
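A minimal sketch of the alignment idea, under the assumption that each entity appears as a top-level element carrying an `rdf:about` attribute (as in an RDF/XML serialization) and that both files came from the same serializer so child order is stable. The helper names are made up for the example, and entities split across several top-level elements are not handled.

```python
import xml.etree.ElementTree as ET

RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"


def signature(elem):
    """Structural fingerprint of an element: tag, attributes, text, children."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(signature(child) for child in elem),
    )


def index_entities(xml_text):
    """Map each top-level entity IRI to the fingerprint of its element.

    Assumption: one top-level element per IRI; later duplicates overwrite.
    """
    root = ET.fromstring(xml_text)
    return {
        child.get(RDF_ABOUT): signature(child)
        for child in root
        if child.get(RDF_ABOUT)
    }


def changed_entities(left_xml, right_xml):
    """Align by IRI and report which entities were added, removed or changed."""
    left, right = index_entities(left_xml), index_entities(right_xml)
    return {
        "added": sorted(right.keys() - left.keys()),
        "removed": sorted(left.keys() - right.keys()),
        "changed": sorted(
            iri for iri in left.keys() & right.keys() if left[iri] != right[iri]
        ),
    }
```

Only the entities listed under "changed" would then need the more expensive rendering step.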

I'll also mention that I'm still working on a human-readable text format for RDF/OWL. I presented an earlier version at ICBO 2016 (https://github.com/ontodev/howl), and I'm using a newer version in a production system (https://github.com/IEDB/ONTIE/blob/master/ontology/ontie.kn) without much logic, but I'm in the middle of a rewrite.

balhoff commented 6 years ago

I hacked together a little thing: https://stars-app.renci.org/owldiff/

E.g.: https://stars-app.renci.org/owldiff/diff?right=https://github.com/obophenotype/human-phenotype-ontology/raw/eb792f57e60a1a8209834048ca326e55f8fbaa4a/src/ontology/hp-edit.owl&left=https://github.com/obophenotype/human-phenotype-ontology/raw/master/src/ontology/hp-edit.owl&loadimports=true

Code is here: https://github.com/balhoff/owl-diff
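For scripting against the service, a diff URL can be assembled from the three query parameters visible in the example above (`left`, `right`, `loadimports`). This sketch only builds the URL; it assumes the endpoint is still deployed at that address and that it accepts percent-encoded parameter values, neither of which is confirmed here.

```python
from urllib.parse import urlencode

# Endpoint taken from the example URL above; availability is an assumption.
OWLDIFF_ENDPOINT = "https://stars-app.renci.org/owldiff/diff"


def owldiff_url(left, right, load_imports=True):
    """Build a diff URL for two ontology file URLs."""
    query = urlencode({
        "left": left,
        "right": right,
        "loadimports": str(load_imports).lower(),
    })
    return f"{OWLDIFF_ENDPOINT}?{query}"
```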

cmungall commented 6 years ago

I think this is a pretty good solution for now!

jonquet commented 6 years ago

Missed this in October. Just CC'ing @graybeal @jvendetti @mdorf for information. Hopefully we can do something too in BioPortal and AgroPortal.

graybeal commented 6 years ago

Agree that this is (a) great to have, (b) the right direction. An ideal code set would be able to run as a service wherever deployed (including as a web service, as you've shown here), respond to REST API requests (check!), and provide different views on request (this being one, something graphical being another). So it really seems like a great start.

balhoff commented 6 years ago

@graybeal the service I wrote just generates HTML at the moment, but it would be pretty simple to tweak it to output JSON to support alternative views. Let me know if there is any interest in further development.

jvendetti commented 6 years ago

There wasn't any mention in this issue of what BioPortal currently does, so I'm adding it here for clarification. Like OLS, we maintain copies of all previous releases. Each time a new release is provided, we generate a plain diff using Bubastis. These diffs are available for download (in XML format) from the "submissions" table in the BioPortal UI. Screenshot of UBERON submissions table:

[Screenshot: UBERON submissions table]

We lack a graphical representation of these generated diffs and there's no access provided via the REST interface.

On the subject of 3rd party diff tools - Rafael Goncalves (current member of the Protege team) wrote an OWL API based diff tool called Ecco. It provided a web application and a command line tool, both of which could generate XML change set files and transformations into HTML. I asked him about it just now and he said it's not really production quality, and there hasn't been any funding to maintain it since it was initially developed in 2012.

cmungall commented 6 years ago

I assume the json structure will be a list of diff objects, where the diff object is something like a left and right axiom. Any thoughts on how this should be represented? A generic translation of something like functional syntax?
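One possible shape for such a list, sketched under the assumption that each axiom is carried as a functional-syntax string: every diff object has a `left` and a `right` slot, with `null` on the side where the axiom is absent. The field names are illustrative and not an agreed format.

```python
import json


def diff_to_json(left_axioms, right_axioms):
    """Serialize an axiom diff as a JSON list of {left, right} objects."""
    left, right = set(left_axioms), set(right_axioms)
    records = [{"left": ax, "right": None} for ax in sorted(left - right)]
    records += [{"left": None, "right": ax} for ax in sorted(right - left)]
    return json.dumps(records, indent=2)
```

A structured (non-string) translation of functional syntax would make the objects easier to query, at the cost of a larger schema to agree on.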

obojson is frame-oriented so not really suited here. But it might be useful to have the option to abstract composite change operations to higher level operations more suited to what users expect, e.g.

graybeal commented 6 years ago

Gee, and it would be super-slick if it could be smart enough to detect global changes as a single composite change. But are these abstractions realistic? It took Word a long time to say something about a move...

In any case, having a tag in each diff object indicating the nature of the difference would be very useful. Could imagine English-language presentations from that.
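A short sketch of how such a tag could drive an English-language presentation. The tag vocabulary (`added`, `removed`, `changed`), the `left`/`right` field names and the sentence templates are all hypothetical.

```python
# Hypothetical sentence templates keyed by tag.
TEMPLATES = {
    "added": "Added axiom: {right}",
    "removed": "Removed axiom: {left}",
    "changed": "Changed axiom from {left} to {right}",
}


def tag_diff(record):
    """Return a copy of the diff object with a 'tag' naming the difference."""
    if record.get("left") is None:
        tag = "added"
    elif record.get("right") is None:
        tag = "removed"
    else:
        tag = "changed"
    return {**record, "tag": tag}


def describe(record):
    """Render a tagged diff object as an English sentence."""
    tagged = record if "tag" in record else tag_diff(record)
    return TEMPLATES[tagged["tag"]].format(**tagged)
```

Composite operations such as a merge or a move would simply be further tags with their own templates, once a tool can detect them.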

cmungall commented 6 years ago

One of the challenges is that some of this blends into individual groups' policies. E.g. for OBO there is a specific sequence of ops that constitutes a merge. Matt is building some of this in a configurable way into Protege, so some of this may become standard.

Since we're on wishlists, our users really loved Amelia's diffs of text definitions, e.g. https://groups.google.com/forum/#!topic/obo-diffs/AuAyueSgLSE (it's actually colored if you get the email directly, v pretty)

Unfortunately the code is Perl, but it is presumably not hard to port.
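As an indication of how small a port could be, here is a word-level diff of two text definitions using Python's standard `difflib`. This is not a translation of the Perl code, which is not shown in this thread; the `[-...-]` and `{+...+}` markers stand in for the coloring and are an arbitrary choice.

```python
import difflib


def word_diff(old, new):
    """Mark removed words as [-...-] and inserted words as {+...+}."""
    a, b = old.split(), new.split()
    out = []
    matcher = difflib.SequenceMatcher(a=a, b=b)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a[i1:i2])
        if op in ("delete", "replace"):
            out.append("[-" + " ".join(a[i1:i2]) + "-]")
        if op in ("insert", "replace"):
            out.append("{+" + " ".join(b[j1:j2]) + "+}")
    return " ".join(out)
```

For an HTML email, the two markers could be swapped for styled `<del>` and `<ins>` elements.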