INCATools / kgcl

Datamodel for KGCL (Knowledge Graph Change Language)
https://w3id.org/kgcl/
MIT License

How to deal with proposed obsoletions (or proposed changes) overall #49

Open cmungall opened 8 months ago

cmungall commented 8 months ago

Currently KGCL has a simple data model where each type represents a change or set of changes. A change object can be thought of as a proposition, and can have metadata attached to it.

In some cases, ontologies may want to represent the proposition directly in the ontology without fully enacting the change. This is most prominent in the case of obsoletion, where we may want metadata about a proposed obsoletion to live in the ontology for a period such as a month or two, during which it is queryable. However, we can imagine this for any change type.

An additional challenge here is that the mechanism for representing propositions in an ontology is not as standardized as, for example, the one for obsoletion (which is itself not as standardized as it could be). E.g., in Mondo, things may go into a Mondo-specific "obsoletion_subset".

Some options here:

One is to discourage the notion of storing propositions in the ontology. If you want to query for propositions (such as proposed obsoletions), then query the GitHub issue and PR repository. There is a clear separation of concerns here: the ontology represents the current state of things, and we use infrastructure intended for propositions to store propositions. However, this is not an ideal solution, e.g. do we expect all ontology browsers to implement some complex ingest mechanism?

Another is to create a collection of shadow classes, e.g. ProposedObsoletion, ProposedNodeMove, ... This is fairly awkward though.

Another option is to add a flag to all classes such as "partial: bool". The actual changes applied by an apply agent would vary depending on the setting of this flag. We can even imagine having maturity levels etc.

The simplest option might be that if something is a proposition we simply insert the change object as KGCL triples into the ontology. The ontology simply stores its own change. This may encounter resistance as people might like continuing to use familiar mechanisms such as oio:subset, IAO IDs etc.

What will probably sit best with existing ontologies is if there is a way to customize how an apply command works on an ontology-specific basis, perhaps making "partial" a parameter on the application function.

gouttegd commented 8 months ago

The simplest option might be that if something is a proposition we simply insert the change object as KGCL triples into the ontology.

Trying to figure out how that would look like… Do you already have a clear idea in mind?

Would it be under the form of annotations on the class concerned by the change?

For example:

obsolete UBERON:1111111 with alternative UBERON:2222222, UBERON:3333333

would yield an annotation like this:

AnnotationAssertion(
  Annotation(kgcl:has_nondirect_replacement UBERON:2222222)
  Annotation(kgcl:has_nondirect_replacement UBERON:3333333)
  kgcl:NodeObsoletion UBERON:1111111 ????)

?

gouttegd commented 8 months ago

Syntax wise: If we want to allow changes to be “proposed“ (rather than acted upon directly) on a case-by-case basis (e.g. we want to allow people to describe both changes that must be implemented directly and changes that are merely “proposed“), we could have a maybe keyword in front of all KGCL commands?

For example:

# a change to be implemented directly
obsolete UBERON:11111111
# a change that is proposed
maybe obsolete UBERON:2222222

?
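
To make the suggestion concrete, here is a minimal Python sketch of how a reader of such a file might separate direct changes from proposed ones. Note that `maybe` is only the keyword floated above and is not part of the KGCL grammar; `ParsedCommand` and `parse_lines` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass
class ParsedCommand:
    text: str        # the KGCL command without the leading keyword
    proposed: bool   # True when the line started with "maybe"


def parse_lines(source: str) -> list[ParsedCommand]:
    """Split a change file into commands, noting which ones are only proposed."""
    commands = []
    for raw in source.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        keyword, _, rest = line.partition(" ")
        if keyword == "maybe" and rest:
            # hypothetical "maybe" prefix: keep the command, flag it as proposed
            commands.append(ParsedCommand(rest.strip(), proposed=True))
        else:
            commands.append(ParsedCommand(line, proposed=False))
    return commands


example = """\
# a change to be implemented directly
obsolete UBERON:11111111
# a change that is proposed
maybe obsolete UBERON:2222222
"""
commands = parse_lines(example)
```

The command text itself is left unparsed here; a real implementation would hand it to the KGCL parser and carry the `proposed` flag alongside the resulting change object.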

cmungall commented 8 months ago

Trying to figure out how that would look like… Do you already have a clear idea in mind?

probably not!

The two options are

  1. as string encoded KGCL syntax/DSL or json serialization (a bit ugh)
  2. triples exactly as you clearly illustrated

I was thinking more like 2, but there are all kinds of issues here. We are "polluting" the ontology with individuals. We have implicit mapping of rdf predicates to APs and all the dangers that entails. It's quite unnatural from the POV of what we normally do in OBO ontologies. But it's very generic and can be used for any change type....

If we want to allow changes to be “proposed“ (rather than acted upon directly) on a case-by-case basis... we could have a maybe keyword in front of all KGCL commands?

I think this makes sense, but often the "maybe" will be implicit. E.g. the way I think the mondo editors want to work is that it's always a maybe, and there is an explicit state transition from maybe to actual

gouttegd commented 8 months ago

  1. as string encoded KGCL syntax/DSL or json serialization (a bit ugh)

It would be very easy to add information about proposed changes in that form, but if the information is intended to be queryable (e.g. someone wants to know which terms are slated for obsoletion), that may not be the most practical. One would need to first extract the annotation (e.g. with SPARQL) and then to parse the KGCL to figure out what the change is (is it an obsoletion or any other kind of change).

If the idea is already set to store the information directly into the ontology/KG, I’d be inclined to go all the way and store it in a “native” form that can be manipulated/queried as any other contents in the ontology.

  2. triples exactly as you clearly illustrated

So a change about an existing class is rendered as an annotation on the class. Likewise, a change about a relationship (e.g. kgcl:PredicateChange) can be rendered as an annotation on the axiom representing the relationship to be modified.

But I am unsure about a change that proposes to create a new class. Where would we store that (what would we be annotating)? Ideas:

  1. Ontology-level annotations?
  2. Have a “dummy” class in the ontology that represents nothing but is specifically intended to carry the annotations about proposed new classes?
  3. Actually create the proposed new class, and then annotate it with the KGCL annotations (which would serve to mark the class as being “provisional”)?

the way I think the mondo editors want to work is that it's always a maybe, and there is an explicit state transition from maybe to actual

OK. I imagine something as follows:

  1. All KGCL changes are rendered as annotations in the ontology. Each change contains, among other metadata, the date when the change was proposed.
  2. If the editors later decide to reject a change, all they have to do is to remove the corresponding annotation.
  3. Periodically (e.g. every month), we run an apply-pending command that iterates over the KGCL annotations in the ontology, and actually applies any change that has been proposed more than X weeks ago (changes proposed less than X weeks ago are left as they are until the next run of the apply-pending command, or until they are manually removed).
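
The periodic step in item 3 could be sketched roughly as follows in Python. This is not how any existing tool implements it; `PendingChange`, `split_pending` and the `min_age_weeks` parameter are hypothetical names, and the changes are assumed to have already been read out of the ontology annotations.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class PendingChange:
    about: str          # CURIE of the annotated entity
    change_type: str    # e.g. "kgcl:NodeObsoletion"
    proposed_on: date   # date stored with the annotation


def split_pending(changes, today, min_age_weeks):
    """Return (due, waiting): changes older than the threshold vs. the rest."""
    cutoff = today - timedelta(weeks=min_age_weeks)
    due = [c for c in changes if c.proposed_on < cutoff]
    waiting = [c for c in changes if c.proposed_on >= cutoff]
    return due, waiting


pending = [
    PendingChange("EX:0001", "kgcl:NodeObsoletion", date(2024, 1, 2)),
    PendingChange("EX:0002", "kgcl:NodeObsoletion", date(2024, 2, 20)),
]
due, waiting = split_pending(pending, today=date(2024, 3, 1), min_age_weeks=4)
```

Rejected changes never reach this step, since rejecting a change (item 2) just means deleting its annotation before the next run.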

gouttegd commented 8 months ago

I have a (very) rough, (very) experimental PoC in my wip/provisional-changes branch. It only supports NodeObsoletion for now (with or without replacement or alternatives).

This adds two new commands to my KGCL plugin for ROBOT:

gouttegd commented 8 months ago

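If we do end up storing the provisional changes as annotations, I’d suggest that we avoid using anonymous individuals like in my example above. They cause at least two issues:

  1. They are not rendered very nicely in Protégé.
  2. More importantly, they cannot be stored in an OBO file (not even, surprisingly, in the owl-axioms header tag), which is obviously a problem for ontologies that use this format as their edit format. Sure, this can be worked around by storing the provisional changes in a dedicated component in another format, but that’s not great.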

Instead, I’d suggest something like this:

AnnotationAssertion(
  Annotation(kgcl:has_nondirect_replacement UBERON:2222222)
  kgcl:NodeObsoletion UBERON:1111111 "Proposed for obsoletion"^^xsd:string)

where the annotation value is a small, human-readable string.

This would make the provisional change nicer to visualise in Protégé: [Screenshot 2024-02-09 at 11:33:53]

and make it perfectly serialisable in the OBO format:

property_value: https://w3id.org/kgcl/NodeObsoletion "Proposed for obsoletion" xsd:string {https://w3id.org/kgcl/has_nondirect_replacement="UBERON:2222222"}

gouttegd commented 8 months ago

Or, we could have a dedicated annotation property (e.g. kgcl:PendingChange) and use the type of change (e.g. kgcl:NodeObsoletion, kgcl:NodeMove, etc.) as the annotation value, as in:

AnnotationAssertion(Annotation(kgcl:has_nondirect_replacement UBERON:2222222) kgcl:PendingChange UBERON:1111111 kgcl:NodeObsoletion)

I think this would make slightly more sense, and it would also make it slightly easier to later extract the annotations to apply the pending changes (just extract all the kgcl:PendingChange annotations, instead of extracting all the annotations with an IRI in the kgcl: namespace).
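
The difference between the two extraction strategies can be illustrated with a small Python sketch over plain tuples. This assumes the annotations have already been pulled out of the ontology as (property, entity, value) rows; kgcl:PendingChange is the proposed property from this comment, not an existing KGCL term, and the function names are hypothetical.

```python
KGCL = "https://w3id.org/kgcl/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Dedicated-property form: (annotation property, annotated entity, value)
dedicated_form = [
    (KGCL + "PendingChange", "UBERON:1111111", KGCL + "NodeObsoletion"),
    (KGCL + "PendingChange", "UBERON:4444444", KGCL + "NodeMove"),
    (RDFS_LABEL, "UBERON:1111111", "some label"),
]

# Earlier form: the change type itself is the annotation property
type_as_property_form = [
    (KGCL + "NodeObsoletion", "UBERON:1111111", "Proposed for obsoletion"),
    (RDFS_LABEL, "UBERON:1111111", "some label"),
]


def pending_by_property(rows):
    """Match one dedicated property; the change type is the annotation value."""
    return [(entity, value) for prop, entity, value in rows
            if prop == KGCL + "PendingChange"]


def pending_by_namespace(rows):
    """Match any property in the kgcl: namespace; the property is the change type."""
    return [(entity, prop) for prop, entity, _ in rows if prop.startswith(KGCL)]


found_dedicated = pending_by_property(dedicated_form)
found_by_namespace = pending_by_namespace(type_as_property_form)
```

Both functions return (entity, change type) pairs; the first needs only an equality test on a single IRI, while the second depends on every property in the namespace being a change type.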

cmungall commented 8 months ago

This is quite ingenious

Of course it’s not ideal having two mappings to rdf/owl but the lack of quoting in rdf (sigh) makes the direct form impractical as you point out

We might want to have a different namespace for the AP translation just in case anyone ever wants to combine?

The lack of uniformity between NewX and other changes on X is bothering me a little. In some ways having all proposals be ontology annotations is more balanced. But having it be on the about entity is more direct and visible. But maybe having it on the ontology makes it easier to see all pending changes in one place? Like the TODOs at the top of a file?

Curious what others think!

On Fri, Feb 9, 2024 at 4:02 AM Damien Goutte-Gattat wrote:

If we do end up storing the provisional changes as annotations, I’d suggest that we avoid using anonymous individuals like in my example above. They cause at least two issues:

  1. They are not rendered very nicely in Protégé (screenshot: https://github.com/INCATools/kgcl/assets/53821801/55dcdf49-8433-4bb3-ba31-00384db1315c).
  2. More importantly, they cannot be stored in an OBO file (not even, surprisingly, in the owl-axioms header tag), which is obviously a problem for ontologies that use this format as their edit format. Sure, this can be worked around by storing the provisional changes in a dedicated component in another format, but that’s not great.

gouttegd commented 8 months ago

The lack of uniformity between NewX and other changes on X is bothering me a little.

Me too.

One particular problem of class creation changes is that such changes would almost never occur in isolation. At the very least, a kgcl:ClassCreation change would be accompanied by a kgcl:PlaceUnder change to give the new class a position within the hierarchy, and most likely a whole bunch of other changes as well (e.g. a kgcl:NewTextDefinition to define the class, one or several kgcl:EdgeCreation to establish relationships, maybe some kgcl:NewSynonym, etc.).

I am reluctant to use ontology-level annotations to represent all those changes that pertain to a single (new) class. I am not even sure I have a clear idea of what that would look like.

For now, I think that actually creating the new class and then annotating it with all the other changes relevant for that class is the best option, but I am also curious about what other folks have to say.

An obvious problem of creating the new class immediately is that there is a risk that people start using the class, maybe because they don’t realise it is provisional. But that’s something that can be checked against in CI.

gouttegd commented 8 months ago

For clarity, the proposed storage methods envisioned so far are (assuming, for the example, that we want to store the change represented by obsolete EX:1234 with alternative EX:5678,EX:6789):

A. String annotation using the KGCL DSL serialisation format

AnnotationAssertion(
  Annotation(dcterms:date "2024-02-14"^^xsd:date)
  Annotation(dcterms:creator ORCID:1111-2222-3333-4444)
  kgcl:NodeObsoletion EX:1234 "obsolete EX:1234 with alternative EX:5678,EX:6789")

Simple, but not great as it requires whoever wants to know what the change is to (re-)parse the KGCL change again.
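
To illustrate the extra step that option A imposes, here is a minimal Python sketch that re-parses the annotation string. The regular expression is a deliberate simplification covering only `obsolete X` with an optional `with alternative` list; it is not the KGCL grammar, and `parse_obsoletion` is a hypothetical helper.

```python
import re

# Simplified pattern covering only "obsolete X [with alternative A,B,...]".
# The real KGCL grammar is much richer; this is an illustration only.
OBSOLETE = re.compile(
    r"^obsolete\s+(?P<node>\S+)"
    r"(?:\s+with\s+alternative\s+(?P<alts>.+))?$"
)


def parse_obsoletion(command: str):
    """Return (node, [alternatives]), or None if the pattern does not match."""
    match = OBSOLETE.match(command.strip())
    if match is None:
        return None
    alts = match.group("alts")
    alternatives = [a.strip() for a in alts.split(",")] if alts else []
    return match.group("node"), alternatives


parsed = parse_obsoletion("obsolete EX:1234 with alternative EX:5678,EX:6789")
```

Anyone querying for proposed obsoletions would have to run something like this on every annotation value after retrieving it, which is the cost the options below try to avoid.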

B. Parameters stored in annotations, value is an individual

AnnotationAssertion(
  Annotation(dcterms:date "2024-02-14"^^xsd:date)
  Annotation(dcterms:creator ORCID:1111-2222-3333-4444)
  Annotation(kgcl:has_nondirect_replacement EX:5678)
  Annotation(kgcl:has_nondirect_replacement EX:6789)
  kgcl:NodeObsoletion EX:1234 _:x)

The use of an anonymous individual is annoying (it’s not displayed in a user-friendly manner in Protégé and cannot be serialised properly in the OBO flat file format).

C. Similar, but value is a human-readable string

AnnotationAssertion(
  Annotation(dcterms:date "2024-02-14"^^xsd:date)
  Annotation(dcterms:creator ORCID:1111-2222-3333-4444)
  Annotation(kgcl:has_nondirect_replacement EX:5678)
  Annotation(kgcl:has_nondirect_replacement EX:6789)
  kgcl:NodeObsoletion EX:1234 "Proposed for obsoletion")

Intended to address the issues with B (displayable nicely in Protégé, serialisable in OBO).

D. Value is the type of KGCL change

AnnotationAssertion(
  Annotation(dcterms:date "2024-02-14"^^xsd:date)
  Annotation(dcterms:creator ORCID:1111-2222-3333-4444)
  Annotation(kgcl:has_nondirect_replacement EX:5678)
  Annotation(kgcl:has_nondirect_replacement EX:6789)
  kgcl:PendingChange EX:1234 kgcl:NodeObsoletion)

Also reasonably user-friendly, slightly more developer-friendly (just one annotation property to look for when querying for pending changes in an ontology). That’s what is currently implemented in KGCL-Java.

(Note that kgcl:PendingChange is an IRI that I minted on the spot and which does not currently exist in the KGCL namespace; we could also simply use kgcl:Change as the annotation property, but I like the idea of using an AP whose name makes it clear that we’re talking about a proposed change. I have no strong opinion on that, though.)

matentzn commented 7 months ago

I just finally got to read this proposal. This is extensive, and I think I like the general direction. It seems like a complex system for something that we have not even managed to convince people of - clearly documenting intended changes. I think most ontologies will just continue to go ahead with the change (whether this is good or bad is a different story). I am against "proposed additions" from an SOP perspective, but can see "proposed reclassification" and "proposed obsoletion" to be potentially valuable.

I would advocate for a system purely built around annotation properties, and avoid individuals, anonymous or otherwise. Some of our tools like SLME are pretty unreasonable when it comes to individuals (including even signature disjoint individuals). I don't know about anonymous individuals but my guess is we will have some OBO format issues with that.

FWIW, my favourite is D (assuming that kgcl:PendingChange and kgcl:NodeObsoletion are both APs):

AnnotationAssertion(
  Annotation(dcterms:date "2024-02-14"^^xsd:date)
  Annotation(dcterms:creator ORCID:1111-2222-3333-4444)
  Annotation(kgcl:has_nondirect_replacement EX:5678)
  Annotation(kgcl:has_nondirect_replacement EX:6789)
  kgcl:PendingChange EX:1234 kgcl:NodeObsoletion)

This allows me to SPARQL for all pending changes at once.

gouttegd commented 7 months ago

I don't know about anonymous individuals but my guess is we will have some OBO format issues with that.

Definitely. Though to be honest I don’t care that much about that. People who want to be able to use new features such as this one should be ready and willing to let go of old stuff like the OBO format (or be willing to invest the time to make the format evolve – and at the moment you say that, suddenly there’s nobody around anymore).

If we can store “provisional changes“ in such a way that they can be preserved in the OBO format, that’s good, but if we can’t, I’ll just shrug. “Yeah, it won’t work in OBO. What did you expect? We can’t keep retrofitting new stuff into an old format.”

FWIW, my favourite is D

Good, because that’s what we currently have in KGCL-Java. :) Modulo two things:

This allows me to SPARQL for all pending changes at once.

Yep, that was the idea behind this form.

I also think it is shown nicely in Protégé, for example

obsolete CL:0000028 with replacement CL:0000029

is shown as:

[Screenshot 2024-02-22 at 15:08:05]

and

create edge CL:0000029 rdfs:subClassOf CL:0000030

is shown as:

[Screenshot 2024-02-22 at 15:08:46]

And it’s not too ugly in the OBO format either:

property_value: https://w3id.org/kgcl/PendingChange https://w3id.org/kgcl/NodeObsoletion {http://purl.org/dc/terms/date="2024-02-22T14:43:00Z", http://w3id.org/kgcl/has_direct_replacement="CL:0000029"}
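
As a rough check that this OBO form stays machine-readable, the line above can be picked apart with a few lines of Python. The regular expressions are a simplification that only handles an IRI-valued property_value with trailing qualifiers, not the full OBO format; `parse_property_value` is a hypothetical helper.

```python
import re

LINE = (
    'property_value: https://w3id.org/kgcl/PendingChange '
    'https://w3id.org/kgcl/NodeObsoletion '
    '{http://purl.org/dc/terms/date="2024-02-22T14:43:00Z", '
    'http://w3id.org/kgcl/has_direct_replacement="CL:0000029"}'
)

# property_value: <property> <value> {qualifier="value", ...}
TAG = re.compile(r'^property_value:\s+(\S+)\s+(\S+)\s*(?:\{(.*)\})?\s*$')
QUALIFIER = re.compile(r'(\S+?)="([^"]*)"')


def parse_property_value(line):
    """Return (property, value, qualifiers) for a simple property_value line."""
    match = TAG.match(line)
    if match is None:
        raise ValueError("not a property_value line: " + line)
    prop, value, trailing = match.groups()
    # Qualifiers carry the change parameters (date, replacement, ...)
    qualifiers = dict(QUALIFIER.findall(trailing or ""))
    return prop, value, qualifiers


prop, value, qualifiers = parse_property_value(LINE)
```

The change type ends up in `value` and its parameters in `qualifiers`, so no KGCL parsing is needed to tell what kind of change is pending.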

gouttegd commented 7 months ago

I also think it is shown nicely in Protégé, for example

Of course we could also have a KGCL plugin for Protégé that recognises these annotations and displays the pending changes in an even nicer way and… no, I’ll stop this line of thought right now.

cmungall commented 7 months ago

I don't think OBO Format limitations are relevant for this. Regardless of expressivity, we can keep using OBO format (sorry), and just store these in a separate imported .owl file; we have this pattern for a lot of ontologies.

gouttegd commented 7 months ago

Regardless of expressivity, can keep using OBO format (sorry), and just store these in a separate imported .owl file

Right. I can add a --write-to option or similar to the apply command to write the new axioms to a separate file rather than merging them into the output ontology.

cmungall commented 6 months ago

It seems we are tending towards (D) and it is quite a practical solution.

However, I want to fully articulate my original idea of a direct representation. Every instance in KGCL can be represented either in the KGCL DSL or as a standard LinkML serialization of the underlying data:

Class: NodeRename

Command: rename GO:0005635 from 'nuclear envelope' to 'foo bar'

YAML:

id: CHANGE:001
type: NodeRename
old_value: nuclear envelope
new_value: foo bar
about_node: GO:0005635
about_node_representation: curie

Turtle:

@prefix CHANGE: <http://example.org/> .
@prefix GO: <http://purl.obolibrary.org/obo/GO_> .
@prefix kgcl: <http://w3id.org/kgcl/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

CHANGE:001 kgcl:about_node GO:0005635 ;
    kgcl:about_node_representation "curie" ;
    kgcl:new_value "foo bar" ;
    kgcl:old_value "nuclear envelope" ;
    rdf:type kgcl:NodeRename .

I think the most elegant and long-term maintainable approach is to use this direct RDF form, augmented with standard vocabularies to represent things such as pending status (see for example https://www.w3.org/TR/vocab-dcat-3/#life-cycle). E.g

CHANGE:001 kgcl:about_node GO:0005635 ;
    kgcl:about_node_representation "curie" ;
    kgcl:new_value "foo bar" ;
    kgcl:old_value "nuclear envelope" ;
    rdf:type kgcl:NodeRename ;
    pav:status ISO19115:pending ;
    dcterms:creator orcid:... ;
    dcterms:date ... ;
    rdfs:seeAlso <github url for discussion> .

It involves no new mappings, no new annotation properties. Semantically and entailment-wise it behaves entirely as expected. I can query for NodeChanges and get all asserted instances of subclasses of NodeChange. There is no need to develop new SPARQL queries to check for things such as accidental use of annotation properties. For example, if I accidentally make triples such as:

CHANGE:001 kgcl:about_node GO:0005635 ;
    kgcl:about_node_representation "curie" ;
    kgcl:new_value "foo bar" ;
    kgcl:new_value "foo baz" ;
    kgcl:old_value "nuclear envelope" ;
    rdf:type kgcl:NodeRename .

Then existing mechanisms will flag this.
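
The comment does not say which mechanism would do the flagging; a schema validator would be the usual choice. The hand-rolled Python check below only illustrates the kind of constraint involved, under the assumption that about_node, new_value and old_value each hold at most one value per change object. `SINGLE_VALUED` and `cardinality_violations` are hypothetical names.

```python
from collections import defaultdict

# (subject, predicate, object) triples mirroring the accidental example above
triples = [
    ("CHANGE:001", "kgcl:about_node", "GO:0005635"),
    ("CHANGE:001", "kgcl:new_value", "foo bar"),
    ("CHANGE:001", "kgcl:new_value", "foo baz"),
    ("CHANGE:001", "kgcl:old_value", "nuclear envelope"),
    ("CHANGE:001", "rdf:type", "kgcl:NodeRename"),
]

# Assumption: these slots hold at most one value per change object.
SINGLE_VALUED = {"kgcl:about_node", "kgcl:new_value", "kgcl:old_value"}


def cardinality_violations(rows):
    """Return {(subject, predicate): sorted values} where a single-valued slot repeats."""
    values = defaultdict(set)
    for subject, predicate, obj in rows:
        if predicate in SINGLE_VALUED:
            values[(subject, predicate)].add(obj)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


violations = cardinality_violations(triples)
```

With the direct representation the constraint is checked against the ordinary kgcl: predicates, so no separate rule set for an annotation-property encoding is needed.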

The existing triples could be loaded into a massive triplestore of all changes in all OBO ontologies, with powerful querying over the direct representation, using the standard KGCL vocabulary.

While I think this is elegant, clean, the correct way to do it, and the approach with the best long term maintainability and minimal cognitive overhead, I also reluctantly accept that this way also has some short term downsides due to our OWL stacks making various assumptions about individuals, and how confused people get by punning in OWL. While I think these problems are solvable, I don't have an immediate answer to how to resource fixing them.

So I am likely to accept mapping all triples (including rdf:type) to annotation properties. It's just one more mapping and piece of tacit knowledge. But I wanted to make sure the full proposal was given due consideration.

gouttegd commented 6 months ago

We are not tending towards anything.

I implemented (D) only for a small subset of changes (2 or 3, I don’t even remember), so that we can test how it works. It’s not set in stone.

For now, my main concern with this whole idea is that it seems to be nothing more than a discussion between the two of us. I have yet to see any hint that other people are interested, which gives me very little motivation to go any further, in any direction.

matentzn commented 6 months ago

Just an aside for @gouttegd:

  - When ODK was created, no one cared and talked, now 100+ repos.
  - When SSSOM was created, some people were interested, no one talked. Now rolled out in many projects (many of the discussions there are still just between you and me! Remember https://github.com/mapping-commons/sssom/issues/328)

KGCL is a complex issue, and this particular feature here even more so; I would not expect any specific input until people notice how your proposal will affect their files and tools.

gouttegd commented 6 months ago

I would not expect any specific input until people notice how your proposal will affect their files and tools..

Well then, don’t expect any work on that proposal from me until that happens. (And if it does not happen, so be it.)

Motivation matters. I do most of my work on KGCL on my own free time (look at the history of KGCL-Java: 112 out of 136 commits are associated with my personal email address and not my Cambridge email address, which means I made those commits from my personal machine outside of my work hours), because it’s far too removed from the work I am actually paid to do for me to be comfortable working on it during my work hours.

As far as I am concerned, KGCL-Java is merely one of the several free software projects that I either develop or contribute to. It’s nothing to do with my work. Which means, among other things, that I work on it if and when I want, and that it is in “competition” with those other free software projects for my free time. If I am not motivated to work on it, I don’t.

I myself have almost zero use for KGCL. And certainly zero use at all for the “storing provisional changes in the ontology” feature. So without any hint that the feature is going to be useful for someone, I am unlikely to do anything more than what I did so far. I’d much rather spend my free time working on SSSOM, which I do use and for which I have (too many) ideas of things to improve.

So, again, motivation matters. You can’t rely on people inventing completely new features in isolation, only for you to come and say “oh cool, I’ll use that, thanks!“. Well, you can, but you’re gonna wait for a long time. You want new features, you have to participate at some point, if only to say what you would like.

nlharris commented 6 months ago

@gouttegd I know I speak for many when I express how grateful we are that you do so much work for KGCL, Uberon and other projects on your own time. I totally get wanting to get confirmation that something will be useful before pouring time into it. After all, if you're not paid for your work on a project, the main reward you get is when people use (and, even better, build on) what you've written. I will also note that it's really hard to get feedback from people about ideas. I don't know how many times I asked potential users whether they wanted the UI to work this way or that way and got basically no response, and then when I pushed out a release, people were all like, "Oh, not THAT way." 🙄

matentzn commented 6 months ago

I myself have almost zero use for KGCL. And certainly zero use at all for the “storing provisional changes in the ontology” feature. So without any hint that the feature is going to be useful for someone, I am unlikely to do anything more than what I did so far. I’d much rather spend my free time working on SSSOM, which I do use and for which I have (too many) ideas of things to improve.

@gouttegd I can't argue with that :) I see it the same, and I personally divide my time across things by the same measure! Too many construction sites, too few people to stem the tide.

gouttegd commented 6 months ago

you do so much work for KGCL, Uberon and other projects on your own time.

The work I do for Uberon and CL is on my paid time. :) Part of my remit is basically “anything that can make cross-species scRNAseq studies easier to do”, so contributing to cross-species ontologies fits without a doubt.

(And I suppose I could argue that, because KGCL is being used on Uberon and CL, it could also be shoehorned into that remit, but I believe this is too far-fetched.)

Anyway, sorry for the off-topic. Back to KGCL provisional changes!