INCATools / kgcl

Datamodel for KGCL (Knowledge Graph Change Language)
https://w3id.org/kgcl/
MIT License

Creating a new class with an auto-allocated ID #56

Open gouttegd opened 7 months ago

gouttegd commented 7 months ago

Currently, users wanting to create a new class using KGCL are expected to know in advance the ID of the class to be created, so that they can issue a create class ID:1234 "label" command.

This is hardly compatible with the intended use of KGCL in bug tracker tickets.

There would be several ways to address the problem.

A. Non-technical solution. Leave KGCL as it is, but expect ontologies to have an ID range specifically intended for KGCL changes, and document that range for users.

Not ideal, as it puts the entire burden of allocating the ID on the users (who must first figure out which range is allocated to KGCL-mediated changes, and then find the lowest unused ID in that range).

This is, in effect, the current situation.

B. Deal with auto IDs at the level of the Ontobot. Leave KGCL as it is. Agree on a special keyword (for example ID:auto) to use in the KGCL DSL syntax, and have Ontobot automatically replace that keyword with a suitable auto-generated ID before actually passing the KGCL data to the KGCL library. It’s up to the Ontobot to figure out how to allocate the ID (probably by parsing the -idranges.owl file, if such a file exists).

C. Similar to B, but at the level of KGCL itself. That is, the KGCL DSL explicitly defines the ID:auto keyword, and KGCL libraries are expected to know that they should automatically allocate an ID when this keyword is used.

I currently think this would be the best solution.

Both B and C would allow a user to do something like this:

create class ID:auto "new label"
add definition "new definition" to ID:auto
create edge ID:auto rdfs:subClassOf EX:1234

D. Add variables to the KGCL DSL. Make it possible to do something like this:

let my_new_class = create class "new label"
add definition "my definition" to my_new_class
create edge my_new_class rdfs:subClassOf EX:1234

Technically speaking the most elegant solution, but I don’t think we want to add such constructs to the KGCL DSL syntax – which is expected to be a simple syntax for mostly non-technical users.

hrshdhgd commented 7 months ago

I like [C], but we should also have guardrails to make sure a permanent ID is assigned to the entity before the PR is merged. Otherwise it would open up all kinds of curation nightmares. In practice this is unlikely to happen, since curators are the gatekeepers of what goes into an ontology. But do you think KGCL can even enforce something like that? One option is to add merge rules in GitHub, but I haven't done that enough to say confidently whether it is even possible. There is a uuid: assigned internally, but that is a change ID rather than an entity ID.

balhoff commented 7 months ago

Maybe this is a good use case for implementing robot mint.

gouttegd commented 7 months ago

we should also have guardrails to make sure a permanent ID is assigned to the entity before the PR is merged.

The way I was envisioning this, the permanent ID would be assigned by the KGCL engine when the KGCL data is processed – so by the time a PR is created, this would already be done.

That is, if I ask for the following changes:

create class ID:auto "new label"
add definition "new definition" to ID:auto
create edge ID:auto rdfs:subClassOf EX:1234

the KGCL engine (either the Python library or my KGCL-Java) would:

  1. figure out which ID to assign;
  2. replace ID:auto by the assigned ID in all the changes;
  3. proceed with making the changes.

And then Ontobot would go on and create the PR with those changes.
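
The three steps above could be sketched roughly as follows. This is only an illustration, not the code of the Python library or of KGCL-Java: the function names, the "lowest free ID" policy, the seven-digit padding, and the idea of treating KGCL commands as plain strings are all assumptions made for the example.

```python
import re


def next_free_id(existing_ids, prefix, lower, upper, width=7):
    """Step 1 (assumed policy): return the lowest unused ID in [lower, upper]."""
    for n in range(lower, upper + 1):
        candidate = f"{prefix}:{n:0{width}d}"
        if candidate not in existing_ids:
            return candidate
    raise ValueError("no free ID left in the range")


def resolve_auto_id(commands, existing_ids, prefix, lower, upper):
    """Step 2: replace every ID:auto token with one newly allocated ID.

    Caveat of this naive sketch: an ID:auto occurring inside a quoted
    label or definition would be replaced as well.
    """
    new_id = next_free_id(existing_ids, prefix, lower, upper)
    token = re.compile(r"(?<![\w:])ID:auto(?![\w:])")
    return [token.sub(new_id, command) for command in commands]


if __name__ == "__main__":
    changes = [
        'create class ID:auto "new label"',
        'add definition "new definition" to ID:auto',
        "create edge ID:auto rdfs:subClassOf EX:1234",
    ]
    # Step 3 (applying the changes) is left to the real KGCL engine.
    for line in resolve_auto_id(changes, {"EX:0000001"}, "EX", 1, 100):
        print(line)
```

Because the ID is chosen against the current state of the ontology only, two requests processed before either PR is merged would receive the same ID, which is the clash discussed below.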

Yes, such a workflow would create a risk of having concurrent PRs with clashing IDs. If there is already a PR with an auto-assigned ID waiting to be merged, and someone asks for another class creation in a second PR before the first one is merged, then the second PR would end up with the same auto-assigned ID. But this is a risk that already exists right now with manually created PRs.

Still, if that is a concern, we could assign “temporary” auto-generated IDs that would have to be converted to permanent IDs in a later step using something like @balhoff’s mint command (possibly at merge time). But I feel like this should be optional: it should be possible to have auto-assigned IDs without having to rely on a specific ROBOT command to finalise the work.

cmungall commented 7 months ago

While we are doing this should we also consider:

create class _:1 "Mammal"
create class _:2 "Dog"
create edge _:2 rdfs:subClassOf _:1

This uses blank node syntax which I'm 75% sure is a bad idea (what if at some point in the future we want to allow blank nodes qua blank nodes)

But the same thing could be achieved using a marker prefix, such as AUTO

(Seeing the ID:auto reminded me that this was always the default in oboedit and the web is still littered with ontologies with ID:nnnn identifiers..., underscoring the importance of enshrining the marker prefix as a standard so there is no leakage...)

gouttegd commented 7 months ago

But the same thing could be achieved using a marker prefix, such as AUTO

My current thinking (theoretical only; I have not written any code for that yet) is to make the prefix configurable at the application level, with a default of AUTO: (or maybe AUTOID: to make the intention clearer).

That is, by default it would be possible to do:

create class AUTOID:1 "Mammal"
create class AUTOID:2 "Dog"
create edge AUTOID:2 rdfs:subClassOf AUTOID:1

but if someone wants to use KGCL on an ontology where AUTOID is a legitimate prefix (seems unlikely but who knows?), then they would be able to specify another pseudo-prefix (with an option like --auto-id-prefix MY_CUSTOM_PREFIX). For now it would be possible to choose a pseudo-prefix of _ (and therefore to use the _:1 syntax), but if we later decide to extend KGCL to allow the manipulation of blank nodes, we would just have to explicitly forbid the use of such a prefix (“error: '_' is an invalid value for the --auto-id-prefix option, because it conflicts with the syntax for blank nodes; please specify another auto ID prefix”).
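
A minimal sketch of this idea, assuming commands are handled as plain strings: each distinct placeholder (AUTOID:1, AUTOID:2, ...) is mapped to its own newly allocated ID, and the pseudo-prefix is a parameter. The function name, the `allocate` callback, and the `blank_nodes_supported` flag are hypothetical and only serve the illustration.

```python
import re


def resolve_placeholders(commands, allocate, auto_prefix="AUTOID",
                         blank_nodes_supported=False):
    """Replace <auto_prefix>:<key> placeholders, one new ID per distinct key.

    `allocate` is a hypothetical callback returning a fresh ID on each call.
    """
    if blank_nodes_supported and auto_prefix == "_":
        # Only relevant if KGCL is ever extended to handle blank nodes.
        raise ValueError(
            "'_' is an invalid auto ID prefix, because it conflicts with "
            "the syntax for blank nodes; please specify another prefix"
        )
    pattern = re.compile(rf"(?<![\w:]){re.escape(auto_prefix)}:(\w+)")
    mapping = {}

    def substitute(match):
        key = match.group(1)
        if key not in mapping:
            mapping[key] = allocate()
        return mapping[key]

    return [pattern.sub(substitute, command) for command in commands]


if __name__ == "__main__":
    fresh_ids = iter(["EX:0000010", "EX:0000011"])
    changes = [
        'create class AUTOID:1 "Mammal"',
        'create class AUTOID:2 "Dog"',
        "create edge AUTOID:2 rdfs:subClassOf AUTOID:1",
    ]
    for line in resolve_placeholders(changes, lambda: next(fresh_ids)):
        print(line)
```

Keeping the mapping per placeholder key is what lets the third command refer back to both new classes consistently.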

gouttegd commented 6 months ago

Here is what is currently implemented in KGCL-Java (in the master branch only for now, but it will be available in the upcoming 0.4.0 release):

The apply command recognises identifiers of the form AUTOID:xxx in KGCL commands (where xxx can be anything; it does not need to be a numerical ID). When applying the changes, such identifiers are replaced by automatically generated identifiers.

There are three “modes” to determine how the identifiers are generated:

“Manual” mode. Identifiers are generated according to parameters passed on the command line, with the following options:

To minimise the risk of simultaneous PRs having the same auto-generated IDs, the identifiers are picked randomly within the specified range (existing IDs in the ontology are always checked first and explicitly avoided).
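
The random selection described above might look like the following sketch. It is not the KGCL-Java code: the function name, its parameters, and the seven-digit padding are assumptions, and listing every free number in the range is only reasonable for small ranges.

```python
import random


def pick_random_id(existing_ids, prefix, lower, upper, width=7, rng=random):
    """Pick a random unused ID in [lower, upper], skipping IDs already in use."""
    free = [
        n for n in range(lower, upper + 1)
        if f"{prefix}:{n:0{width}d}" not in existing_ids
    ]
    if not free:
        raise ValueError("no free ID left in the range")
    return f"{prefix}:{rng.choice(free):0{width}d}"


if __name__ == "__main__":
    in_use = {"EX:0000001", "EX:0000002"}
    print(pick_random_id(in_use, "EX", 1, 1000))
```

With a range of N free IDs, two independent requests collide with probability about 1/N, instead of always colliding as they would with a "lowest free ID" policy.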

ID range policy mode. Similar to manual mode, but the parameters are taken from a -idranges.owl file (as automatically generated by the ODK, and already used by Protégé 5.6.x). This is controlled by two options:

If --auto-id-range-name is not used, the command will automatically look for a range allocated to kgcl, KGCL, ontobot, or Ontobot.
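
The fallback lookup could be sketched as below, assuming the -idranges.owl file has already been parsed into a dictionary mapping each range owner to its (lower, upper) bounds. The function name, the dictionary shape, and the order in which the default names are tried are assumptions.

```python
# Default owners tried when no range name is given (order is an assumption).
DEFAULT_RANGE_NAMES = ("kgcl", "KGCL", "ontobot", "Ontobot")


def select_range(ranges, requested_name=None):
    """Return the (lower, upper) bounds to use for auto-generated IDs."""
    if requested_name is not None:
        if requested_name not in ranges:
            raise LookupError(f"no ID range allocated to {requested_name!r}")
        return ranges[requested_name]
    for name in DEFAULT_RANGE_NAMES:
        if name in ranges:
            return ranges[name]
    raise LookupError("no ID range allocated to KGCL or Ontobot")


if __name__ == "__main__":
    parsed = {"alice": (1, 9999), "ontobot": (10000, 19999)}
    print(select_range(parsed))           # falls back to the ontobot range
    print(select_range(parsed, "alice"))  # explicit range name
```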

Temporary mode. Generates random, temporary identifiers that should later be replaced by a mint command as suggested by @balhoff. This mode is selected by the --auto-id-temp-prefix <prefix> option, where <prefix> is the prefix in which the temporary identifiers are generated (identifiers are of the form <prefix><uuid>).

KGCL-Java also includes a putative implementation of such a mint command.
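
To illustrate the temporary mode and a later minting step, here is a rough sketch. It is not the KGCL-Java or ROBOT implementation: the function names, the textual UUID format, and the approach of rewriting identifiers in plain text are all assumptions.

```python
import re
import uuid


def make_temp_id(temp_prefix):
    """Build a temporary identifier of the form <prefix><uuid>."""
    return f"{temp_prefix}{uuid.uuid4()}"


def mint(text, temp_prefix, permanent_ids):
    """Replace each distinct temporary ID in `text` with a permanent one.

    `permanent_ids` is an iterator yielding permanent IDs (hypothetical).
    """
    pattern = re.compile(re.escape(temp_prefix) + r"[0-9a-fA-F-]{36}")
    mapping = {}

    def substitute(match):
        temp_id = match.group(0)
        if temp_id not in mapping:
            mapping[temp_id] = next(permanent_ids)
        return mapping[temp_id]

    return pattern.sub(substitute, text)


if __name__ == "__main__":
    prefix = "http://example.org/temp/"
    temp = make_temp_id(prefix)
    ontology = f'<{temp}> rdfs:label "Dog" . <{temp}> a owl:Class .'
    print(mint(ontology, prefix, iter(["http://example.org/EX_0000042"])))
```

Since temporary IDs are random UUIDs, concurrent PRs cannot clash; the permanent ID is only decided when the mint step runs, for example at merge time.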