In IRIs, use opaque identifiers instead of english labels

alanruttenberg commented 3 years ago

OBO Policy was designed for good reasons.

First, by using interpretable labels you potentially alienate or confuse users in different communities where terms are known by different names.

Second, we want our ontologies to be used worldwide, and using English in IRIs is not welcoming to non-English speakers. The sanctioned mechanism for providing user-readable labels is to use rdfs:label or SKOS properties, with literals that carry language tags.

Third, there will inevitably be cases where words are spelled wrong, or disputed, which makes for pressure to "fix" the IRIs. Unfortunately, such fixes are typically breaking changes to users.
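To illustrate the label mechanism mentioned above, here is a minimal sketch of looking up a language-tagged rdfs:label for an opaque IRI. The example.org IRI, the sample labels, and the label_for helper are assumptions made for illustration; they are not taken from CCO or OBO.

```python
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Hypothetical data: (subject IRI, predicate IRI, (lexical form, language tag)).
TRIPLES = [
    ("http://example.org/ont_00000001", RDFS_LABEL, ("aircraft", "en")),
    ("http://example.org/ont_00000001", RDFS_LABEL, ("aéronef", "fr")),
    ("http://example.org/ont_00000001", RDFS_LABEL, ("Luftfahrzeug", "de")),
]


def label_for(iri, triples, preferred_langs=("en",)):
    """Return the first label in a preferred language, else fall back to the IRI."""
    labels = {
        lang: text
        for s, p, (text, lang) in triples
        if s == iri and p == RDFS_LABEL
    }
    for lang in preferred_langs:
        if lang in labels:
            return labels[lang]
    return iri
```

The IRI stays the same for every community; only the label chosen for display changes.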

jimschoening1 commented 1 month ago

Alan: But key CCO (plus domain extensions) stakeholders are voting members of the IEEE Ontology Standards Work Group (OSWG). Could OSWG own the domain name (whether it is ontologyrepository or purl.ieee.sa/cco/)? OSWG has time-tested Policies and Procedures (P&P) with higher levels of governance. Would that not be more trustworthy than the emerging CCO Governance Board? I realize BFO set up their own governance structure, but members of it can't join ISO as individuals. Also, even if IEEE management (which admittedly has little stake in CCO) were to defund the PURL server, the URL (purl.ieee.org/sa/cco/...) would still be stable, since IEEE would have no reason to turn off the subdomain. CCO stakeholders could use that URL to stand up our own PURL server, probably under the ultimate control of IEEE, but not requiring their time or budget. In fact, that was the original request to IEEE: for us to set up a PURL server. Their IT team leader said he could do it for us instead. We might want to get more control of that server. To make the case, we could prepare a Risk Management chart showing the risk of loss of budget for the PURL server, with the mitigation that we have volunteers who have demonstrated we can run it if needed. The IT Team Leader would fight us, but I chair the IEEE PURL Server Subgroup and could take this over his head.

alanruttenberg commented 1 month ago

You wouldn't be able to use purl.ieee.org in the case that IEEE decommissioned the server, because that's a subdomain of ieee.org. Subdomains are managed with records attached to the primary domain, so changing the IP address of purl.ieee.org would require IEEE to act. In the scenario I am talking about, having IEEE do anything related to the PURL breaks down. Moreover, large organizations won't want their subdomains pointing to resources outside their control.

As I said, I have the most confidence in the developers and active users of the ontology having the right incentives to make resolution work in the long term, and so they should be in control of the domain. The setup would be to have, at any time, 3 designated members who are able to manage the domain records. Payment is cheap, and we've always found someone who is willing to pay the $15 or $20 to renew the domain yearly. If at some point it seems like there's no responsible party within the community, then we can always find a new steward at that point.

I have less confidence in the overarching OSWG lasting as long as the ontologies under it last.

We're going back and forth on this. What I suggest has precedent and has worked for over a decade at OBO, so it's a tested solution. I don't understand why there is so much resistance. Technical setup is trivial.

jimschoening1 commented 1 month ago

Alan, It sounds like you don't object to moving to opaque identifiers ;) These other points are off topic, but deserve further discussion, so I will break them out as new issues.

alanruttenberg commented 1 month ago

Nope. But if there's going to be a change in domain as well, both changes ought to be done in one release.

jimschoening1 commented 1 month ago

After I talk to Brian Haugh next week to explore any viable compromise solution (I don't see one, but we need to try), I will likely make the motion: "P3195.1 Common Core Ontologies and all P3195.1.X extension ontologies convert to opaque identifiers." I'll suggest we conduct a roll-call vote at our 14 Aug OSWG meeting. Note: If this passes, the editors of each of the 3 drafts will make this change, but then those new versions will need to be voted on. Reminder: I recused myself on this topic, so Cameron More (OSWG Secretary) will lead things after I make the motion.

alanruttenberg commented 1 month ago

@jimschoening1 If you mean opaque vs English, a semi-compromise is to define an invertible transformation between a version of CCO with opaque ids and a version with English IRIs, with the latter to be used only for debugging purposes - not release.

From opaque->English

Add annotation property recording opaque IRI to each term
URL-encode the value of rdfs:label (optionally collapse spaces, and camelcase but this might break uniqueness, and will still probably require some url-encoding)
Rewrite each opaque IRI to have its last component be the transformed rdfs:label

From English->opaque

Rewrite each English IRI with the saved opaque IRI in the annotation property, and remove the annotation property.

This isn't a hard script to write.
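The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a finished tool: the ontology is modelled as a plain list of (subject, predicate, object) string tuples rather than parsed from an OWL file, each term is assumed to have exactly one rdfs:label, and the annotation property IRI http://example.org/debug/opaqueIRI is hypothetical.

```python
from urllib.parse import quote

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
# Hypothetical annotation property used only to remember the opaque IRI.
OPAQUE_IRI_PROP = "http://example.org/debug/opaqueIRI"


def _rename(triples, mapping):
    """Apply an IRI -> IRI mapping to every position of every triple."""
    return [tuple(mapping.get(term, term) for term in t) for t in triples]


def opaque_to_english(triples):
    """Return a debugging copy whose IRIs end in the URL-encoded rdfs:label."""
    mapping = {}
    for s, p, o in triples:
        if p == RDFS_LABEL:
            # Keep everything up to and including the last '/' or '#'.
            namespace = s[: max(s.rfind("/"), s.rfind("#")) + 1]
            mapping[s] = namespace + quote(o, safe="")
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("labels are not unique; transformation is not invertible")
    renamed = _rename(triples, mapping)
    # Record the original opaque IRI on each renamed term.
    saved = [(english, OPAQUE_IRI_PROP, opaque) for opaque, english in mapping.items()]
    return renamed + saved


def english_to_opaque(triples):
    """Invert opaque_to_english using the saved annotation, then drop it."""
    mapping = {s: o for s, p, o in triples if p == OPAQUE_IRI_PROP}
    kept = [t for t in triples if t[1] != OPAQUE_IRI_PROP]
    return _rename(kept, mapping)
```

The uniqueness check matters because two terms with the same label in the same namespace would collapse to one English IRI, which is the uniqueness risk noted in the steps.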

BrianHaugh commented 1 month ago

If DoD-IC developers are forced to deal with opaque IRIs for all of the hundreds of CCO elements as well as with those in BFO, then it could help to have a simple script to translate such opaque content into human-readable content.

However, using such scripts to make native opaque ontologies comprehensible has obvious costs in complexity and time for debugging or even for viewing diffs between old and new versions of ontologies. Running such a script forward and backward on ontology files or on instance data files based on an ontology is time consuming and a potential source of errors.

Git protocols are widely used in development tools such as GitHub, GitLab, and Bitbucket (in the DoD-IC Ontology Foundry toolbox), which support a convenient display of diffs between old and new versions of ontologies (or software more generally) to facilitate the review of changes. These diffs are between the native files and do not include any transformation to make opaque content more human understandable. If all of CCO is opaque, then these diffs will be very difficult to understand for many developers using CCO who have not memorized the intended interpretation of well over 1,000 opaque IRIs. Restrictions, in particular, are a challenge because the class and property IRIs that are cited there are far from the annotations that could be used to interpret them. To make the diffs understandable will involve taking the relevant files out of the Git repository, translating them to be understandable, and then doing a diff on them. Then, if adjustments are needed, those need to be translated back into the native format before they are made to the native files. Such use of opaque IRIs thus creates an unnecessary burden on developers and communities reviewing ontology updates.

When syntax errors are found in ontology files (or ontology-based data files), tools such as Protégé commonly identify them by references to line numbers in the source files. So, developers have to go back to the source file and find the errors and fix them. The context of such errors is especially difficult to decipher when they are not in close proximity to assertions of labels and definitions in an ontology. It is even worse, when searching for errors in data files where only the IRIs appear for classes and properties. In such cases, a script to transform IRIs to labels could be used if needed. But that option is not always available when using other tools and applications.

Other tools and applications using ontologies sometimes provide results only in terms of the IRIs of classes, properties and individuals. Although it has been argued that we should use better different tools/applications that display labels, that is not always an option. Some projects/programs stipulate the use of certain software and developers have no choice to substitute an alternative. One such project required use of a commercial information extraction tool that used BFO at the top-level and provided its extracted result data in graphical displays using class and property IRIs. Understanding the graphs required frequent look-ups to BFO using a separate tool to translate its opaque IRIs. The visual graph displays could not be run through a script to make them understandable. If an opaque version of CCO were used in such a context, it would be even more burdensome to decipher since it would be challenging to remember what all the CCO opaque codes mean.

These are just some examples that I have experienced of the difficulties that opaque IRIs create for developers. Having translation scripts is not always much help.

johnbeve commented 1 month ago

@BrendaBraitling I'm moving your comment regarding licensing to the issue where that topic is now being discussed, to keep this thread focused on opaque identifiers.

jonathanvajda commented 1 month ago

@BrianHaugh

"If DoD-IC developers are forced to deal with opaque IRIs for all of the hundreds of CCO elements as well as with those in BFO, then it could help to have a simple script to translate such opaque content into human-readable content. However, using such scripts to make native opaque ontologies comprehensible has obvious costs in complexity and time for debugging or even for viewing diffs between old and new versions of ontologies. Running such a script forward and backward on ontology files or on instance data files based on an ontology is time consuming and a potential source of errors. Git protocols are widely used in development tools such as GitHub, GitLab, and Bitbucket (in the DoD-IC Ontology Foundry toolbox), which support a convenient display of diffs between old and new versions of ontologies (or software more generally) to facilitate the review of changes. These diffs are between the native files and do not include any transformation to make opaque content more human understandable. If all of CCO is opaque, then these diffs will be very difficult to understand for many developers using CCO who have not memorized the intended interpretation of well over 1,000 opaque IRIs. Restrictions, in particular, are a challenge because the class and property IRIs that are cited there are far from the annotations that could be used to interpret them. To make the diffs understandable will involve taking the relevant files out of the Git repository, translating them to be understandable, and then doing a diff on them. Then, if adjustments are needed, those need to be translated back into the native format before they are made to the native files. Such use of opaque IRIs thus creates an unnecessary burden on developers and communities reviewing ontology updates. When syntax errors are found in ontology files (or ontology-based data files), tools such as Protégé commonly identify them by references to line numbers in the source files. So, developers have to go back to the source file and find the errors and fix them. The context of such errors is especially difficult to decipher when they are not in close proximity to assertions of labels and definitions in an ontology. It is even worse, when searching for errors in data files where only the IRIs appear for classes and properties. In such cases, a script to transform IRIs to labels could be used if needed."

This is an excellent argument for ontology versioning management software, especially for making a git diff viewer plugin across git products.

Could you spell out the functional requirements of this software? Do you use BitBucket, MS VS Code? Or...?

@BrianHaugh

"But that option is not always available when using other tools and applications. Other tools and applications using ontologies sometimes provide results only in terms of the IRIs of classes, properties and individuals. Although it has been argued that we should use better different tools/applications that display labels, that is not always an option. Some projects/programs stipulate the use of certain software and developers have no choice to substitute an alternative. One such project required use of a commercial information extraction tool that used BFO at the top-level and provided its extracted result data in graphical displays using class and property IRIs. Understanding the graphs required frequent look-ups to BFO using a separate tool to translate its opaque IRIs. The visual graph displays could not be run through a script to make them understandable. If an opaque version of CCO were used in such a context, it would be even more burdensome to decipher since it would be challenging to remember what all the CCO opaque codes mean. These are just some examples that I have experienced of the difficulties that opaque IRIs create for developers. Having translation scripts is not always much help."

This looks like a good argument for getting some software or software plugin that

is open-license
is easy to use
has capabilities to display labels/IRIs as needed for git diffs, graphs, SPARQL queries, and SHACL editors
integrates with existing software, such as Protégé, MS Visual Studio Code, TopBraid Composer, Jena Fuseki, (GraphDB? Anzograph?) or whatever.

Limitations in use, as you mention, fall away when the software is vetted, people depend on it for their workflows, and wider adoption normalizes requests to bring it into different tech environments.
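As a rough illustration of the git diff capability, here is a small sketch that appends labels to diff lines mentioning opaque IRIs. The IRI pattern (ending in ont_ plus digits), the annotate_diff name, and the label dictionary are assumptions for illustration only; a real plugin would read labels from the ontology and hook into the diff viewer, and it would not help with tools that cannot be scripted.

```python
import re

# Assumption: opaque IRIs end in 'ont_<digits>'; adjust to the real pattern.
OPAQUE_IRI = re.compile(r'https?://[^\s<>"]*ont_\d+')


def annotate_diff(diff_text, labels):
    """Append '# label, label' to each diff line that mentions known opaque IRIs."""
    out = []
    for line in diff_text.splitlines():
        names = [labels[iri] for iri in OPAQUE_IRI.findall(line) if iri in labels]
        out.append(f"{line}  # {', '.join(names)}" if names else line)
    return "\n".join(out)
```

The native files are left untouched; only the displayed diff text is annotated, so nothing needs to be translated back.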

neilotte commented 1 month ago

The Governance Board met Friday and re-confirmed the decision in this thread to adopt opaque local identifiers in a future release. This decision effectively closes this ticket. For this reason--and in preparation for an upcoming release of CCO--I am closing this thread.

That said, there were many other considerations raised in the course of this discussion that are worth capturing and continuing to discuss. For this reason, I have opened a discussion topic regarding this move and I have cited this ticket for those who wish to continue this conversation and refer to comments made here. Please see https://github.com/CommonCoreOntology/CommonCoreOntologies/discussions/318 .

I'll note here as well that, as a new curatorial direction moving forward, we will be encouraging use of the issue tracker for concrete recommendations and bug reports for CCO, and encouraging longer-form discussion to occur in the discussion forums. This will allow us to effectively work through all issues on the issue tracker in the course of executing a release process, while still giving matters that warrant prolonged discussion a place to be discussed.

Thanks all.

CommonCoreOntology / CommonCoreOntologies

In IRIs, use opaque identifiers instead of english labels #105