ESIPFed / sweet

Official repository for Semantic Web for Earth and Environmental Terminology (SWEET) Ontologies

Consider numeric/meaning-free URI scheme #49

Closed cmungall closed 3 years ago

cmungall commented 7 years ago

Continued from #48 - if a change in URIs is under consideration then I want to make the case for OBO-style URIs, or some other meaning-free-identifier scheme, e.g. UUIDs or URIs with incrementally assigned numeric fragments.

The reason we do this in almost all biological ontologies (whether OBO or not) is that the URI is associated with the meaning/concept, which remains static, while nomenclature drifts. For example:

image

This has clear practical advantages for biocurators and databases using the ontology. For example, a genome database may want to describe genes in terms of their biological functions using the Gene Ontology. This will be stored as associations between gene IDs and GO IDs. For both IDs, the curator makes a decision based not on terminology, but on the defining features of the gene and the function. For the gene this may be position on the genome, sequence etc. For the function, this is based on the definition provided in GO. Both the gene ID and the GO ID are numeric, and never change (although IDs can be obsoleted, with an audit trail retained). However, the terminology and nomenclature are frequently in flux. If the database stored the association between IDs that had the nomenclature embedded, then either the curator would need to constantly update the records, or the nomenclature would need to be frozen, which would create all kinds of problems.
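To make the point above concrete, here is a minimal Python sketch of associations keyed by opaque IDs. Everything in it is hypothetical: the `EX:` and `GENE:` identifiers, the labels, and the `describe` helper are invented for illustration and are not taken from GO or SWEET.

```python
# Hypothetical data: the IDs and labels below are invented for illustration.
labels = {"EX:0000001": "irrigation"}  # opaque term ID -> current primary label

# The database stores associations between IDs only, never between labels.
associations = [("GENE:42", "EX:0000001")]


def describe(associations, labels):
    """Render each association using whatever label is current."""
    return [(gene, labels[term]) for gene, term in associations]


before = describe(associations, labels)

# The community renames the term; only the label table changes.
labels["EX:0000001"] = "crop irrigation"
after = describe(associations, labels)

print(before)  # [('GENE:42', 'irrigation')]
print(after)   # [('GENE:42', 'crop irrigation')]
```

The stored association is untouched by the rename. With the label embedded in the ID, the same rename would mean rewriting every record that points at the term.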

It's not totally clear what the intent is with SWEET. If the goal is to provide an axiomatized dictionary, then the current strategy of meaningful URI fragments is appropriate. The class really does represent the string. This is also consistent with the way SWEET creates equivalence axioms between named classes (in contrast to OBO where each concept is represented by a single class and alternate terminology is represented as annotation assertions).

Also, it may be that earth science terminology is more static than in the life sciences and medicine, in which case this ticket can be easily ignored.

However, if the classes represent different concepts or language-independent things, you may want to consider decoupling the concepts from the terminology.

Of course, meaningless URIs are not without their downsides. It's much harder to author or read triples without an IDE or appropriate interface. In practice this is less of a hurdle for domain experts, who will not be working with raw turtle in any case.

dr-shorthair commented 7 years ago

I believe the nomenclature drift issue is a much bigger issue in the life-sciences than the earth sciences, particularly in taxonomy. Of course it would be absurd to claim that there is no semantic drift, but the current scope of SWEET is a higher level than individual 'species', genes, organisms, etc. This might change in future, and there may be branches of SWEET where we should flip to numerically based URIs.

So I would agree with the proposition that 'earth science terminology is more static than in the life sciences and medicine', at least in the current scope of SWEET, so I don't think we need a general switch.

A major issue with SWEET as-is is that there are very few definitions or even labels on classes, so the last fragment of the URI does double duty, and that along with the position in the hierarchy/graph is mostly all we have for 'definition'. I raised this in another ticket #20

graybeal commented 7 years ago

While I agree with Chris's principles and motivational arguments, I agree too with Simon's assessment. SWEET is definitely a slow-drift taxonomy. On top of which, the users often won't have access to tools that present the identifiers in a label-first way, so the semantic conflation has a strong social benefit in this case. And, I think it is still a bit more of a dictionary than a model, so those dictionary facets will be better served this way.

Finally, the addition of clean versioning to all the terms will give users the ability to constrain the meanings to "the meanings as of a particular time", and this will help manage the concept drifts that will occur over coming years and centuries. (Much as author-date-ID triples do for taxonomic science today!)

Oh, and someday I think it will be great for SWEET to become fully modeled, with opaque IDs and definitions and multi-lingual prefLabels and everything!

cmungall commented 7 years ago

Thanks for the responses, always good to clarify

pbuttigieg commented 7 years ago

Late to the party, but while developing ENVO, we've seen a great deal of semantic variation behind labels, strongly supporting a meaning-free URI scheme.

The hundreds of (often heavily negotiated) definitions behind "forest" or other systems of socio-ecological interest are an example where there is considerable territorialism about labels. SWEET and ENVO can't really avoid dealing with this unless only one definition is chosen, which would severely limit usage.

In other spheres such as parts of geology, long-standing ambiguity is unlikely to make meaningful URIs sustainable.

graybeal commented 7 years ago

Appreciate Pier's concern, but remember, there are very few definitions in SWEET. To create meaningless IRIs implies that meaning will be added elsewhere, and the only clue we have is the existing label. So the first step would have to be to agree on all the (inherently ambiguous) definitions. In that environment of territorialism that I know exists, this is not just a hard task, but an impossible one. And certainly not feasible by the volunteer group that exists, even with expert help.

Best to think of SWEET as a dictionary of loosely organized terms, useful for tagging and basic identification, but nothing more. To the extent more rigorous concepts are needed, it will be up to the developers of other, more specialized or rigorous ontologies (like ENVO?) to pursue them.

lewismc commented 7 years ago

Have you guys looked at the most recent master branch of SWEET? The URIs are not versioned... is this issue resolved?

graybeal commented 7 years ago

Confused as to why you're asking that question in this thread, which is not about versioning. Do you mean is issue #49 resolved, or is the issue about versioning resolved?

lewismc commented 7 years ago

Sorry @graybeal, my comment is out of place; I had misread the title of the issue.

dr-shorthair commented 6 years ago

+1 to @graybeal's assessment here. In the context of SWEET there will not be hundreds of URIs for overlapping concepts. There will be just one 'forest' which means 'forest as understood in the context of SWEET', possibly with sub-classes for more refined concepts.

Furthermore, SWEET has a >10-year legacy and widespread usage with non-opaque URIs. If we were starting from scratch we might do it differently, but SWEET =/= OBO - it occupies a different (smaller?) niche.

brandonnodnarb commented 6 years ago

+1 to @graybeal and @dr-shorthair

cmungall commented 4 years ago

It's been over 2 years since I opened this. I remain unconvinced that you will have no need to modify the labels in the URI fragments. In fact we can see this happening here: https://github.com/ESIPFed/sweet/pull/187/files#diff-2a5c51b920f1afe55f0bac9e7e120c3cL47

Even if community nomenclature does not drift (which I am very skeptical about) there is often a need to change primary labels in an ontology to avoid ambiguity, conform to standard naming patterns, etc.

But if the community's decision is not to implement opaque URIs, go ahead and close this. However, I do recommend that you add guidelines to https://github.com/ESIPFed/sweet/blob/master/CONTRIBUTING.md for what to do when a concept does need to change its URI fragment, or when a concept is to be obsoleted. And also an end-user guide on what to do when URIs change, to avoid referential integrity-like violations.

See for example the GO editors guide on obsoleting an ID/URI: http://wiki.geneontology.org/index.php/Obsoleting_an_Existing_Ontology_Term
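As a rough illustration of what such an end-user guide could describe, here is a Python sketch that resolves a possibly obsoleted ID to its current replacement. This is not the GO procedure itself: the term IDs, the `deprecated` and `replaced_by` field names, and the `resolve` helper are all assumptions made for the example.

```python
# Hypothetical registry: IDs and field names are assumptions for illustration.
terms = {
    "EX:0000001": {"label": "obsolete floe", "deprecated": True, "replaced_by": "EX:0000007"},
    "EX:0000007": {"label": "ice floe", "deprecated": False, "replaced_by": None},
}


def resolve(term_id, terms):
    """Follow replaced_by links until a non-deprecated term is reached."""
    seen = set()
    while True:
        if term_id in seen:
            raise ValueError(f"replacement cycle at {term_id}")
        seen.add(term_id)
        record = terms[term_id]  # a KeyError here means the ID was never minted
        if not record["deprecated"]:
            return term_id
        if record["replaced_by"] is None:
            raise LookupError(f"{term_id} is obsolete and has no replacement")
        term_id = record["replaced_by"]


print(resolve("EX:0000001", terms))  # EX:0000007
```

Because obsoleted records are kept rather than deleted, data that still points at an old ID can be repaired mechanically instead of silently dangling.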

rduerr commented 4 years ago

Community nomenclature does drift in the Earth Sciences! I don’t know where the idea that it doesn’t came from; but there are plenty of examples in the GCW compilation where nomenclature (even spelling) has changed over time (and for that matter over location, discipline, and other variables as well).

I would be very happy to argue for opaque URIs and labels that might help with this; but we also need the ability to note what I might call “realms of applicability” to help those trying to combine semantics with NLP/ML. In other words, the semantics should let you know that prior to date X, this set of entailments was called Label1 in the literature; but after date X the term Label2 was more commonly used.

Ruth


smrgeoinfo commented 4 years ago

Yes, labels for concepts sometimes change. From a practical point of view of a developer or knowledge engineer working with instance data, opaque URIs are a nightmare. There isn't really any middle ground: someone has to take the pain, either the low-level user or the ontology manager.

rduerr commented 4 years ago

I think that depends on the tooling available (which is pretty poor at the moment I agree). In any case, I am not sure who your low-level users are but I tend to think that dealing with complexity and making that transparent to end users is the job of developers and knowledge engineers. I am not sure where anyone would put me, but I am used to opaque URIs and they don't bother me either at the instance or class level, etc....

cmungall commented 4 years ago

For developers: json-ld contexts and analogous mechanisms are your friends. Even turtle prefixes. Just declare a string to URI mapping in the header, then use those strings locally in documents/code. Of course, it's still good to sync these with any label changes in the ontology but refactoring is pretty easy.
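A minimal sketch of the idea, in Python with only the standard library. The opaque URI, the example document, and the `expand_type` helper are hypothetical, and the helper is a naive stand-in for what a real JSON-LD processor does when it expands `@type` against a context.

```python
import json

# Hypothetical context: the opaque URI below is invented for illustration.
context = {"Irrigation": "http://sweetontology.net/C0000001"}

# Documents and code use the readable local string, not the opaque URI.
doc = {
    "@context": context,
    "@id": "http://example.org/site/7",
    "@type": "Irrigation",
}


def expand_type(doc):
    """Naive stand-in for JSON-LD expansion: swap the local @type for its URI."""
    ctx = doc["@context"]
    expanded = {key: value for key, value in doc.items() if key != "@context"}
    expanded["@type"] = ctx.get(doc["@type"], doc["@type"])
    return expanded


print(json.dumps(expand_type(doc), indent=2))
```

If the ontology later changes the primary label, only the key in `context` needs refactoring; the URI it points to stays the same.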


cmungall commented 4 years ago

Also, often there is no need for a developer to know or care about many of the concepts in a large ontology like sweet. Maybe the upper levels may materialize in code, but I don't really know your use case...


cmungall commented 4 years ago

Good to know!

I like the idea of including knowledge about usage with lexical elements (e.g. as axiom annotations)
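One way to picture usage metadata on labels is a small Python sketch like the one below. It is a plain data-structure illustration, not OWL axiom annotation syntax, and the term ID, the `Label1`/`Label2` placeholders, the cut-over date, and the `valid_from`/`valid_until` field names are all assumptions.

```python
from datetime import date

# Hypothetical term: the ID, labels, and cut-over date are placeholders.
label_history = {
    "EX:0000042": [
        {"label": "Label1", "valid_until": date(1990, 1, 1)},  # used before date X
        {"label": "Label2", "valid_from": date(1990, 1, 1)},   # used from date X on
    ]
}


def label_at(term_id, when, history):
    """Return the label that was in common use for term_id on the given date."""
    for entry in history[term_id]:
        start = entry.get("valid_from", date.min)
        end = entry.get("valid_until", date.max)
        if start <= when < end:
            return entry["label"]
    return None


print(label_at("EX:0000042", date(1985, 6, 1), label_history))  # Label1
print(label_at("EX:0000042", date(2005, 1, 1), label_history))  # Label2
```

An NLP pipeline reading older literature could use the publication date to pick which label to match against, while the term ID stays fixed.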


graybeal commented 4 years ago

Sorry, @cmungall, can you give or point to an example? I think I understand what these are, but it breaks my head to try to think how they could handle all the issues of seeing opaque IRIs with current tools.

Not that I'm saying I love non-opaque labels (well, I sort of do), but the alternative still feels very painful. (And not appropriate if one is constructing a 'dictionary'...)

For developers: json-ld contexts and analogous mechanisms are your friends. Even turtle prefixes. Just declare a string to URI mapping in the header, then use those strings locally in documents/code. Of course, it's still good to sync these with any label changes in the ontology but refactoring is pretty easy.

cmungall commented 4 years ago

On Thu, Mar 12, 2020 at 11:09 AM John Graybeal notifications@github.com wrote:

Sorry, @cmungall, can you give or point to an example? I think I understand what these are, but it breaks my head to try to think how they could handle all the issues of seeing opaque IRIs with current tools.

It's slightly orthogonal. Annotating usage metadata on synonyms is not intended to address any problem of opaque IRIs. It would just be a useful thing to have, precisely because terminology is labile.

The way in which this would be done would be the same as in all OBOs, axiom annotation on the synonym annotation.

But this is all a tangent to the main thread. Maybe start a separate issue on annotating metadata on synonyms and other annotations?

Not that I'm saying I love non-opaque labels (well, I sort of do), but the alternative still feels very painful. (And not appropriate if one is constructing a 'dictionary'...)

not following...

cmungall commented 4 years ago

Can we advance things towards closing this ticket?

There are two possible outcomes

  1. SWEET retains its current scheme of meaningful URIs
  2. SWEET changes to use a numeric schema, as in OBO

Alternatively, the SWEET governing group may decide more information is required, either from the community, or technical information on how this may work. Happy to help with that.

If 1 is chosen, we need an SOP for changing labels (given that we all agree that environmental nomenclature is not fixed, see comments above). Do we fix the URIs and change the rdfs:label? Do URIs get deprecated with a new URI being minted? It would be great if proponents of 1 could come up with a proposal (e.g. PR on editor docs).

If 2 is chosen, this will be an initially disruptive change, and you will want to retain a mapping from the original URIs to the new OBO-style URIs. You will need an SOP for label changes but you can just adopt OBO's.
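For option 2, the retained mapping could be as simple as the following Python sketch. The `C` prefix, the 7-digit width, the second example URI, and the `mint_mapping` helper are assumptions for illustration, not an agreed SWEET scheme.

```python
def mint_mapping(old_uris, base="http://sweetontology.net/", prefix="C", width=7):
    """Assign sequential opaque URIs to existing URIs, in sorted order."""
    return {
        old: f"{base}{prefix}{number:0{width}d}"
        for number, old in enumerate(sorted(old_uris), start=1)
    }


old_uris = [
    "http://sweetontology.net/humanAgriculture/Irrigation",
    "http://sweetontology.net/example/Forest",  # hypothetical URI
]
mapping = mint_mapping(old_uris)

for old, new in sorted(mapping.items()):
    print(old, "->", new)
```

The mapping would be generated once and then stored, so that existing data can be rewritten or redirected. Terms added later would take the next free number rather than being re-sorted into the sequence.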

graybeal commented 4 years ago

I'll come up with a proposal for how to deal with evolving meaning of terms under #1, if proponents of #2 come up with a proposal for how the casual URI user (I suspect a high percentage of ESIP users are in this category) can make effective use of numeric schema without giving up.

As a (very personal) practical matter, I find numeric schema unhelpful for >90% of my ontology work, and I know from past conversations that many users expect the identifiers to work like dictionaries ("What does this mean today?"), not like foundational models. But I have to admit, I have no idea if "many" is 25% or 75%. And (again, just for me), this change would mean I can't quickly contribute to SWEET improvement. (This might be a net gain for SWEET, my contributions aren't so deep.)

I do see the semantic value of opaque labels, and as I said above they may be great at some point. I just think it's in SWEET's future, given inadequate tooling.

cmungall commented 4 years ago

Happy to help with 2, but I need help understanding the user profile of SWEET (whether this is contributor, data-scientist, domain-scientist, software developer) and why they would be impacted.

I would expect most non-techy people to access the ontology through something like OntoPortal or OLS. The nice thing about these is they allow the users to search via rdfs:labels, and to display things using these too.

A subset of these users may be contributors. The more advanced ones could edit using Protege (which is easy to configure to use rdfs labels).

I would expect most techie people (data scientists, tool developers, infrastructure developers) to write largely generic code that doesn't make any assumptions about terms in the ontology. Examples: NLP tools, browsers, database loaders. There may be cases where it's useful to surface some IDs. E.g if I am making an ML app that tries to predict terrestrial features from satellite data, where the labels come from a particular branch of the ontology I may need to surface the root node ID of that branch in some config file or directly in code. Is this the kind of thing you have in mind?

My assumptions may be way off. I don't have a strong sense of how SWEET is used, what kinds of datasets are annotated using it, or which triplestores/KGs use SWEET IRIs.

rrovetto commented 4 years ago

In general, #1 and #2 are both viable. But #2 should either be changed to mention the various possible meaningless identifiers (e.g., random numerical, random alphanumeric, sequential numerical, alphanumeric, as well as Dr. Lord's identifier work, etc.), or a #3, #4, ... #n should be added for each. Various points from each person who commented are on point (e.g., Dr. Graybeal's comments of Oct 2 and Sep 21, 2017, and those of others).

Drift in meaning is important to consider. Perhaps including definitions for all SWEET terms is not best, needed, or desired. Perhaps that agility and flexibility is best. Perhaps a framework in which a change in the meaning or definition of a term is shown by temporally tagging it (as mentioned in another post) would help. Perhaps including various source- or group-specific meanings or definitions will be helpful. The possibility of dynamic meaning or dynamic definitions for a dynamic ontology is worth considering. For those SWEET terms where definitions are not desired or possible, meaningful URIs are needed. A few things to consider. In any case, happy to help.

lewismc commented 4 years ago

@cmungall I suspect we need a PURL system in place (or need to utilize one) in order to move on with # 2 ?

smrgeoinfo commented 4 years ago

There is no perfect solution. Using opaque URIs really depends on having a suite of tools for non-technical users that insulate them from having to figure out what some random URI means, or alternatively having users who are technically savvy enough to know how to deal with the opacity. The tools cost $$money$$, and educated users cost $$$money$$$. SWEET has a bunch of URIs already defined/registered; many of these don't have definitions, so their actual semantics are in many cases up to the user.

Seems to me the way forward is for more modular ontologies that define logically coherent (computable) semantics for more constrained domains to use opaque URIs, and map to existing 'fuzzy' SWEET classes for compatibility/integration.

graybeal commented 4 years ago

With COR (set up in the way SWEET is), you don't need a PURL system. The opaque URIs work the same way the non-opaque URIs work: the unique identifier fragment is just an opaque string instead of the name. It gets prepended by the SWEET path, and DNS forwarding and content resolution at that location send it to the appropriate COR page. Magic.

cmungall commented 4 years ago

@lewismc you could opt to keep using URLs similar to those you have now. E.g., you now have http://sweetontology.net/humanAgriculture/Irrigation as a class URL; you could go for URLs like http://sweetontology.net/humanAgriculture0000001 or http://sweetontology.net/C0000001 (depending on whether you expect to need to transfer concepts between branches). You could take this opportunity to stick a purl. into the URL to indicate permanence. But you can make do with your existing infrastructure, i.e., redirecting to http://cor.esipfed.org/

cmungall commented 4 years ago

@smrgeoinfo is there a list of tools that would need to be migrated? Most of the ontology tools/portals etc I am familiar with already bake in the assumption of opaque IDs, but I'm coming from a very different domain and set of use cases.

SWEET has a bunch of URIs [...] many of these don't have definitions, so their actual semantics are in many cases up to the user

This sounds kind of terrifying to me but again, different community!

pbuttigieg commented 4 years ago

Seems to me the way forward is for more modular ontologies that define logically coherent (computable) semantics for more constrained domains to use opaque URIs, and map to existing 'fuzzy' SWEET classes for compatibility/integration.

The modular approach is the thrust of the Federation of ontologies we're discussing, based on the OBO model, but dedicated to Earth (and perhaps space) science. SWEET is actually a set of modules.

The mapping you describe is already happening in our SWEET/ENVO cryohackathons - fuzzy SWEET URIs are mapped to ENVO classes (with "opaque" IRIs, in my experience for very good reasons), which are computable.

dr-shorthair commented 3 years ago

Changing to opaque URIs at this stage would break all existing uses of SWEET. To me that is a deal-breaker with a system that has been in use with non-opaque URIs for >10 years. I also fear that it misplaces the role of SWEET, which is more folksonomy than fully axiomatised ontology.