comments on SWEET annotation convention

ESIPFed / sweet

Official repository for Semantic Web for Earth and Environmental Terminology (SWEET) Ontologies

comments on SWEET annotation convention #183

Open graybeal opened 4 years ago

graybeal commented 4 years ago

A bug report assumes that you are not trying to introduce a new feature, but instead change an existing one... hopefully correcting some latent error in the process.

Description

My take on SKOS definition comments

What would you like to see changed?

https://github.com/ESIPFed/sweet/wiki/SWEET-Annotation-Convention says if I have a comment I should create an issue, so this is my comment-issue (perhaps not really a bug, but there wasn't a good choice of templates for this).

I think we want to be able to support multiple definitions, and the only way to do that is with a well annotated definition that includes things like the source and time of creation. The second example on the page, that just shows a definition, will preclude any further definitions, or create a lot of ambiguity if you add them.

I take issue with the comment "Most of the resource metadata is packaged into the skos:definition collection node as opposed to the parent node." The resource metadata that is packaged into the skos:definition collection node is about the definition (text), and the definition (text) is about the resource. Go ahead and add additional annotations about the resource itself if you want, but in SWEET so far there has been little to say about resources. It would not be a bad thing to add some metadata about the resource's contributor, date contributed, and possibly an xref or something else indicating a source for that concept. But that's really different from the definition, which will (by our previous agreement) almost always come from an external source.
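
To make the multiple-definitions point concrete, here is a minimal Turtle sketch of one class carrying two separately annotated definitions. It follows the node-per-definition pattern discussed in this thread, but it is only an illustration: the `ex:` namespace, the contributor IRI, the second source IRI, the dates, and the definition texts are all placeholders, not content from SWEET or the wiki.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://example.org/sweet-sketch/> .   # hypothetical namespace

ex:Dataset a owl:Class ;
    rdfs:label "dataset"@en ;
    # Metadata about the resource itself: who contributed the concept, and when.
    dcterms:creator <http://example.org/people/contributor-1> ;   # placeholder
    dcterms:created "2019-08-16"^^xsd:date ;                      # placeholder
    # First definition: annotations inside the node describe the definition text.
    skos:definition [
        rdfs:comment "Definition text taken from the first source."@en ;
        dcterms:source <http://dbpedia.org/resource/Data_set> ;
        dcterms:created "2019-08-16"^^xsd:date
    ] ;
    # Second definition, from a different (placeholder) source.
    skos:definition [
        rdfs:comment "Definition text taken from the second source."@en ;
        dcterms:source <http://example.org/glossary/dataset> ;
        dcterms:created "2020-01-20"^^xsd:date
    ] .
```

Because each definition node carries its own source and creation date, a consumer can tell the two texts apart; a bare string value for skos:definition would not allow that.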

lewismc commented 4 years ago

@rduerr FYI

lewismc commented 4 years ago

I think we want to be able to support multiple definitions,

I agree with that. The point I made (when this came up during ESIP) was that definitions for the same thing can mature/change over time. Let's take for example the definition of a window. There are multiple definitions here.

and the only way to do that is with a well annotated definition that includes things like the source and time of creation.

I agree with this. My response is that we need to review the code and suggest updates... which is what will probably happen once we have the convention ironed out.

The second example on the page, that just shows a definition, will preclude any further definitions, or create a lot of ambiguity if you add them.

I agree!

I take issue with the comment ... But that's really different from the definition, which will (by our previous agreement) almost always come from an external source.

@graybeal for me, you've hit the nail on the head. I completely agree with this. I would therefore state that an expanded proposal could arguably be:

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix sorepdp: <http://sweetontology.net/reprDataProduct/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

###  http://sweetontology.net/reprDataProduct/Dataset
sorepdp:Dataset rdf:type owl:Class ;
               rdfs:subClassOf sorepdp:DataProduct ;
               rdfs:label "dataset"@en ;
               rdfs:comment  "This provides the motivation and justification for adding the concept."@en ;
               dcterms:created "2019-08-16T11:35:21.06Z"^^xsd:dateTimeStamp ;
               dcterms:modified "2019-08-17T11:35:21.06Z"^^xsd:dateTimeStamp ;
               dcterms:creator <http://orcid> ;
               # //prov:wasDerivedFrom below is optional
               prov:wasDerivedFrom <URI> ;
               skos:definition  [ 
                   rdfs:comment  "This provides the actual definition..."@en ;
                   dcterms:source <http://dbpedia.org/resource/Data_set> ;
                   dcterms:created "2019-08-16T11:35:21.06Z"^^xsd:dateTimeStamp ;
                   dcterms:modified "2019-08-17T11:35:21.06Z"^^xsd:dateTimeStamp ;
                   dcterms:creator <http://orcid> ;
                   # //prov:wasDerivedFrom below is optional
                   prov:wasDerivedFrom <URI>
             ] .

I will also state that I updated https://github.com/ESIPFed/sweet/wiki/SWEET-Annotation-Convention with feedback from everyone so far who has been involved in establishing a convention for how we do this in SWEET. This IS NOT MY OWN preference. It was my attempt at pinning down some convention narrative which we could review and hopefully, eventually agree upon.

Let me reiterate: at the ESIP Winter meeting earlier this month there was healthy (pretty strong) disagreement, in particular from @smrgeoinfo, and now @rduerr is shadowing that (which I welcome). By hashing this out together we will end up with a better result.

Thanks @graybeal for bringing this to the issue tracker.

rduerr commented 4 years ago

On the subject of multiple definitions, I am totally opposed to this! Entities with different definitions need different URIs and axioms! We need ontological mechanisms to say that thermokarst means blah, blah, blah to the permafrost community, while thermokarst means something entirely different to the geology community. In ENVO we did this by labeling the terms differently: thermokarst process vs thermokarst landform. However, it isn't at all clear that this will work in all cases. It would be convenient to be able to call thermokarst a synonym for both versions.
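
As a rough sketch of the pattern described above (distinct identifiers and labels, with the shared word kept as a synonym on both), something like the following Turtle could work. The `ex:` namespace and class names are hypothetical and are not the actual ENVO IRIs; using skos:altLabel for the synonym is one possible choice, not an agreed SWEET convention.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/sweet-sketch/> .   # hypothetical namespace

# Two different concepts, so two different IRIs and two different labels.
ex:ThermokarstProcess a owl:Class ;
    rdfs:label "thermokarst process"@en ;
    skos:altLabel "thermokarst"@en .    # shared synonym

ex:ThermokarstLandform a owl:Class ;
    rdfs:label "thermokarst landform"@en ;
    skos:altLabel "thermokarst"@en .    # same synonym, different concept
```

A label search for "thermokarst" would then return both classes, while axioms and definitions stay attached to one unambiguous IRI each.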

rduerr commented 4 years ago

I think we want to be able to support multiple definitions,

I agree with that. The point I made (when this came up during ESIP) was that definitions for the same thing can mature/change over time. Let's take for example the definition of a window. There are multiple definitions here.

Words can indeed have different meanings, and yes, meanings drift over time. However, unless those different meanings and drifts are captured individually with their own URIs, no reasoning can take place!

and the only way to do that is with a well annotated definition that includes things like the source and time of creation.

I agree with this. My response is that we need to review the code and suggest updates... which is what will probably happen once we have the convention ironed out.

I support the well-annotated definition concept. I like the way ENVO does it.

graybeal commented 4 years ago

@rduerr I appreciate the motivation behind the "one precise definition for each term" model, and think it is wonderful when achievable. I would like to present the case for adopting a different goal for SWEET.

When SWEET was created, I think it was driven by a community sensibility in capturing terms, but not a rigorous model for understanding and organizing those terms. There were, and are, many cases where close analysis and discussion would lead to debate about not just the exact meaning of terms, but also their organization, their relations, and even their subsumption hierarchies. Such a close analysis was clearly not performed throughout SWEET, even in version 2.0; and of course there are no definitions, and not too many relations.

There is a practical reality behind this: that type of semantic rigor requires extensive and sometimes painful collaboration; conclusions are not always reachable in a given time; and resources were not, and are not, widely available for the task. Even once achieved, the semantic content must be maintained as scientific terminology evolves to follow the underlying science. This is an expense that I think is not suitably applied to a resource like SWEET with its community origins—the arguments around definitions would be much ado, but I think not likely to add sufficient rigor while satisfying all the community's views, and needs, of SWEET.

Instead, I think of SWEET as a generalist resource, excellent for making clear associations to specific terms, but not necessarily trying to make those terms perfect—only as well described as we in the broadest science community describe things. Which is why I think having multiple definitions from different sources for each term would be a strength, not a weakness, while also being much more readily supported from technology and community perspective.

dr-shorthair commented 4 years ago

I look at it this way: a concept can persist (and thus the URI); but what we know about it (and thus the textual description) might change for good reasons that do not undermine the integrity of the concept.

As a practical example, consider the boundaries between the intervals in the geologic time scale. Even in my lifetime the temporal position of the 'base of the Cambrian' has been re-calibrated from ca. 570 Ma to 541+/-1.0 Ma. The concept did not change, and was used continuously in geology papers with no real ambiguity during this period. But what we knew about it did, so the textual account has changed as well.
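
A minimal Turtle sketch of this "one URI, evolving description" pattern, assuming a hypothetical `ex:` namespace and using skos:historyNote to keep the superseded calibration visible (that property choice is an assumption, not something the thread agreed on):

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix ex: <http://example.org/sweet-sketch/> .   # hypothetical namespace

# The IRI and label stay stable; only the textual account is updated.
ex:BaseOfCambrian a skos:Concept ;
    skos:prefLabel "base of the Cambrian"@en ;
    skos:definition "Lower boundary of the Cambrian, currently calibrated at 541 +/- 1.0 Ma."@en ;
    skos:historyNote "Earlier calibrations placed this boundary at ca. 570 Ma."@en .
```

Data that already references ex:BaseOfCambrian keeps working after the recalibration, which is the point of separating the identifier from the description.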

dr-shorthair commented 4 years ago

(For fun, take a look at https://www.nature.com/articles/187035a0.pdf where Holmes picks 600+/-20 Ma as 'a reasonable estimate' in 1960.)

rduerr commented 4 years ago

Ah yes, but that is a different use case than mine. In your use case, the concepts are agreed to by the community, even if their values change over time. That isn't my use case.

One of the major things I learned from the Global Cryosphere Watch glossary experience is that the concepts themselves are different in different communities and change over time. This is kind of like the definition of "bear" - does it mean an animal, something women in labor do, or what?

Or, for a cryosphere example, take the terms ice sheet and glacier. The question is whether an ice sheet is a glacier, a glacier is an ice sheet, or whether these are two separate, sibling concepts. Depending on which subdiscipline of the cryosphere a researcher comes from, only one of these definitions is correct, but each of the three is valid in some subdiscipline. If I remember correctly, both terms are in SWEET. So now what is the relationship in SWEET? What are the definitions?
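
The three competing readings can be written out side by side. The Turtle below is only a sketch: the `viewA:`, `viewB:`, and `viewC:` namespaces stand for three hypothetical subdiscipline vocabularies, and `viewC:IceBody` is an invented common parent. None of these are SWEET or ENVO IRIs.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
# Hypothetical namespaces, one per subdiscipline view.
@prefix viewA: <http://example.org/view-a/> .
@prefix viewB: <http://example.org/view-b/> .
@prefix viewC: <http://example.org/view-c/> .

# View A: an ice sheet is a kind of glacier.
viewA:Glacier  a owl:Class .
viewA:IceSheet a owl:Class ; rdfs:subClassOf viewA:Glacier .

# View B: a glacier is a kind of ice sheet.
viewB:IceSheet a owl:Class .
viewB:Glacier  a owl:Class ; rdfs:subClassOf viewB:IceSheet .

# View C: the two are sibling concepts under a common parent.
viewC:IceBody  a owl:Class .                 # invented parent class
viewC:Glacier  a owl:Class ; rdfs:subClassOf viewC:IceBody .
viewC:IceSheet a owl:Class ; rdfs:subClassOf viewC:IceBody .
```

If views A and B were asserted about the same pair of IRIs, a reasoner would infer that the two classes are equivalent, which none of the three communities intends. That is why a single pair of IRIs cannot carry all three readings at once.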

As for SWEET being a generalist resource, I guess there is really only one way to go:

Provide multiple definitions for the SWEET term, each of which turns into separate terms with their own URIs in other ontologies like ENVO, which needs to be more specific than SWEET since it actually gets down to the level of data. This rather complicates the harmonization effort, but might be doable. It also leaves the question of how specific communities are supported at the SWEET level (i.e., discovery will provide more irrelevant results than necessary).

dr-shorthair commented 4 years ago

I certainly agree if the concept is different, then it should indeed be denoted by a different URI (IRI). Fortunately for RDF resources (including owl:Class or skos:Concept) the identifier is separate from the label, and the description/definition separate again. I'm pretty sure we are in agreement here, but our different use-cases do illustrate that there are different patterns that can be accommodated by the same tools. Your use case is many-URIs with different definitions, but with the same label. My use case is one URI, one label, but an evolving definition.

smrgeoinfo commented 4 years ago

If SWEET is intended as a 'generalist' tool with identifiers for words having multiple, not necessarily consistent definitions, I wonder why not just use wikipedia or merriam-webster URIs for the words.

If on the other hand, SWEET Identifiers identify concepts, denoted for people to understand using logically coherent and consistent definition text, it becomes possible to use the SWEET URIs to convey meaning and support machine reasoning. This is what I was hoping the goal to be: one concept, one identifier, one definition, and some source annotation; for human communication it helps if each concept also has a unique label ('term'). As has been pointed out, definitions can evolve over time, and the maintenance challenge is to determine when it has evolved to the point that it becomes a new concept that should have a new identifier and ideally some links back to its predecessor concept/s.
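
One way the "new identifier with links back to its predecessor" step could look in Turtle is sketched below. The `ex:` namespace and both class names are hypothetical, and the choice of dcterms:replaces, dcterms:isReplacedBy, and owl:deprecated is an assumption about suitable properties rather than an established SWEET practice.

```turtle
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex: <http://example.org/sweet-sketch/> .   # hypothetical namespace

# The old concept is kept resolvable, flagged as deprecated, and points forward.
ex:ConceptV1 a owl:Class ;
    rdfs:label "concept (earlier sense)"@en ;
    owl:deprecated true ;
    dcterms:isReplacedBy ex:ConceptV2 .

# The new concept gets its own IRI and points back to its predecessor.
ex:ConceptV2 a owl:Class ;
    rdfs:label "concept"@en ;
    dcterms:replaces ex:ConceptV1 .
```

Keeping the deprecated IRI in place means existing references do not break, while the forward link tells consumers where the current concept lives.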

graybeal commented 4 years ago

If SWEET is intended as a 'generalist' tool with identifiers for words having multiple, not necessarily consistent definitions, I wonder why not just use wikipedia or merriam-webster URIs for the words.

I find definitions from separate semantic resources are almost never fully the same, even when "authorities" declare they are the same concept. What you get with your own URIs and terms is control over the definitions, so if/when one of your sources goes off the rails, you can decide that's not an appropriate definition any more. And with your own URIs you can reference multiple definitions, and that buys users a blended understanding of the concept, not so narrow as any single resource. If those definitions contain actual conflict, you probably should resolve the conflict; but emphasizing different facets of the concept can be a strength.

If on the other hand, SWEET Identifiers identify concepts, denoted for people to understand using logically coherent and consistent definition text, it becomes possible to use the SWEET URIs to convey meaning and support machine reasoning.

It is still possible to do that as well, to the extent the community can do so, and there are many ways in which the existing logical relations in SWEET already have guided our understanding of the terms, including identifying logical inconsistencies that we could then fix. I think any logic that we can add is great, I just think SWEET is already a useful resource while the improvement to it continues.

the maintenance challenge is to determine when it has evolved to the point that it becomes a new concept that should have a new identifier and ideally some links back to its predecessor concept/s.

Totally agree, whatever path we choose there will be a maintenance activity. By using external definitions for many/most of the SWEET concepts, we limit that activity to review of content, not generating content. And I think that approach scales to a degree that writing our own definitions will not scale for some time yet.

rduerr commented 4 years ago

So is this an unsolvable impasse? We really do need to scope what SWEET is supposed to be, since there are a lot of harmonization activities going on.

graybeal commented 4 years ago

We really do need to scope what SWEET is supposed to be, since there are a lot of harmonization activities going on.

I agree Ruth. I don't think it's an unsolvable impasse.

As for SWEET being a generalist resource, I guess there is really only one way to go:

Provide multiple definitions for the SWEET term, each of which turns into separate terms with their own URIs in other ontologies like ENVO, which needs to be more specific than SWEET since it actually gets down to the level of data. This rather complicates the harmonization effort, but might be doable. It also leaves the question of how specific communities are supported at the SWEET level (i.e., discovery will provide more irrelevant results than necessary).

Agree again, if we agree it is a generalist resource, but one that can evolve over time into a more powerful generalist resource. (Alternatively, the principals could agree they want a precise model toward a particular objective; but they would have to define the objectives, and likely allow time for considerable work before SWEET as a whole reaches that objective.)

ENVO which needs to be more specific than SWEET since it actually gets down to the level of data

I think SWEET's sweet spot is the traversal of domains, not the depth, but that it can get deeper as the consensus models get deeper.

If SWEET is intended as a 'generalist' tool with identifiers for words having multiple, not necessarily consistent definitions…

To be clear, I don't want definitions that are hugely different ('bear' noun vs 'bear' verb). It's not compelling value and not different than a dictionary. (If we want to build a dictionary, OK then.)

I am OK with inconsistencies like 'base of the Cambrian' being slightly recalibrated, or one definition of 'autonomous vehicle' saying "always operates without human oversight" and another "usually operates without human oversight". For some applications/users, the multiple meanings in each example represent fundamental differences; in most cases, the similarities outweigh the differences, and the differences in those definitions can illuminate the concept as it exists in the real world.

So my proposal is this:

Encode as much in SWEET as we can—in any combination of definitions, properties, or other annotations—so long as it remains a fundamentally consistent and maintainable framework, both internally and with common usage. There may be some phrases with very different meanings in science; like WikiPedia, we can discriminate them with the identifier (e.g., by adding the contextualization to it, or otherwise creating a different ID), but not with the label.

This would be a human exercise and not a precise one. It would discourage SWEET from being the authoritative model for a particular domain (because that can involve deciding 'which model is right'), but allows SWEET to map to different models that carry roughly the same concepts. (And if desired, point to sources that are in conflict.) While there's a risk some people would add too much non-maintainable detail, that behavior might self-correct over time.

I hope that is consistent with the goals of all the harmonization work that has already been going on, because there is great value in that work (both in itself, and for SWEET) to the extent it can capture a consensus set of concepts. I don't think SWEET is a great place for 'one model to rule them all' (assuming that were even possible), because SWEET starts from such a different place. But as 'one consolidated expression of many existing expressions' I think it would be an awesome resource.

And best of all, you could imagine getting there from here, with each harmonization exercise and community moving the whole forward.

smrgeoinfo commented 4 years ago

@graybeal I think you present a viable approach to thinking about what SWEET is for. The critical question for the community is (still) what is the benefit of effort to support SWEET relative to working directly on e.g. WikiPedia, which as far as I can tell, is doing that same thing with a bigger community footprint, or leaning on the American Geological Institute (AGI) to create a web resource for the AGI Glossary (with URIs, SKOS encoding...)?

My guess is the answer to the benefit question would be 'SWEET is already widely adopted', but what are the data on its adoption, usage in operational applications?

Extracts from the discussion; brackets '[]' are my glosses, questions... SWEET purpose:

1. map [terms] to different models that carry roughly the same concepts [this appears to be the ongoing harmonization activity]
2. SWEET's sweet spot is the traversal of domains
3. reference multiple definitions, and that buys users a blended understanding of the concept, not so narrow as any single resource.
4. If ... definitions contain actual conflict, .... resolve the conflict; [?different URIs?]

lewismc commented 4 years ago

Hi @smrgeoinfo I updated your comment such that it has numbers... that will make it easier for me to respond.

... or leaning on the American Geological Institute (AGI) to create a web resource for the AGI Glossary (with URIs, SKOS encoding...)?

Geology is included in SWEET but it is one part of the wider knowledge space. I don't know much about the AGI and how active/interested they are in that kind of work. I would really appreciate your take on that... I am in the dark here.

My guess is the answer to the benefit question would be 'SWEET is already widely adopted', but what are the data on its adoption, usage in operational applications?

We do have server logs from the COR server and can subset HTTP requests for only SWEET resources. Would you like me to do this and find a way to publish the results? That is my best suggestion at this point. SWEET is used in many OGC documents/efforts. For example, the 2015 Testbed-11 Catalogue Service and Discovery Engineering Report, the Linked And Networked DRoneS project LANDRS, Observations and Measurements - XML Implementation and a bunch of other pretty high profile OGC efforts. What is needed here is for someone (probably me) to find all of these documents/specifications and to update the SWEET IRIs to the new sweetontology.net. I am going to work with George Percivall to quantify this task before I do anything extravagant.

1. Yes, I agree. Right now this is the case because SWEET provides a form** of semantic structure but little semantic meaning. Obviously the structure is valuable enough, as it specifies relationships which would otherwise be lost.
2. Yes, that's what I am trying to get at above.
3. That makes sense to me; however, I thought we were not using the 'definition' terminology if there was more than one definition for a given resource...?
4. Yes.

** By this I mean the 9 top-level modules, e.g. human, matr, phen, proc, prop, realm, rela, repr and state. The problem I have is that it was never really made clear why SWEET is structured this way... or at least I have yet to be convinced as to why this knowledge organization scheme/design pattern was chosen.

lewismc commented 4 years ago

I should also say that if SWEET were to simply become part of Wikidata or WikiPedia or something, then I would want it to be at least available via a named graph... I don't anticipate that this would happen, but I may be wrong. We would need to be instrumental in achieving that.

rduerr commented 4 years ago

I don't think AGI has a broad enough scope to do all of SWEET. The cryospheric questions we've already looked at demonstrate that at least!

I don't know that Wikidata or WikiPedia are controllable enough to qualify as logical homes for SWEET.

I think 1-4 above are fine, though we need to document assumptions about the meaning of terms in the SWEET hierarchy as they currently exist. The harmonization efforts have assumed things like if it is in the process hierarchy then it must be a process, so if the term has multiple meanings map it to a term that is a process. I think that is the appropriate thing to do. I note that for item 4 above, that may mean that a SWEET term in the process branch might get mapped to a term with a label like: "[SWEET term name] process" in order to distinguish for example verb versions from noun versions, etc.
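
A sketch of what such a mapping might look like in Turtle follows. Everything in it is illustrative: the `sw:` and `tgt:` namespaces are hypothetical, the term name is borrowed from the thermokarst example earlier in this thread without claiming SWEET has such a term in its process branch, and skos:closeMatch is just one candidate mapping property.

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix sw: <http://example.org/sweet-like/proc/> .      # hypothetical process branch
@prefix tgt: <http://example.org/target-ontology/> .     # hypothetical target ontology

# The term sits in the process branch, so it is read as a process
# and mapped to the process sense in the target ontology.
sw:Thermokarst rdfs:label "thermokarst"@en ;
    skos:closeMatch tgt:ThermokarstProcess .

# The target label carries the disambiguating suffix.
tgt:ThermokarstProcess rdfs:label "thermokarst process"@en .
```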

Can we consider this an agreement? Should we wait until the July meeting?

graybeal commented 4 years ago

I note that for item 4 above, that may mean that a SWEET term in the process branch might get mapped to a term with a label like: "[SWEET term name] process" in order to distinguish for example verb versions from noun versions, etc.

Whether and how to change the label is up to the community, but it would reduce ambiguity and enhance search in certain label-aware applications.

I was thinking it was relevant (given that SWEET is semantic identifiers) to change the identifiers. But now that I think about it, the context is already in the identifier, even if severely abbreviated, so never mind about that.

lewismc commented 4 years ago

@graybeal are you happy with the way the annotations are proposed in https://github.com/ESIPFed/sweet/pull/201 ?