gbif / rs.gbif.org

GBIF machine-readable resources
https://rs.gbif.org

Recommended changes to thesaurus.xsd #75

Closed tucotuco closed 2 years ago

tucotuco commented 2 years ago

Is it worthwhile to enable controlled vocabulary files in XML to include the usage notes for the term? The controlled vocabulary values for basisOfRecord, establishmentMeans, degreeOfEstablishment, and pathway all have such notes.

This issue is similar to https://github.com/gbif/rs.gbif.org/issues/45 for properties, which is in the process of being implemented.

tucotuco commented 2 years ago

It might also be worth adding an examples attribute.

tucotuco commented 2 years ago

The controlled vocabulary schema thesaurus.xsd imports a local copy of a file dc.xsd, which is meant to be a subset of Dublin Core terms needed in GBIF schemas. In dc.xsd is a declaration of a term dc:URI, which is used in the thesaurus schema. But dc:URI is a syntax encoding scheme (a datatype) in Dublin Core, not a property. Thus, its declaration as a property in dc.xsd with a datatype of xs:anyURI makes the term something entirely distinct from, and incompatible with, the actual Dublin Core dc:URI term.

I checked the other property declarations in dc.xsd. There do not seem to be any other incompatibilities.

And if we are going to fix anything, we might as well fix everything. The namespace declaration xmlns:dcmitype="http://purl.org/dc/dcmitype/" in thesaurus.xsd is not used. It can be omitted.

The conventional namespace abbreviation for http://purl.org/dc/terms/ is dcterms: - dc: is conventionally used for http://purl.org/dc/elements/1.1/. All of the terms in dc.xsd (except the aforementioned dc:URI) are properties in the namespace http://purl.org/dc/terms/. I recommend changing all instances of dc: to dcterms: in thesaurus.xsd.

The property dcterms:subject has the same purpose that dc:URI is being used for in thesaurus.xsd, and dc:URI is not used in any other schema in https://github.com/gbif/rs.gbif.org/tree/master/schema. I recommend replacing all instances of dc:URI in thesaurus.xsd with dcterms:subject and removing <xs:attribute name="URI" type="xs:anyURI"> from dc.xsd. Its annotation can go on <xs:attribute name="subject" type="xs:string"> instead. This means that <xs:attribute ref="dc:URI" use="required"/> needs to be replaced with <xs:attribute ref="dc:subject" use="required"/> in thesaurus.xsd and <xs:attribute ref="dc:subject" use="optional"/> needs to be removed.

If desired, I can make any or all of the changes recommended in this issue. Building the controlled vocabularies referenced by the Darwin Core Core and Extension XML files depends on the resolution of this issue.
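
On the dc: to dcterms: prefix change recommended above, here is a minimal Python sketch (standard library only) of what the rename would mean for a namespace-aware consumer. The two one-line documents are made up for illustration and are not taken from any real vocabulary file. Such a parser keys attributes by namespace URI, so the prefix change alone should leave the parsed result unchanged. Tools that match the literal text dc: would still be affected.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element documents; only the prefix differs.
OLD = '<concept xmlns:dc="http://purl.org/dc/terms/" dc:identifier="native"/>'
NEW = '<concept xmlns:dcterms="http://purl.org/dc/terms/" dcterms:identifier="native"/>'

old_attrs = ET.fromstring(OLD).attrib
new_attrs = ET.fromstring(NEW).attrib

# ElementTree stores attribute names as "{namespace-uri}local-name",
# so both documents yield the same key regardless of the prefix used.
print(old_attrs == new_attrs)
print(list(new_attrs))
```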

timrobertson100 commented 2 years ago

Thanks @tucotuco

Can you provide a link to an example showing the usage comments and examples, please? They seem surprisingly hard to find. I would have assumed there was only one value in the example, and the definition would describe the expected use.

mdoering commented 2 years ago

Maybe I don't understand fully, but a thesaurus defines all allowed values explicitly. Why should there then be an example and what would it hold other than one of those values? Seems rather redundant.

The concept description currently combines definition and usage notes as found in DwC terms I would think. Not sure if we need to separate between the two. If we want the definition to be more stable and allow usage notes to change freely this would be an option. But so far we had rather loose versioning of vocabularies compared to property terms and extension definitions. If we stick with that a single description sounds simpler and more appropriate to me.

in SKOS you have a definition and a scopeNote which could be seen similar to usage notes?

tucotuco commented 2 years ago

> Thanks @tucotuco
>
> Can you provide a link to an example showing the usage comments and examples, please? They seem surprisingly hard to find. I would have assumed there was only one value in the example, and the definition would describe the expected use.

The Darwin Core Classes recommended as vocabulary for basisOfRecord (e.g., PreservedSpecimen) all have examples, but do not have usage notes. All of the recommended controlled vocabulary terms (Concepts) for the Darwin Core terms establishmentMeans, degreeOfEstablishment, and pathway have usage notes, but do not have examples (except at times in the usage notes). Here is an example usage note for the Concept "native", "Considered native and naturally occuring [sic]. See also Blackburn et al. 2011 https://doi.org/10.1016/j.tree.2011.03.023 category A".

tucotuco commented 2 years ago

> Maybe I don't understand fully, but a thesaurus defines all allowed values explicitly. Why should there then be an example and what would it hold other than one of those values? Seems rather redundant.

I hope this was answered in the preceding comment.

> The concept description currently combines definition and usage notes as found in DwC terms I would think. Not sure if we need to separate between the two. If we want the definition to be more stable and allow usage notes to change freely this would be an option. But so far we had rather loose versioning of vocabularies compared to property terms and extension definitions. If we stick with that a single description sounds simpler and more appropriate to me.

As of https://github.com/gbif/rs.gbif.org/pull/71, the Comments and Examples have been included in extension.xsd. When those were all combined in the definition, the definition was sufficient. After the separation of the Comments and Examples into non-normative parts of the terms, they would be lost or have to be remerged into the definition in order not to lose that useful information. I wrote a script (see https://github.com/gbif/rs.gbif.org/issues/21#issuecomment-900490420) to generate the Darwin Core XML files following the new extension.xsd, so nothing is lost at all, and no manual labor has to be done to update those XML files. All content will be consistent between the standard and the XML files this way as well, unless overridden in the configuration files that are used by the script.

So, it is actually simpler now to have these three attributes separate than to have them combined. A win-win.

> in SKOS you have a definition and a scopeNote which could be seen similar to usage notes?

Yes, the SKOS scopeNote is functionally the same as the <dcterms:description> in the Darwin Core terms (the "Notes" in the Darwin Core Quick Reference Guide term displays).

The Darwin Core examples are SKOS examples.

I am otherwise ready with the new controlled vocabulary files for basisOfRecord, establishmentMeans, degreeOfEstablishment, and pathway, awaiting confirmation of whether any of the proposed changes to thesaurus.xsd are acceptable at this time.

mdoering commented 2 years ago

Sounds sensible to me to also have the separation in the thesaurus.xsd then. But I don't know how much impact that would have on the IPT and other tools. My guess is rather little, as it's an extra field that can be picked up gradually, @marcos-lg ?

MattBlissett commented 2 years ago

@mike-podolskiy90 is now the main developer for the IPT.

Changing dc:URI to dc:subject would break currently-deployed IPTs. Could we deprecate dc:URI, but keep the attribute until IPTs ≤2.5.0 are no longer used? (Note this would take years.)

Current native definition:

<concept 
  dc:identifier="native" 
  dc:URI="http://rs.gbif.org/vocabulary/gbif/establishment_means/native" 
  dc:relation=""
  dc:description="A species that is a part of the balance of nature that has developed over hundreds or thousands of years in a particular region or ecosystem. The word native should always be used with a geographic qualifier (for example, native to New England).">
  <preferred>
   <term dc:title="native" xml:lang="en"/>
  </preferred>
  <alternative>
   <term dc:title="indigenous" xml:lang="en"  />
   <term dc:title="reintroduced" xml:lang="en"  />
  </alternative>
 </concept>

What I think John is proposing, but with the dc:URI retained:

<concept 
  dc:identifier="e001" 
① dc:URI="http://rs.tdwg.org/dwcem/values/e001" 
② dc:subject="http://rs.tdwg.org/dwcem/values/e001" 
⑤ dc:relation="https://doi.org/10.3897/biss.3.38084"
③ dc:description="A taxon occurring within its natural range."
④ dc:comments="What is considered native to an area varies with the biogeographic history of an area and the local interpretation of what is a “natural range”."
⑥ dc:examples="">
  <preferred>
   <term dc:title="native" xml:lang="en"/>
  </preferred>
  <alternative>
   <term dc:title="indigenous" xml:lang="en"  />
  </alternative>
 </concept>

① Is probably required for backward compatibility, replaced by ②.

Usage notes removed from ③ and added to new ④ dc:comments attribute.

⑤ dc:relation is used by the IPT to link to further documentation.

⑥ Doesn't exist for this concept, but does for PreservedSpecimen "A plant on an herbarium sheet. A cataloged lot of fish in a jar.".
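
The backward-compatible layout above (① kept, ② added) can be read by a consumer roughly as follows. This is a sketch under stated assumptions, not how the IPT does it. It assumes the dc: prefix is bound to http://purl.org/dc/terms/, as the thread says of the terms in dc.xsd, and it leaves the concept element un-namespaced because the real files' default namespace is not shown here. The helper name concept_uri is hypothetical.

```python
import xml.etree.ElementTree as ET

DCTERMS = "http://purl.org/dc/terms/"  # assumed binding of the dc: prefix

# Trimmed copy of the proposed "native" concept from this thread.
NEW_STYLE = """
<concept xmlns:dc="http://purl.org/dc/terms/"
  dc:identifier="e001"
  dc:URI="http://rs.tdwg.org/dwcem/values/e001"
  dc:subject="http://rs.tdwg.org/dwcem/values/e001"
  dc:description="A taxon occurring within its natural range."/>
""".strip()

# Trimmed copy of the currently deployed concept, which only has dc:URI.
OLD_STYLE = """
<concept xmlns:dc="http://purl.org/dc/terms/"
  dc:identifier="native"
  dc:URI="http://rs.gbif.org/vocabulary/gbif/establishment_means/native"/>
""".strip()


def concept_uri(concept):
    """Prefer dc:subject; fall back to the deprecated dc:URI (hypothetical helper)."""
    for name in ("subject", "URI"):
        value = concept.get(f"{{{DCTERMS}}}{name}")
        if value:
            return value
    return None


new_uri = concept_uri(ET.fromstring(NEW_STYLE))
old_uri = concept_uri(ET.fromstring(OLD_STYLE))
print(new_uri)
print(old_uri)
```

A reader written this way keeps working on files that predate the change, while older readers that only know dc:URI keep working on new files as long as ① is retained.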

tucotuco commented 2 years ago

@MattBlissett That is even more than I was proposing, but it is a thing of beauty. The only "tough" part for automation is extraction of the citation for dc:relation from the dc:comments where it is currently.
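
One possible way to automate that extraction, as a rough sketch only: pull the first doi.org link out of the usage note and use it for dc:relation, returning nothing when no link is present. The sample note is the "native" usage note quoted earlier in this thread (without the [sic] marker). The function name and the regular expression are assumptions for illustration, and notes that cite references in other forms would need separate handling.

```python
import re
from typing import Optional

# Matches a doi.org link up to the next whitespace character.
DOI_URL = re.compile(r"https?://doi\.org/\S+")


def first_doi_url(comment: str) -> Optional[str]:
    """Return the first doi.org URL in a usage note, or None (hypothetical helper)."""
    match = DOI_URL.search(comment)
    if match is None:
        return None
    # Drop punctuation that belongs to the sentence rather than the link.
    return match.group(0).rstrip(".,;)")


NOTE = (
    "Considered native and naturally occuring. See also Blackburn et al. 2011 "
    "https://doi.org/10.1016/j.tree.2011.03.023 category A"
)

relation = first_doi_url(NOTE)
print(relation)
```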

timrobertson100 commented 2 years ago

> The only "tough" part...

Might it be worth considering if it is worth the effort? Do they do anything more than power the IPT dropdown?

tucotuco commented 2 years ago

It seems to me that the biggest consideration will be how you want to eventually produce these vocabularies automatically from the registry. That may be a ways off, so what can/should we do now so that we can have the updated vocabs available to users in the IPT?

timrobertson100 commented 2 years ago

I'd suggest we add the easy things, and keep whatever is needed to keep compatibility with the 100s of installations. I understand from the comments above, that would mean

<concept 
  dc:identifier="e001" 
① dc:URI="http://rs.tdwg.org/dwcem/values/e001" 
② dc:subject="http://rs.tdwg.org/dwcem/values/e001" 
⑤ dc:relation="https://doi.org/10.3897/biss.3.38084"
③ dc:description="A taxon occurring within its natural range."
④ dc:comments="What is considered native to an area varies with the biogeographic history of an area and the local interpretation of what is a “natural range”.">
  <preferred>
   <term dc:title="native" xml:lang="en"/>
  </preferred>
  <alternative>
   <term dc:title="indigenous" xml:lang="en"  />
  </alternative>
 </concept>

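Notes

  • Only populate ⑤ if it is easy; otherwise, leave it null.
  • I removed ⑥ as the text in the BasisOfRecord values ("A plant on an herbarium sheet. A cataloged lot of fish in a jar") is not an example of what people should be putting into this controlled field but is actually closer to a usage note. Examples should be cut-and-paste examples of what you might use in the field.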

tucotuco commented 2 years ago

> I'd suggest we add the easy things, and keep whatever is needed to keep compatibility with the 100s of installations. I understand from the comments above, that would mean
>
> <concept 
>   dc:identifier="e001" 
> ① dc:URI="http://rs.tdwg.org/dwcem/values/e001" 
> ② dc:subject="http://rs.tdwg.org/dwcem/values/e001" 
> ⑤ dc:relation="https://doi.org/10.3897/biss.3.38084"
> ③ dc:description="A taxon occurring within its natural range."
> ④ dc:comments="What is considered native to an area varies with the biogeographic history of an area and the local interpretation of what is a “natural range”.">
>   <preferred>
>    <term dc:title="native" xml:lang="en"/>
>   </preferred>
>   <alternative>
>    <term dc:title="indigenous" xml:lang="en"  />
>   </alternative>
>  </concept>

That's super easy, it just requires the addition of

<xs:attribute ref="dc:comments" use="optional"/>

to thesaurus.xsd. I will proceed in finishing off the controlled vocabulary xml files for basisOfRecord, establishmentMeans, degreeOfEstablishment, and pathway in anticipation of this being added. They just need to have the comments added.

> Notes
>
>   • Only populate ⑤ if it is easy; otherwise, leave it null.
>   • I removed ⑥ as the text in the BasisOfRecord values ("A plant on an herbarium sheet. A cataloged lot of fish in a jar") is not an example of what people should be putting into this controlled field but is actually closer to a usage note. Examples should be cut-and-paste examples of what you might use in the field.

Since there seems to be a misunderstanding even here about what those examples are (they are examples of a PreservedSpecimen, which IS the vocabulary term, not examples of what would go in basisOfRecord for a PreservedSpecimen), and since there are no examples in the other vocabularies, I am fine with omitting those.
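
For generating the concept entries in the agreed layout, a minimal sketch could look like the following. It is not the script referenced earlier in this thread. The function build_concept and its parameters are hypothetical. The concept element is left un-namespaced because the real files' default namespace is not shown here, and dc:relation defaults to an empty string only because the existing "native" definition carries dc:relation="". dc:comments is written only when a note exists, matching use="optional".

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/terms/"  # assumed binding of the dc: prefix
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
ET.register_namespace("dc", DC)  # serialize with the dc: prefix used in this thread


def dc(name):
    """Build the namespace-qualified attribute name ElementTree expects."""
    return f"{{{DC}}}{name}"


def build_concept(identifier, uri, description, preferred,
                  alternatives=(), comments=None, relation=""):
    """Hypothetical builder for one <concept> in the layout agreed above."""
    attrs = {
        dc("identifier"): identifier,
        dc("URI"): uri,        # kept for currently deployed IPTs
        dc("subject"): uri,    # same value under the new attribute
        dc("relation"): relation,
        dc("description"): description,
    }
    if comments:               # optional: omitted when there is no usage note
        attrs[dc("comments")] = comments
    concept = ET.Element("concept", attrs)
    pref = ET.SubElement(concept, "preferred")
    ET.SubElement(pref, "term", {dc("title"): preferred, XML_LANG: "en"})
    if alternatives:
        alt = ET.SubElement(concept, "alternative")
        for label in alternatives:
            ET.SubElement(alt, "term", {dc("title"): label, XML_LANG: "en"})
    return concept


native = build_concept(
    identifier="e001",
    uri="http://rs.tdwg.org/dwcem/values/e001",
    description="A taxon occurring within its natural range.",
    preferred="native",
    alternatives=["indigenous"],
    comments="What is considered native to an area varies with the "
             "biogeographic history of an area and the local interpretation "
             "of what is a “natural range”.",
)
xml_text = ET.tostring(native, encoding="unicode")
print(xml_text)
```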

timrobertson100 commented 2 years ago

Thanks, @tucotuco - I think we've arrived at the design for this issue

I'll comment on the side discussion on examples here though just to help explain the thinking.

The thesauri schema was originally modeled by @mdoering and me to provide an enumerated picklist (label and definition) only, and that probably explains why we both initially questioned the notion of examples. In this thread and in the GBIF vocabulary server, the thinking is much more aligned to SKOS which is certainly no bad thing.

As we evolve, I think we would be better to strictly keep example as a means to illustrate how things should be done and use scopeNote if we want text about where you might apply the concept, which I expect will be rarely needed. This is in line with my understanding of SKOS which says that example is for an example of the use of a concept. I recognize that sentence is a bit ambiguous, but note that all the SKOS documentation does provide technical examples of use, not descriptive examples of where you would use it - e.g. see SKOS Core Vocabulary Specification where an example is a link such as this.

For PreservedSpecimen, this would then mean:

Identifier: http://rs.tdwg.org/dwc/terms/PreservedSpecimen
Definition: A specimen that has been preserved.
Examples: link
Scope notes: Suitable for use when a record represents a plant on a herbarium sheet, a cataloged lot of fish in a jar, a pinned insect etc.

Note: I use a live record link here, but we wouldn't do that in practice. Note 2: A scope note could be used to clarify what to do in less usual cases, such as preserving a seed.

Hope this helps explain the thinking at least, and I'm happy to be convinced otherwise.