gbif / rs.gbif.org

GBIF machine-readable resources
https://rs.gbif.org

A series of commits to create Darwin Core Extension XML files #62

Closed tucotuco closed 2 years ago

tucotuco commented 3 years ago

The first commit amends the invalid extension XSD file, providing a missing closing tag.

The second commit adds versions of the three Core XML files with all changes to Darwin Core incorporated, and with Comments separated from definitions in keeping with the way Darwin Core is now managed, and congruent with the extension schema from the previous commit.

The third commit adds two Darwin Core Extension XML files with all changes to Darwin Core incorporated, and with Comments separated from definitions in keeping with the way Darwin Core is now managed, and congruent with the extension schema from the first of these three commits.

MattBlissett commented 3 years ago

For the species distribution extension, some of the changes alter the concept described.

occurrenceStatus was a statement of "how frequently the species occurs" (rare, absent, etc) but becomes just present/absent.

locationID and countryCode also have their meanings changed.

MattBlissett commented 3 years ago

Observations on the occurrence core, as I'm not sure what the IPT does i.e. if this is intentional:

There's a link to a vocabulary for establishmentMeans which doesn't (yet?) exist.

GBIF vocabularies exist for sex and typeStatus, but these aren't linked.

Life Stage doesn't have a vocabulary linked; an old one in XML format exists, but we might need a way to export this from https://registry.gbif.org/vocabulary/LifeStage

tucotuco commented 3 years ago

For the species distribution extension, some of the changes alter the concept described.

occurrenceStatus was a statement of "how frequently the species occurs" (rare, absent, etc) but becomes just present/absent.

locationID and countryCode also have their meanings changed. These were oversights on my part. If I had realized the terms were not standard, I would have raised a red flag rather than edit the extension.

This is why I have a problem with the uncontrolled creation of extensions, people inventing new definitions for existing terms. The extension specifically references Darwin Core terms. If the meanings here are supposed to be different, then they are not the Darwin Core terms and should be replaced with terms that do match. For example, occurrenceStatus applies to an Occurrence, not a species distribution, and has only two recommended values, 'present' and 'absent'.

The meaning of locationID isn't different, so that is good, but the usage notes (comments, which the new extension schema allows to be separate) should be augmented rather than replacing the definition of the term, which is normative.

Similarly with countryCode, except that here two different vocabularies are being recommended at the same time. It is trivial and loss-less to convert 3-letter codes to two letter codes. Why change things?

I am all for people being enabled to share data, but I think it a big mistake to call terms Darwin Core when they are not normative. I worry that GBIF might take the heat for it, since they are uniquely responsible for it being promoted.

tucotuco commented 3 years ago

There's a link to a vocabulary for establishmentMeans which doesn't (yet?) exist.

This one? http://rs.tdwg.org/dwcem.htm That does exist.

tucotuco commented 3 years ago

GBIF vocabularies exist for sex and typeStatus, but these aren't linked.

Life Stage doesn't have a vocabulary linked; an old one in XML format exists, but we might need a way to export this from https://registry.gbif.org/vocabulary/LifeStage

It would be great to include links to the GBIF-mediated vocabularies, if those are ready for it. I would not recommend linking to the old enumerations.

mdoering commented 3 years ago

The meaning of locationID isn't different, so that is good, but the usage notes (comments, which the new extension schema allows to be separate) should be augmented rather than replacing the definition of the term, which is normative.

I agree, but then let's leave those usage notes in. They are removed in this PR. We did want to recommend a specific use for identifying larger areas in particular, based on different existing systems.

Similarly with countryCode, except that here two different vocabularies are being recommended at the same time. It is trivial and loss-less to convert 3-letter codes to two letter codes. Why change things?

Why change things is my question too. This has been in use for over a decade. It is the same ISO3166 standard in the end and neither conflicting nor ambiguous to use 2 or 3 letter codes. The same is true for languages.

Btw, why is the GBIF distribution extension the only one being updated here? What about vernacular names, descriptions, species profile, etc? They all have at least one dwc term. Species profile for example contains dwc:habitat which has a definition in DwC that does not make much sense for a species profile without any "event":

A category or description of the habitat in which the Event occurred.

If we want to be strict with the correct usage of DwC terms it seems to me then that it is best to never use DwC terms outside their original context. And that DwC must also be very conservative and not easily change any definition at all. In general the application of dwc terms to species data has probably been a wrong decision in that light. But it had worked quite well so far :)

mdoering commented 3 years ago

I see the schema location has been changed from gbif.org to github, e.g. https://raw.githubusercontent.com/gbif/rs.gbif.org/master/schema/extension_2021-02-15.xsd

Is this a good idea? GitHub might change the way they offer access to "raw" files.

mdoering commented 3 years ago

I just had a brief look at the diff for the Taxon extension. It removes all extra information, like the explanation that dwc:taxonID is not just for accepted taxa but is also used to record a synonym. I think that's rather important to understand. We really need to add more information than just the normative definitions. In theory I don't mind adding a new attribute for this to clearly separate the normative documentation from detailed recommendations. But that would break all applications using it, like the IPT.

@tucotuco any idea what to do best?

timrobertson100 commented 3 years ago

There is a lot going on in this PR and we need to proceed pragmatically.

I propose that we aim for a result that the extensions do not misuse DwC terms, but we don't break everything in the process. I suspect the sensible option is that we move those few terms into the GBIF namespace so the definitions can be refined for the specific need. We could do that in a separate PR to avoid delaying the pressing need which is bringing the occurrence up to date.

... a problem with the uncontrolled creation of extensions...

I'm not surprised that we have a few in this state, as these issues relate to some of the earliest extensions that were defined, at a time when things were pretty much in flux. It's incorrect to say that they were not done in a controlled manner though - there were working groups backed by multi-institution implementations who pioneered all this without guidelines. Now is a good time to fix up those oversights though.

mdoering commented 3 years ago

There is a lot going on in this PR and we need to proceed pragmatically.

I propose that we aim for a result that the extensions do not misuse DwC terms, but we don't break everything in the process. I suspect the sensible option is that we move those few terms into the GBIF namespace so the definitions can be refined for the specific need. We could do that in a separate PR to avoid delaying the pressing need which is bringing the occurrence up to date.

How about splitting this PR into a more Occurrence-oriented one and a Checklist-oriented one then? It seems the Occurrence, Event core, Identification and XSD changes are rather straightforward. Taxon core and Distribution (and maybe the other extensions missing here) likely need more discussion?

MattBlissett commented 3 years ago

To make comparisons easier (I didn't find a nice XML diff tool), I've deployed this branch (VertNet/schema2021) to a new, test RS site: https://rs.gbif-uat.org/sandbox/

This is a manual deployment, and I can update it according to whatever we need to test.

tucotuco commented 3 years ago

I don't think it is a good idea to ship these changes without further discussion. They considerably alter the original meaning, especially of the GBIF species distribution extension. I will need to produce a diff locally first to assess all changes; due to the changed filenames, it's not immediately visible.

For the DwC core files it's a good idea to keep them up to date - but it might be good for GBIF to verify we are indeed up to date with the latest DwC terms.

I am fine with the changes to non-Core extensions not being accepted, and apologize for not realizing the extent to which they made recommendations that clash with the standard. However, I would urge that updates to those extensions be made to replace any terms that are being used with definitions that are not Darwin Core definitions (including the Class they are organized in) with different ones minted in the gbif namespace. The labels can remain the same and minimize any impact on implementations.

The Core files are a different matter. Those should definitely reflect the current standard exactly, and it was for these that I was really aiming, but got carried away with the other extensions "trying to be helpful".
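As a rough illustration of the re-minting suggested above, the following Python sketch moves a term whose definition diverges from Darwin Core into a GBIF namespace while leaving its label (`name`) untouched. The dictionary layout, the `remint` helper, and the choice of `http://rs.gbif.org/terms/1.0/` as the target namespace are assumptions for illustration, not a description of how the extension files are actually generated.

```python
DWC_NS = "http://rs.tdwg.org/dwc/terms/"
GBIF_NS = "http://rs.gbif.org/terms/1.0/"  # assumed target namespace


def remint(term: dict, redefined: set) -> dict:
    """Return a copy of `term`, moved to the GBIF namespace if it is redefined.

    `term` is a hypothetical record with "name", "namespace" and "qualName".
    `redefined` holds the names whose definitions differ from Darwin Core.
    """
    if term["namespace"] == DWC_NS and term["name"] in redefined:
        # Keep the label; change only the namespace and qualified name.
        return {**term, "namespace": GBIF_NS, "qualName": GBIF_NS + term["name"]}
    return dict(term)


occurrence_status = {
    "name": "occurrenceStatus",
    "namespace": DWC_NS,
    "qualName": DWC_NS + "occurrenceStatus",
}
moved = remint(occurrence_status, redefined={"occurrenceStatus"})
```

Because the `name` stays the same, a consumer that keys on labels would see no change, while one that keys on `qualName` would correctly stop treating the term as Darwin Core.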

tucotuco commented 3 years ago

I am also not sure if we should version the XSD files in the same way we want the IPT XML files to be managed. Isn't extension_2021-02-15.xsd IPT specific?

For this I was following the versioning pattern of the XML files, knowing it was going into the sandbox for testing and further refinement as necessary. If the final result occupies schema/extension.xsd after testing, that is fine with me. Whatever makes the most sense there.

tucotuco commented 3 years ago

The meaning of locationID isn't different, so that is good, but the usage notes (comments, which the new extension schema allows to be separate) should be augmented rather than replacing the definition of the term, which is normative.

I agree, but then let's leave those usage notes in. They are removed in this PR. We did want to recommend a specific use for identifying larger areas in particular, based on different existing systems.

Sure, keep the usage notes, but in Comments, not in the definition. It was my oversight (expecting standard definitions) that caused them to be overwritten.

tucotuco commented 3 years ago

Similarly with countryCode, except that here two different vocabularies are being recommended at the same time. It is trivial and loss-less to convert 3-letter codes to two letter codes. Why change things?

Why change things is my question too. This has been in use for over a decade. It is the same ISO3166 standard in the end and neither conflicting nor ambiguous to use 2 or 3 letter codes. The same is true for languages.

I know this one isn't about languages, but there isn't a one-to-one mapping between the language standards. But as far as allowing a mix of the two lists for country codes, it just seems silly to me to build in that the content will have to be resolved to a single standard from two, when just specifying the one that has always been recommended would avoid that. My perspective on silliness aside, the term definition doesn't disallow this, but it is something that should move to the Comments, not be left in the definition.
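To make the resolution step concrete, here is a minimal Python sketch of what a consumer has to do when both 2-letter and 3-letter ISO 3166-1 codes are allowed. The mapping table is a tiny illustrative subset and the `to_alpha2` function name is hypothetical; a real implementation would load the full ISO 3166-1 list.

```python
# Illustrative subset of ISO 3166-1 alpha-3 -> alpha-2 codes (not the full list).
ALPHA3_TO_ALPHA2 = {
    "ARG": "AR",
    "DEU": "DE",
    "DNK": "DK",
    "USA": "US",
}
ALPHA2_CODES = set(ALPHA3_TO_ALPHA2.values())


def to_alpha2(code: str) -> str:
    """Normalize a 2- or 3-letter country code to ISO 3166-1 alpha-2."""
    cleaned = code.strip().upper()
    if cleaned in ALPHA2_CODES:
        return cleaned
    if cleaned in ALPHA3_TO_ALPHA2:
        return ALPHA3_TO_ALPHA2[cleaned]
    raise ValueError(f"Unrecognized country code: {code!r}")
```

The conversion itself is a lookup, which supports the point that it is lossless, but every consumer needs this extra step if the extension recommends both lists.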

tucotuco commented 3 years ago

Btw, why is the GBIF distribution extension the only one being updated here? What about vernacular names, descriptions, species profile, etc? They all have at least one dwc term. Species profile for example contains dwc:habitat which has a definition in DwC that does not make much sense for a species profile without any "event":

A category or description of the habitat in which the Event occurred.

The Species Distribution Extension wasn't the only one to be updated; so was the Identification History. In any case, the reason is that they were the ones I thought I was familiar enough with to do the updates, coming from the DwC Maintenance Group perspective. The others somehow felt out of my jurisdiction in that role. Nothing more than that.

The use of dwc:habitat in the Species Profile is another case in which the term is being misapplied in an extension, at least until the definition of the term is changed. Right now it specifically refers to a Location at a place and time (an Event).

tucotuco commented 3 years ago

I just had a brief look at the diff for the Taxon extension. It removes all extra information, like the explanation that dwc:taxonID is not just for accepted taxa but is also used to record a synonym. I think that's rather important to understand. We really need to add more information than just the normative definitions. In theory I don't mind adding a new attribute for this to clearly separate the normative documentation from detailed recommendations. But that would break all applications using it, like the IPT.

@tucotuco any idea what to do best?

Would it actually break the IPT? I don't think so. All that happened in the extension.xsd is to add an attribute for Comments. However, it would break the use of the IPT in the sense that the Comments wouldn't be visible unless the IPT was changed to show them, and people would be left with either looking up the terms and their (potentially different) comments on the Quick Reference Guide or normative terms list document, or looking up the terms on the page that documents the extension.
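The general point can be sketched in Python: a consumer that reads only the attributes it knows about is unaffected by an additional one, but it also never shows the new content. The XML fragment below is hypothetical, the element and attribute names (`property`, `description`, `comments`) are assumptions for illustration, and this says nothing about how the IPT itself parses extensions.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension fragment; names and values are placeholders.
EXTENSION_XML = """
<extension>
  <property name="locationID"
            description="Normative definition text."
            comments="Usage notes specific to this extension."/>
</extension>
"""


def read_definitions(xml_text: str) -> dict:
    """Mimic an older consumer that only knows the 'description' attribute."""
    root = ET.fromstring(xml_text)
    return {
        prop.get("name"): prop.get("description")
        for prop in root.iter("property")
    }


def read_comments(xml_text: str) -> dict:
    """Mimic an updated consumer that also surfaces 'comments'."""
    root = ET.fromstring(xml_text)
    return {
        prop.get("name"): prop.get("comments", "")
        for prop in root.iter("property")
    }
```

In this sketch `read_definitions` keeps working unchanged once `comments` is added, but the usage notes only become visible to users through something like `read_comments`.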

I think a good practice in any case would be to have Usage Guides for every extension that explain all in one place how best to use them and refer to that Guide, which can evolve outside of the extension configuration files, where stability is desirable.

peterdesmet commented 3 years ago

Hi all, catching up with this.

  1. As suggested in https://github.com/gbif/rs.gbif.org/pull/62#issuecomment-780520867, @tucotuco can you restrict this PR to the Occurrence changes?
  2. I can head a Checklist oriented PR, as I know some people are waiting for this. Any PR I should start from?