gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

Duplicate occurrenceID into materialSampleID for collections #29

Open rukayaj opened 3 years ago

dagendresen commented 3 years ago

Apropos -- see https://github.com/tdwg/dwc/issues/314

Occurrence ≠ MaterialSample !!!

One way of thinking (and our first way of thinking for this issue) is to regard the current occurrenceIDs as being, in reality, materialSampleIDs. What the specimens would then actually need are new and different occurrenceIDs (for the species occurrence (Occurrence)), and we would proceed to move the occurrenceIDs we have into materialSampleID. But I think we should put this on hold, in the hope that the domain model might actually be fixed.

The better thing would thus be to wait for a proper materialSampleID (or a DigitalSample ID / ExtendedSpecimen ID -- or maybe a PreservedSpecimenID).

And to simply think of the current occurrenceIDs we have as identifiers for the species occurrence... and not for the specimen.

We (as in the TDWG & GBIF community) urgently need a change in how we think of Occurrence!!!

(basisOfRecord must urgently be deprecated).

rukayaj commented 3 years ago

Oh dear, I just changed most of the MUSIT-based datasets yesterday! OK, so we should just continue to have no materialSampleID for these datasets then, you think? Based on that thread, I think in our case it could also make sense to have a materialSampleID which is the occurrenceID with a set suffix or prefix. I'm wary of changing the occurrenceIDs because it's going to confuse other systems like the resolver and Bionomia.

dagendresen commented 3 years ago

Agree, way too much depends on the current occurrenceIDs remaining unchanged --> thus we should not change the occurrenceIDs!

The important part is that materialSampleID would identify a different thing and cannot be the same UUID as the occurrenceID. Thus our specimens would remain without an identifier identifying them as specimens. Only indirectly identified as derived from the Occurrence as a proxy.

Thus wait to assign materialSampleIDs until we have a more clear domain model?

The DNA bank material samples would be a reasonable start (read: less controversial) for assigning materialSampleIDs (that are different UUIDs from the occurrenceIDs they are derived from).

rukayaj commented 3 years ago

I'm going to close this seeing as it's not on our immediate to-do list any more, or we can keep it open and label it with something if you prefer? The DNA bank material samples already have materialSampleIDs 👍

dagendresen commented 3 years ago

Good (ps: just checking -- assuming that the DNA bank samples' materialSampleIDs are different UUIDs from their occurrenceIDs?)

rukayaj commented 3 years ago

I am fairly sure they are different but will check quickly now

Edit: Actually, no I was wrong - it looks like they actually don't have materialSampleIDs, I just thought they did because they're using the extension. But they link to the occurrenceID without having a separate materialSampleID. I'll ask about the possibility of generating and adding some.

dagendresen commented 3 years ago

This issue will not escape my head :-)

I think that the UUIDs we have in the QR-codes ON the specimens ARE materialSampleIDs. Currently, they are published in GBIF as occurrenceID, but when/if the data models evolve with a new MaterialSample Core, (I think that) these UUIDs would much more appropriately be transferred to materialSampleID...

One way to solve this problem (IDs already declared as occurrenceID) could be to think of the https://purl.org/gbifnorway/id/UUID as the occurrenceID and reserve the urn:uuid:UUID and a new PURL or preferably a handle or a DOI built from the same UUID to become the materialSampleID.

Would we thus consider keeping the https://purl.org/gbifnorway/id/UUID mapped to occurrenceID and map the urn:uuid:UUID as materialSampleID??? Like, just-do-it now?

PS: I think that many specimens from UiO-NHM have QR-codes for https://purl.org/nhmuio/id/UUID - which we could simply see as an alias for the https://purl.org/gbifnorway/id/UUID ???

Maybe create https://purl.org/gbifnorway/MaterialSample/UUID ... ???
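
A minimal sketch of the split proposed above, assuming the PURL form stays mapped to occurrenceID while the `urn:uuid:` form built from the same UUID becomes the materialSampleID. The helper name `identifier_forms` is hypothetical and not part of the resolver; the UUID is the example that appears later in this thread.

```python
import uuid

# PURL pattern quoted in the comment above.
PURL_BASE = "https://purl.org/gbifnorway/id/"


def identifier_forms(raw_uuid: str) -> dict:
    """Derive both identifier forms from one QR-code UUID (hypothetical helper)."""
    parsed = uuid.UUID(raw_uuid)  # raises ValueError if the string is not a UUID
    return {
        "occurrenceID": PURL_BASE + str(parsed),
        "materialSampleID": parsed.urn,  # "urn:uuid:" followed by the UUID
    }


ids = identifier_forms("e1dfb5b8-a4dc-4e51-9dc7-4c174a955411")
print(ids["occurrenceID"])
print(ids["materialSampleID"])
```

Both strings carry the same UUID, so either one can be traced back to the QR code on the specimen, but as identifier names they stay distinct.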

dagendresen commented 3 years ago

We could simply populate the occurrenceID (for specimens) with the Darwin Core triplets -- and move the UUIDs to materialSampleID.

Anyway, delaying fixing this problem will not make it go away -- or become less of a problem as time passes...
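
A rough sketch of that remapping, assuming the usual Darwin Core triplet form `institutionCode:collectionCode:catalogNumber`. The function name, the record values and the placeholder UUID are all made up for illustration; this is not the mapping used in the MUSIT exports.

```python
def remap_record(record: dict) -> dict:
    """Move the UUID to materialSampleID and rebuild occurrenceID as a triplet."""
    triplet = ":".join(
        [record["institutionCode"], record["collectionCode"], record["catalogNumber"]]
    )
    remapped = dict(record)  # copy, so the input record is left untouched
    remapped["materialSampleID"] = record["occurrenceID"]
    remapped["occurrenceID"] = triplet
    return remapped


# Hypothetical record: codes, catalog number and UUID are placeholders.
example = {
    "institutionCode": "EXAMPLE",
    "collectionCode": "COLL",
    "catalogNumber": "12345",
    "occurrenceID": "urn:uuid:00000000-0000-0000-0000-000000000001",
}
remapped = remap_record(example)
print(remapped["occurrenceID"])
print(remapped["materialSampleID"])
```

Note that this changes the published occurrenceIDs, which is exactly the side effect discussed earlier in the thread as a risk for systems that depend on them.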

dagendresen commented 3 years ago

Another obvious conclusion could simply be that the specimens published as gbif:Occurrence of type (basisOfRecord=) PreservedSpecimen are NOT dwc:Occurrence!!! And simply proceed to copy the UUIDs to materialSampleID while for the time being keeping them also as the occurrenceID (for historical reasons).

And then when a MaterialSample core finally is developed and released, progress to stop publishing occurrenceIDs for the collection datasets.

Important would be that the resolver would NOT declare the UUIDs to identify an Occurrence, but declare them to identify a MaterialSample.

Maybe discuss on Tuesday?

rukayaj commented 3 years ago

I am not really sure what makes the most sense to do here, but I can certainly change the mapping back again so occurrenceID is copied into materialSampleID. Sure, we can discuss on Tuesday.

dagendresen commented 3 years ago

I believe the least pain (for the most people) will be to wait for the GBIF data publishing mechanism to allow us to publish collection specimens completely without occurrenceID (using a MaterialSample core or similar) -- and at that time move the urn:uuid:UUIDs from occurrenceID to materialSampleID (... or if enabled maybe preferably (??) to some evidenceID or tokenID).

I believe that the current basisOfRecord = PreservedSpecimen would make an appropriate rationale for such a shift from occurrenceID to materialSampleID.

Meanwhile, the resolver could already now describe the resolved thing only as a MaterialSample and NOT as an Occurrence. And describe the respective UUIDs only as dct:identifier and NOT as occurrenceID.
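
As an illustration only, this is roughly what such a description could look like if the resolver returned a JSON-LD style document. The output shape and the function name are assumptions; the actual resolver response format is not shown in this thread. The namespace IRIs are the standard Darwin Core and Dublin Core terms namespaces.

```python
import json


def describe_material_sample(raw_uuid: str) -> dict:
    """Build a hypothetical description typed as MaterialSample, not Occurrence."""
    urn = f"urn:uuid:{raw_uuid}"
    return {
        "@context": {
            "dwc": "http://rs.tdwg.org/dwc/terms/",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": urn,
        "@type": "dwc:MaterialSample",
        "dct:identifier": urn,  # deliberately not dwc:occurrenceID
    }


doc = describe_material_sample("e1dfb5b8-a4dc-4e51-9dc7-4c174a955411")
print(json.dumps(doc, indent=2))
```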

rukayaj commented 2 years ago

We decided today in a meeting that it is safe to use the QR code UUIDs as materialSampleIDs, so the resolver needs to resolve these.

rukayaj commented 2 years ago

https://github.com/gbif-norway/resolver-docker/issues/19

dagendresen commented 2 years ago

> We decided today in a meeting that it is safe to use the QR code UUIDs as materialSampleIDs, so the resolver needs to resolve these.

Maybe "appropriate to use ..." rather than "safe to use ...". We do not really know whether they are "safe" to use until we gather some more experience. But I think that we agree that the UUIDs encoded in the QR-codes are appropriate to (try to) use as identifiers (type materialSampleID) for the specimen (when we mean "specimen" of type dwc:MaterialSample).

rukayaj commented 2 years ago

We're now publishing materialSampleIDs for the NHM collections in MUSIT, e.g. https://www.gbif.org/occurrence/1094979273 has materialSampleID e1dfb5b8-a4dc-4e51-9dc7-4c174a955411

As can be seen from this PR https://github.com/gbif-norway/resolver-docker/pull/27, nothing much has changed on the resolver, as it has been resolving these IDs since Feb (ever since we started testing with CETAF). The only difference is that previously we were publishing these IDs using otherCatalogNumbers, whereas the resolver is now listening to materialSampleID instead of the otherCatalogNumbers field. So for example, this is the above materialSampleID: https://resolver.gbif.no/e1dfb5b8-a4dc-4e51-9dc7-4c174a955411/.

dagendresen commented 2 years ago

Maybe make them urn:uuid:e1dfb5b8-a4dc-4e51-9dc7-4c174a955411 ?

rukayaj commented 2 years ago

I think I can do that by prefixing with 'urn:uuid:' in the query we use to get that field from MUSIT, but that would mean it would only work for the first UUID in the cases where we have 2 or more materialSampleIDs... Like this one: https://www.gbif.org/occurrence/2867544788

rukayaj commented 2 years ago

I can make it just prefix urn:uuid in the resolver though, and not publish them in GBIF as urn:uuid:

dagendresen commented 2 years ago

> in the cases where we have 2 or more materialSampleIDs

We should NEVER have more than one materialSampleID (for each collection specimen)! Detecting more than one materialSampleID should trigger a warning and notification to the data publisher. And if not fixed, the data record should at least be flagged as an error, and probably excluded from the resolver?
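
A minimal sketch of such a check, assuming each record arrives with a list of materialSampleID values. The function name, the status labels and the example records are hypothetical; notifying the publisher is left out.

```python
from typing import Dict, Sequence


def check_material_sample_ids(ids: Sequence[str]) -> str:
    """Classify a record by how many distinct materialSampleIDs it carries."""
    distinct = {value.strip() for value in ids if value.strip()}
    if not distinct:
        return "missing"
    if len(distinct) == 1:
        return "ok"
    return "error"  # more than one ID: report back to the data publisher


# Hypothetical records keyed by catalog number; the IDs are placeholders.
records: Dict[str, Sequence[str]] = {
    "A-1": ["00000000-0000-0000-0000-000000000001"],
    "A-2": [
        "00000000-0000-0000-0000-000000000002",
        "00000000-0000-0000-0000-000000000003",
    ],
}
statuses = {key: check_material_sample_ids(ids) for key, ids in records.items()}
# Only records with exactly one ID would be handed to the resolver.
resolvable = [key for key, status in statuses.items() if status == "ok"]
print(statuses)
print(resolvable)
```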

dagendresen commented 2 years ago

> I can make it just prefix urn:uuid in the resolver though, and not publish them in GBIF as urn:uuid:

If we prefix the UUIDs (as I think we should) the identifier name without the prefix is a different identifier name! I think that the prefixed urn:uuid:UUID is different from the naked UUID and should never be mixed! ;-)

I think, however, that we can present both the urn:uuid:UUID and the http://purl.org/UUID as alternative identifier name formats - identifying the same thing (and maybe also explore whether they can be complemented by something like https://doi.org/urn:uuid:UUID ??)
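
To make the distinction concrete, here is a small sketch, assuming the only alternative formats are the `urn:uuid:` form and the gbifnorway PURL form quoted earlier in the thread. The helper names are hypothetical. As strings the identifier names differ, but they can be recognised as naming the same thing by comparing the UUID they carry.

```python
import uuid

# Prefixes assumed for illustration; extend if other formats are introduced.
KNOWN_PREFIXES = ("urn:uuid:", "https://purl.org/gbifnorway/id/")


def core_uuid(identifier: str) -> uuid.UUID:
    """Strip a known prefix and parse what remains as a UUID."""
    for prefix in KNOWN_PREFIXES:
        if identifier.startswith(prefix):
            identifier = identifier[len(prefix):]
            break
    return uuid.UUID(identifier)  # raises ValueError for anything else


def same_thing(a: str, b: str) -> bool:
    """True when two identifier names carry the same UUID."""
    return core_uuid(a) == core_uuid(b)


naked = "e1dfb5b8-a4dc-4e51-9dc7-4c174a955411"
prefixed = "urn:uuid:" + naked
purl = "https://purl.org/gbifnorway/id/" + naked

print(naked == prefixed)           # the identifier names differ as strings
print(same_thing(prefixed, purl))  # but they carry the same UUID
```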

rukayaj commented 2 years ago

> in the cases where we have 2 or more materialSampleIDs

> We should NEVER have more than one materialSampleID (for each collection specimen)! Detecting more than one materialSampleID should trigger a warning and notification to the data publisher. And if not fixed, the data record should at least be flagged as an error, and probably excluded from the resolver?

We have quite a lot of records with more than one materialSampleID :( I asked Eirik about this particular one, and he says it's an error. Some of them occur because we have one specimen on two sheets, and each sheet has a QR code. I guess it's probably easiest for us just to split on ';' and publish only the first materialSampleID present. Then we can add the urn:uuid: prefix.
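
The split-and-prefix idea could look roughly like this, assuming the field holds UUIDs separated by ';'. The function name and the placeholder UUIDs are made up for illustration.

```python
import uuid
from typing import Optional


def first_material_sample_id(field: str) -> Optional[str]:
    """Return the first UUID in a ';'-separated field, as urn:uuid:, or None."""
    parts = [part.strip() for part in field.split(";") if part.strip()]
    if not parts:
        return None
    return uuid.UUID(parts[0]).urn  # ValueError if the first part is not a UUID


# Placeholder UUIDs, not real specimen identifiers.
raw = "00000000-0000-0000-0000-000000000001; 00000000-0000-0000-0000-000000000002"
chosen = first_material_sample_id(raw)
print(chosen)
```

This silently drops every UUID after the first one, so records with more than one value pass through without anyone being told.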

dagendresen commented 2 years ago

I am reluctant to "let" specimens with multiple UUIDs simply pass through. Maybe rather report back to the data publisher as an error and demand ONE single materialSampleID BEFORE publishing this on to GBIF?

I believe that the most common reason for multiple UUIDs is that there are multiple photos of the specimen and that each photo was assigned its own UUID...!!

If the "same" occurrence is on two different sheets, I will argue that we have two different specimens and that they really should each have different materialSampleIDs!!! But share the same occurrenceID ;-)

rukayaj commented 2 years ago

> I am reluctant to "let" specimens with multiple UUIDs simply pass through. Maybe rather report back to the data publisher as an error and demand ONE single materialSampleID BEFORE publishing this on to GBIF?

Ok, I will make a file with a list of all records with multiple UUIDs in this field and ask for them to be corrected :)
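
A sketch of how such a list could be produced, assuming the export is a CSV with `catalogNumber` and `materialSampleID` columns and ';' as the separator. The column names and sample rows are assumptions; in-memory buffers stand in for the real files.

```python
import csv
import io


def multi_uuid_rows(rows):
    """Yield one report row per record with more than one materialSampleID."""
    for row in rows:
        ids = [p.strip() for p in row["materialSampleID"].split(";") if p.strip()]
        if len(ids) > 1:
            yield {
                "catalogNumber": row["catalogNumber"],
                "count": len(ids),
                "materialSampleID": ";".join(ids),
            }


# Placeholder data; a real run would open the export file instead.
source = io.StringIO("catalogNumber,materialSampleID\n1,aaa\n2,bbb;ccc\n")
report = io.StringIO()
writer = csv.DictWriter(
    report, fieldnames=["catalogNumber", "count", "materialSampleID"]
)
writer.writeheader()
writer.writerows(multi_uuid_rows(csv.DictReader(source)))
print(report.getvalue())
```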

rukayaj commented 2 years ago

Here's an interesting (!!!!) one: 38 different specimen sheets for one specimen https://www.gbif.org/occurrence/1702254878

dagendresen commented 2 years ago

Wow -- and all herbarium sheets look (from the photos) to share the catalog number 394440. But the QR codes look different, so if they code for the published materialSampleID UUIDs, I think this is actually perfect.