Open rukayaj opened 3 years ago
Oh dear, I just changed most of the MUSIT-based datasets yesterday! Ok so we should just continue to have no materialSampleID for these datasets then, you think? Based on that thread I think in our case it could also make sense to have a materialSampleID which is the occurrenceID with a set suffix or a prefix. I'm wary of changing the occurrenceIDs because it's going to confuse other systems like the resolver and Bionomia.
Agree, way too much depends on the current occurrenceIDs remaining unchanged --> thus we should not change the occurrenceIDs!
The important part is that materialSampleID would identify a different thing and cannot be the same UUID as the occurrenceID. Thus our specimens would remain without an identifier identifying them as specimens. Only indirectly identified as derived from the Occurrence as a proxy.
Thus wait to assign materialSampleIDs until we have a clearer domain model?
The DNA bank material samples would be a reasonable start (read: less controversial) for assigning materialSampleIDs (that are different UUIDs from the occurrenceIDs they are derived from).
I'm going to close this seeing as it's not on our immediate to-do list any more, or we can keep it open and label it with something if you prefer? The DNA bank material samples already have materialSampleIDs 👍
Good (ps: just checking -- assuming that the DNA bank samples' materialSampleIDs are different UUIDs from their occurrenceIDs?)
I am fairly sure they are different but will check quickly now
Edit: Actually, no I was wrong - it looks like they actually don't have materialSampleIDs, I just thought they did because they're using the extension. But they link to the occurrenceID without having a separate materialSampleID. I'll ask about the possibility of generating and adding some.
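A check like the one described above could be sketched roughly as follows. This is only an illustration: the record layout (a list of dicts keyed by Darwin Core term names), the placeholder values and the function name `find_missing_or_reused_ids` are assumptions, not the actual DNA bank export.

```python
def find_missing_or_reused_ids(records):
    """Return (occurrenceID, reason) pairs for records whose materialSampleID
    is empty or identical to the occurrenceID."""
    problems = []
    for rec in records:
        occ = (rec.get("occurrenceID") or "").strip()
        ms = (rec.get("materialSampleID") or "").strip()
        if not ms:
            problems.append((occ, "missing materialSampleID"))
        elif ms.lower() == occ.lower():
            problems.append((occ, "materialSampleID equals occurrenceID"))
    return problems


# Hypothetical example rows (placeholder values, not real data).
example = [
    {"occurrenceID": "occ-1", "materialSampleID": ""},
    {"occurrenceID": "occ-2", "materialSampleID": "occ-2"},
    {"occurrenceID": "occ-3", "materialSampleID": "ms-3"},
]
print(find_missing_or_reused_ids(example))
```

Only the third row would pass, since it is the only one with a materialSampleID that is present and distinct from the occurrenceID.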
I think that the UUIDs we have in the QR-codes ON the specimens ARE materialSampleIDs. Currently, they are published in GBIF as occurrenceID, but when/if the data models evolve with a new MaterialSample Core (I think that) these UUIDs would much more appropriately be transferred to materialSampleID...
One way to solve this problem (IDs already declared as occurrenceID) could be to think of the https://purl.org/gbifnorway/id/UUID as the occurrenceID and reserve the urn:uuid:UUID and a new PURL or preferably a handle or a DOI built from the same UUID to become the materialSampleID.
Would we thus consider keeping the https://purl.org/gbifnorway/id/UUID mapped to occurrenceID and map the urn:uuid:UUID as materialSampleID??? Like, just-do-it now?
PS: I think that many specimens from UiO-NHM have QR-codes for https://purl.org/nhmuio/id/UUID - which we could simply see as an alias for the https://purl.org/gbifnorway/id/UUID ???
Maybe create https://purl.org/gbifnorway/MaterialSample/UUID ... ???
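To make the proposed split concrete, here is a minimal sketch that derives both identifier strings from one UUID. The mapping itself (PURL form as occurrenceID, urn:uuid: form as materialSampleID) is only the suggestion made in this comment, not something that has been decided, and the function name `identifier_forms` is made up for illustration.

```python
import uuid

# PURL pattern quoted in this thread.
PURL_BASE = "https://purl.org/gbifnorway/id/"


def identifier_forms(value):
    """Build the two identifier strings discussed here from one UUID."""
    u = uuid.UUID(value)  # raises ValueError if value is not a valid UUID
    return {
        "occurrenceID": PURL_BASE + str(u),  # proposed: PURL form
        "materialSampleID": u.urn,           # proposed: urn:uuid: form
    }


forms = identifier_forms("e1dfb5b8-a4dc-4e51-9dc7-4c174a955411")
print(forms["occurrenceID"])
print(forms["materialSampleID"])
```

Both strings are built from the same UUID, so either one can be traced back to the QR code on the specimen.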
We could simply populate the occurrenceID (for specimens) with the Darwin Core triplets -- and move the UUIDs to materialSampleID.
Anyway, delaying fixing this problem will not make it go away -- or become less of a problem as time passes...
Another obvious conclusion could simply be that the specimens published as gbif:Occurrence of type (basisOfRecord=) PreservedSpecimen are NOT dwc.Occurrence !!! And simply proceed to copy the UUIDs to materialSampleID while for the time being keeping them also as the occurrenceID (for historical reasons).
And then when a MaterialSample core finally is developed and released, progress to stop publishing occurrenceIDs for the collection datasets.
Important would be that the resolver would NOT declare the UUIDs to identify an Occurrence, but declare them to identify a MaterialSample.
Maybe discuss on Tuesday?
I am not really sure what makes the most sense to do here, but I can certainly change the mapping back again so occurrenceID is copied into materialSampleID. Sure, we can discuss on Tuesday.
I believe the least pain (for the most people) will be to wait for the GBIF data publishing mechanism to allow us to publish collection specimens completely without occurrenceID (using a MaterialSample core or similar) -- and at that time move the urn:uuid:UUIDs from occurrenceID to materialSampleID (... or if enabled maybe preferably (??) to some evidenceID or tokenID).
I believe that the current basisOfRecord = PreservedSpecimen would make an appropriate rationale for such a shift from occurrenceID to materialSampleID.
Meanwhile, the resolver could already now resolve to describing the resolved thing only as a MaterialSample and NOT as an Occurrence. And describe the respective UUIDs only as dct:identifier and NOT as occurrenceID.
We decided today in a meeting that it is safe to use the QR code UUIDs as materialSampleIDs, so the resolver needs to resolve these.
Maybe "appropriate to use ..." rather than "safe to use ...". We do not really know if they are "safe" to use until we gather some more experience. But I think that we agree that the UUIDs encoded in the QR-codes are appropriate to (try to) use as identifiers (type materialSampleID) for the specimen (when we mean "specimen" of type dwc:MaterialSample).
We're now publishing materialSampleIDs for the NHM collections in MUSIT, e.g. https://www.gbif.org/occurrence/1094979273 has materialSampleID e1dfb5b8-a4dc-4e51-9dc7-4c174a955411
As can be seen from this PR https://github.com/gbif-norway/resolver-docker/pull/27, nothing much has changed on the resolver as it has been resolving these IDs since Feb (ever since we started testing with CETAF). The only difference is that previously we were publishing these IDs using otherCatalogNumbers, whereas the resolver is now listening to materialSampleID instead of the otherCatalogNumbers field. So for example this is the above materialSampleID, https://resolver.gbif.no/e1dfb5b8-a4dc-4e51-9dc7-4c174a955411/.
Maybe make them urn:uuid:e1dfb5b8-a4dc-4e51-9dc7-4c174a955411 ?
I think I can do that by prefixing with 'urn:uuid:' in the query we use to get that field from MUSIT, but that would mean it would only work for the first UUID in the cases where we have 2 or more materialSampleIDs... Like this one: https://www.gbif.org/occurrence/2867544788
I can make it just prefix urn:uuid in the resolver though, and not publish them in GBIF as urn:uuid:
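The limitation is easy to see in a small sketch. It assumes the field holds ';'-separated values; the two UUIDs are placeholders and both function names are hypothetical, not taken from the MUSIT query or the resolver code.

```python
def prefix_whole_field(field):
    # Naive approach: prefix the raw field once, as a query-side
    # string concatenation would do.
    return "urn:uuid:" + field


def prefix_each_id(field, sep=";"):
    # Alternative: split the field and prefix every non-empty value.
    parts = [p.strip() for p in field.split(sep) if p.strip()]
    return sep.join("urn:uuid:" + p for p in parts)


# Placeholder UUIDs, not real specimen identifiers.
field = "11111111-1111-1111-1111-111111111111;22222222-2222-2222-2222-222222222222"
print(prefix_whole_field(field))  # only the first value gets the prefix
print(prefix_each_id(field))      # every value gets the prefix
```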
"... in the cases where we have 2 or more materialSampleIDs"
We should NEVER have more than one materialSampleID (for each collection specimen)! Detecting more than one materialSampleID should trigger a warning and a notification to the data publisher. And if it is not fixed, the data record should at least be flagged as an error, and probably excluded from the resolver?
If we prefix the UUIDs (as I think we should) the identifier name without the prefix is a different identifier name! I think that the prefixed urn:uuid:UUID is different from the naked UUID and should never be mixed! ;-)
I think, however, that we can present both the urn:uuid:UUID and the http://purl.org/UUID as alternative identifier name formats - identifying the same thing (and maybe also explore if they can be complemented by something like https://doi.org/urn:uuid:UUID ??)
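The distinction between the identifier name and the thing identified can be illustrated with Python's standard `uuid` module. This is just a sketch using the UUID already quoted in this thread; the variable names are arbitrary.

```python
import uuid

naked = "e1dfb5b8-a4dc-4e51-9dc7-4c174a955411"
prefixed = "urn:uuid:" + naked

# As identifier names (plain strings) the two forms are different,
# so a system matching on the literal string will not treat them as equal.
same_string = naked == prefixed

# Parsed as UUIDs they carry the same value; uuid.UUID accepts the
# urn:uuid: prefix as well as the bare hyphenated form.
same_uuid = uuid.UUID(naked) == uuid.UUID(prefixed)

print(same_string, same_uuid)
```

Any system that stores or joins on the raw string therefore has to pick one form and stick with it, even though both forms can be presented as aliases for the same specimen.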
We have quite a lot of records with more than one materialSampleID :( I asked Eirik about this particular one, and he says it's an error. Some of them occur because we have one specimen on two sheets, and each sheet has a QR code. I guess it's probably easiest for us just to split on ';' and publish only the first materialSampleID present. Then we can add the urn:uuid: prefix.
I am reluctant to "let" specimens with multiple UUIDs simply pass through. Maybe rather report back to the data publisher as an error and demand ONE single materialSampleID BEFORE publishing this on to GBIF?
I believe that the most common reason for multiple UUIDs is that there are multiple photos of the specimen and that each photo was assigned its own UUID...!!
If the "same" occurrence is on two different sheets, I will argue that we have two different specimens and that they really should each have different materialSampleIDs!!! But share the same occurrenceID ;-)
Ok, I will make a file with a list of all records with multiple UUIDs in this field and ask for them to be corrected :)
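A minimal sketch of how such a list could be produced, assuming the records are available as dicts keyed by Darwin Core term names and that multiple values are ';'-separated. The function names, report columns and example values are hypothetical.

```python
import csv
import io


def multiple_id_rows(records, sep=";"):
    """Collect records whose materialSampleID field holds more than one value."""
    rows = []
    for rec in records:
        raw = rec.get("materialSampleID") or ""
        ids = [p.strip() for p in raw.split(sep) if p.strip()]
        if len(ids) > 1:
            rows.append({
                "occurrenceID": rec.get("occurrenceID", ""),
                "count": len(ids),
                "materialSampleIDs": sep.join(ids),
            })
    return rows


def write_report(rows, handle):
    """Write the flagged records as CSV to an open text handle."""
    writer = csv.DictWriter(
        handle, fieldnames=["occurrenceID", "count", "materialSampleIDs"]
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical example rows (placeholder values, not real data).
records = [
    {"occurrenceID": "occ-1", "materialSampleID": "id-a"},
    {"occurrenceID": "occ-2", "materialSampleID": "id-b; id-c"},
]
rows = multiple_id_rows(records)
buffer = io.StringIO()
write_report(rows, buffer)
print(buffer.getvalue())
```

Writing to a handle rather than a fixed path keeps the sketch independent of where the file ends up; passing `open("report.csv", "w", newline="")` would write it to disk.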
Here's an interesting (!!!!) one: 38 different specimen sheets for one specimen https://www.gbif.org/occurrence/1702254878
Wow -- and all herbarium sheets look (from the photos) to share the catalog number 394440. But the QR codes look different, so if they code for the published materialSampleID UUIDs, I think this is actually perfect.
Apropos -- see https://github.com/tdwg/dwc/issues/314
Occurrence ≠ MaterialSample !!!
One way of thinking (and our first way of thinking for this issue) could be to regard the current occurrenceIDs as being, in reality, materialSampleIDs, so that what the specimens actually need are new and different occurrenceIDs (for the species occurrence (Occurrence)). And thus proceed to move the occurrenceIDs we have into materialSampleID. But I think we should put this on hold, in the hope that the domain model might actually be fixed.
The better thing would thus be to wait for a proper materialSampleID (or a DigitalSample ID / ExtendedSpecimen ID -- or maybe a PreservedSpecimenID).
And to simply think of the current occurrenceIDs we have as identifiers for the species occurrence... and not for the specimen.
We (as in the TDWG & GBIF community) urgently need a change in how we think of Occurrence!!!
(basisOfRecord must urgently be deprecated).