gbif-norway / helpdesk

Please submit your helpdesk request here (or send an email to helpdesk@gbif.no). We will also use this repo for documentation of node helpdesk cases.
GNU General Public License v3.0

470 records published by IMR now included in UiB collection #130

Closed rukayaj closed 1 year ago

rukayaj commented 1 year ago

470 PreservedSpecimen (occurrenceIDs = UUIDs, institutionCode = IMR) records published in the artsprosjektet_53-14_copclad dataset from NBIC will be added in MUSIT and published to Invertebrate collections, UiB.

Note: At NHM UiO, occasionally someone takes a sample and logs a sighting of e.g. a moss in Artsobs, which gets published to GBIF by NBIC. This moss sample later gets sent to NHM and included as a specimen in MUSIT and published to GBIF again. Then we use MUSIT "Artsobs nr." -> GBIF "associatedOccurrences" to link the two. We do not have materialSampleIDs currently in MUSIT for the NHM UiO MUSIT datasets.

In MUSIT, Invertebrate collections, UiB has UUIDs in the MUSIT "UUID" field, which are currently published to GBIF in the materialSampleID field. It is possible to map any field from MUSIT onto any valid DwC field.

Options for linking artsprosjektet_53-14_copclad occurrenceID UUIDs with the records to be published in MUSIT:

1. MUSIT: Artsobs nr. -> GBIF: associatedOccurrences (Artsobs nr. does not allow UUID format)
2. MUSIT: UUID field -> GBIF: materialSampleID
3. MUSIT: UUID field -> GBIF: associatedOccurrences
4. MUSIT: UUID field -> GBIF: occurrenceID
5. MUSIT: [some other?] field -> GBIF: occurrenceID

For 2:

This would keep all the fields in MUSIT and in GBIF the same as all the other records, and is therefore the simplest. But is it correct? In artsprosjektet_53-14_copclad the records (identified by occurrenceID) are not really material samples, and it's therefore slightly weird to have their occurrenceIDs in the materialSampleID field in Invertebrate collections, UiB?

For 3 and 4, we would need:

Perhaps it's possible to include two UUIDs in the MUSIT UUID field (like "c1f4134a-3777-4e48-8665-62e2fc2b9b96|fb9b6e93-592e-4f05-9cd4-d39a9d6e8abc"), and always make the first UUID publish to GBIF materialSampleID, and the second to some other GBIF field.

For 4 and 5:

I am slightly reluctant because it would mean the way occurrenceID is constructed gets quite complicated, and occurrenceID is pretty much the most important thing we're publishing. It will just make maintenance and migration more likely to go wrong in the future, but perhaps this is the most correct thing to do.

dagendresen commented 1 year ago

I think the most important is simply to explore and identify WHICH is the identifier for the dwc:Occurrence and WHICH is the identifier for the specimen itself --> and then publish them as such! (You completely lost me in the scenario description above :-D )

rukayaj commented 1 year ago

Oh dear, sorry! I was trying to summarise it succinctly :D I actually think it doesn't matter too much as long as there is some link and the metadata describes it. Then I suggest we just go ahead with 2 as the simplest option if Katrine is happy with that.

kkongshavn commented 1 year ago

Hi! Thank you for summarizing, Rukaya - the emails were getting difficult to keep track of! Previously it was said that in this case, the occurrenceID should be the same for the two datasets, and the MUSIT data should also have a unique materialSampleID?

Would it make sense to use urn:catalog:ZMBN:EVERT:151948 as materialSampleID (the format that has previously been used as occurrenceID on the MUSIT data), and the UUID as occurrenceID? Or is that wrong?

rukayaj commented 1 year ago

I think that it's easiest to use urn:catalog:ZMBN:EVERT:151948 as the occurrenceID so it is the same as the others, and use the UUID as the materialSampleID.

kkongshavn commented 1 year ago

All right, I will finish the file then; the occurrenceID from the dataset in GBIF goes into my UUID field, and becomes materialSampleID for these specimens - correct?

rukayaj commented 1 year ago

Yep! :)

kkongshavn commented 1 year ago

All right, list edited and waiting for upload - fingers crossed!

kkongshavn commented 1 year ago

The data is in GBIF/Artskart now! BUT (at least in Artskart) the IMR data has occurrenceID in UUID format (as we knew): 1f415c3c-b9ea-4d3b-96c1-3de2db8df47e and the UMB data has occurrenceID (for the same record) in the format urn:catalog:ZMBN:EVERT:145125. MaterialSampleID is given under "Dynamiske egenskaper" (dynamic properties).

So as far as I understand there's still no (easy) way (in Artskart) to see which two records are the same* - but will it be possible in GBIF?

dagendresen commented 1 year ago

If a given UUID string is published as occurrenceID in the Artskart/Artsobs dataset (?) then it would be an extremely bad idea to use the SAME UUID string as materialSampleID in another dataset! :-D

E.g.

https://www.gbif.org/occurrence/2875928397 (from artsprosjektet_53-14_copclad) occurrenceID = 3d3021f0-ef81-47a4-a9b7-1a6e895e29d2

https://www.gbif.org/occurrence/4029884723 (from Invertebrate collections, UiB) materialSampleID = urn:uuid:3d3021f0-ef81-47a4-a9b7-1a6e895e29d2 occurrenceID = urn:catalog:ZMBN:EVERT:145575

If you have carefully added all the Artsobs occurrenceID = UUID to the MUSIT UUID field, then these would be super perfect to use as the occurrenceID for the UiB invertebrate dataset! Sorry for not understanding this earlier!

dagendresen commented 1 year ago

Here is the GBIF clustering of similar occurrences (the new invertebrate record(s) will likely be clustered in here) https://www.gbif.org/occurrence/2875928397/cluster

kkongshavn commented 1 year ago

> If a given UUID string is published as occurrenceID in the Artskart/Artsobs dataset (?) then it would be an extremely bad idea to use the SAME UUID string as materialSampleID in another dataset! :-D
>
> E.g.
>
> https://www.gbif.org/occurrence/2875928397 (from artsprosjektet_53-14_copclad) occurrenceID = 3d3021f0-ef81-47a4-a9b7-1a6e895e29d2
>
> https://www.gbif.org/occurrence/4029884723 (from Invertebrate collections, UiB) materialSampleID = urn:uuid:3d3021f0-ef81-47a4-a9b7-1a6e895e29d2 occurrenceID = urn:catalog:ZMBN:EVERT:145575
>
> If you have carefully added all the Artsobs occurrenceID = UUID to the MUSIT UUID field, then these would be super perfect to use as the occurrenceID for the UiB invertebrate dataset! Sorry for not understanding this earlier!

I am SO confused now. @rukayaj, are you following this?

dagendresen commented 1 year ago

I believe you got it spot on above:

> Would it make sense to use urn:catalog:ZMBN:EVERT:151948 as materialSampleID (the format that has previously been used as occurrenceID on the MUSIT data), and the UUID as occurrenceID? Or is that wrong?

I was just not understanding what you were making :-)

The identifier string used in the Artsprosjektet dataset as occurrenceID will be perfect to use also as the occurrenceID for the UiB invertebrate collection dataset. (However, I think that we might be slightly less happy to see a Darwin Core triplet urn:catalog:institutionCode:collectionCode:catalogNumber as the materialSampleID :-) I think such Darwin Core triplets would be perfect as catalogNumber but a poor identifier for materialSampleID - but others disagree here!) Anyway, the current implementation in GBIF makes no use at all of the materialSampleIDs, but hopefully it will with the planned implementation of the new GBIF data model under development.

kkongshavn commented 1 year ago

I think that what NTNU-VM has for their invertebrates must be the UUID field becoming their occurrenceID? This dataset

dagendresen commented 1 year ago

I am much less familiar with how MUSIT is mapped onto the IPT than Rukaya is. I think that UUIDs are a very good choice for both occurrenceID and materialSampleID. (But of course different actual UUID strings for different instances of either).

rukayaj commented 1 year ago

Oh dear then I completely led you astray, sorry @kkongshavn ! So I am not sure what the upshot of this is? Note that I construct the darwin core triplet as the occurrence ID, so if the occurrenceID for these records needs to change then

we would need:

dagendresen commented 1 year ago

> we would need: A way to differentiate the 470 previously published

... because the UUID field in MUSIT for this collection is not filled in for all records?

Maybe pragmatically simpler (but less "correct") to just map the UUIDs (these particular UUIDs) to "associatedOccurrences" -- and accept that we have no materialSampleIDs here yet???

I am so sorry for confusing us all :-D

kkongshavn commented 1 year ago

All "my" data (~12,000 records) in MUSIT has a UUID, which is unique, EXCEPT these 475 in the latest batch, which have the same UUID as the IMR data has as occurrenceID

dagendresen commented 1 year ago

It is such a pain that MUSIT has no dedicated proper specimen identifier field (materialSampleID)! And we need to make such painful "hacks".

Here an annotation service (on top of a resolver) would have been so much clearer ;-)

dagendresen commented 1 year ago

> All "my" data (~12,000 records) in MUSIT has a UUID, which is unique, EXCEPT these 475 in the latest batch, which have the same UUID as the IMR data has as occurrenceID

Then all these UUIDs sound perfect for occurrenceID ...? (except for the confusion that the MUSIT CMS is extremely unclear on what the "UUID" field is for)

rukayaj commented 1 year ago

> It is such a pain that MUSIT has no dedicated proper specimen identifier field (materialSampleID)! And we need to make such painful "hacks".

> In MUSIT, Invertebrate collections, UiB has UUIDs in the MUSIT "UUID" field, which are currently published to GBIF in the materialSampleID field.

So this has essentially become the materialSampleID field I guess?

Anyway, then I need a way of knowing which records should get occurrenceIDs from the UUID field and which records should get the occurrenceID Darwin Core triplet constructed from institutionCode, collectionCode and catalogNumber. Also we presumably need some other UUID to put in the materialSampleID field, so it's consistent with the other records in the Invertebrate collections, UiB dataset.

So maybe do what I suggested before and include that UUID plus another UUID in the MUSIT UUID field (like "c1f4134a-3777-4e48-8665-62e2fc2b9b96|fb9b6e93-592e-4f05-9cd4-d39a9d6e8abc"), and I will always make the first UUID publish to GBIF materialSampleID, and the second to occurrenceID (option 4 in my list). If there isn't a '|' in the UUID field then the DwC triplet gets constructed as normal.

As I said before, I don't really like the idea of doing it this way as it will make data migration (i.e. to Specify) more likely to go wrong in the future. I don't have strong opinions here, but to me it seems like putting them into materialSampleID is the most straightforward thing to do; even if it's a bit confusing, we can describe the link in the metadata.
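As a rough illustration of the '|' rule described above (not the actual publishing code, which is not shown in this thread), the mapping could be sketched in Python like this. The function name `map_identifiers` and its arguments are hypothetical, and the sketch returns bare UUID strings without any `urn:uuid:` prefix.

```python
import re

# Loose pattern for a canonical 8-4-4-4-12 hexadecimal UUID string.
UUID_RE = re.compile(
    r"^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$",
    re.IGNORECASE,
)


def map_identifiers(uuid_field, institution_code, collection_code, catalog_number):
    """Return (occurrenceID, materialSampleID) for one MUSIT record.

    Hypothetical helper: with "first|second" in the UUID field, the first
    UUID becomes materialSampleID and the second becomes occurrenceID.
    Without a '|', the Darwin Core triplet is used as occurrenceID.
    """
    triplet = f"urn:catalog:{institution_code}:{collection_code}:{catalog_number}"
    parts = [p.strip() for p in uuid_field.split("|")] if uuid_field else []

    # Refuse anything that does not look like a UUID rather than publish it.
    for part in parts:
        if not UUID_RE.match(part):
            raise ValueError(f"not a UUID: {part!r}")

    if len(parts) == 2:
        return parts[1], parts[0]
    if len(parts) == 1:
        return triplet, parts[0]
    if not parts:
        return triplet, None
    raise ValueError("expected at most two UUIDs separated by '|'")
```

The extra branch on the number of parts is exactly the added complexity mentioned above: every record's occurrenceID now depends on the content of a free-text field, which is what makes a later migration easier to get wrong.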

kkongshavn commented 1 year ago

> In MUSIT, Invertebrate collections, UiB has UUIDs in the MUSIT "UUID" field, which are currently published to GBIF in the materialSampleID field.
>
> So this has essentially become the materialSampleID field I guess?
>
> Anyway, then I need a way of knowing which records should get occurrenceIDs from the UUID field and which records should get the occurrenceID Darwin Core triplet constructed from institutionCode, collectionCode and catalogNumber. Also we presumably need some other UUID to put in the materialSampleID field, so it's consistent with the other records in the Invertebrate collections, UiB dataset.
>
> So maybe do what I suggested before and include that UUID plus another UUID in the MUSIT UUID field (like "c1f4134a-3777-4e48-8665-62e2fc2b9b96|fb9b6e93-592e-4f05-9cd4-d39a9d6e8abc"), and I will always make the first UUID publish to GBIF materialSampleID, and the second to occurrenceID (option 4 in my list). If there isn't a '|' in the UUID field then the DwC triplet gets constructed as normal.

Why can't the UUID field become occurrenceID for all records? Trondheim has that, as far as I can see: This dataset.

rukayaj commented 1 year ago

Yes, we can change all the occurrenceIDs to become the UUID field but they say occurrenceID should remain consistent and ideally not change - see https://www.gbif.org/data-quality-requirements-occurrences#dcOccurrenceID

kkongshavn commented 1 year ago

If it was set up in a non-optimal way it must be better to correct that now rather than keep adding more data with the same issue?

rukayaj commented 1 year ago

Well it's not very good practice but it would certainly make things simpler. It would break the links in Bionomia though e.g. https://bionomia.net/0000-0002-5188-7305/specimens?datasetKey=ce697b4c-7803-47d8-81bd-5a172f4960b5&action=collected, but these could be fixed, you could add these recordedByIDs and identifiedByIDs back into the MUSIT database so we publish them.

Dag did say he preferred to have UUIDs for materialSampleID. Let's see what he says about this option (removing materialSampleID UUIDs so we do not publish anything for materialSampleID and using the UUIDs as occurrenceIDs for all records).

kkongshavn commented 1 year ago

> Well it's not very good practice but it would certainly make things simpler. It would break the links in Bionomia though but these could be fixed, you could add these recordedByIDs and identifiedByIDs back into the MUSIT database so we publish them.

Since there's no way to do batch edits (!?) in MUSIT, I hope we can avoid anything that means I have to go back and make extensive edits - I simply don't have the hours to spend on it.

For the rest of it, let's see what Dag thinks.

rukayaj commented 1 year ago

> Since there's no way to do batch edits (!?) in MUSIT, I hope we can avoid anything that means I have to go back and make extensive edits - I simply don't have the hours to spend on it.

There isn't!? I thought there was with spreadsheet upload. @vidarbakken do you know?

vidarbakken commented 1 year ago

There is no possibility of batch editing in MUSIT. Stein/Svein can probably do some editing directly in the database.

dagendresen commented 1 year ago

Dataset Registration date October 19, 2020, https://doi.org/10.15468/f2y3bf

If these species occurrences have been published with DwC-triplet occurrenceIDs since October 2020 (??) then changing the occurrenceID identifier strings (to UUIDs) would unfortunately break a lot of links. (Objectively, I think that using UUIDs as identifiers is superior to using DwC-triplets, however breaking this is of course not good).

The materialSampleIDs actually go nowhere serious in GBIF (nor in Bionomia) yet... so changing them would not really break much at this point.

If some of the UUIDs are already used as occurrenceIDs (470 records?), then using them as materialSampleIDs is a really bad idea! We should not do that? Even if the materialSampleIDs are not really used by GBIF yet - they hopefully will be eventually. And the same identifier string identifying both a species occurrence and a specimen would be very bad :-)
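A minimal sketch of how such reuse could be detected before publishing, assuming the occurrenceIDs of one dataset and the materialSampleIDs of the other are available as plain lists of strings. The helper names `normalize` and `reused_identifiers` are hypothetical; the `urn:uuid:` prefix handling is based on the two example records linked earlier in this thread.

```python
def normalize(identifier):
    """Lower-case the identifier and drop an optional 'urn:uuid:' prefix."""
    value = identifier.strip().lower()
    prefix = "urn:uuid:"
    return value[len(prefix):] if value.startswith(prefix) else value


def reused_identifiers(occurrence_ids, material_sample_ids):
    """Return identifiers that appear in both lists after normalization."""
    occurrences = {normalize(i) for i in occurrence_ids}
    samples = {normalize(i) for i in material_sample_ids}
    return sorted(occurrences & samples)


# The pair of records cited above: the same UUID string in both roles.
clashes = reused_identifiers(
    ["3d3021f0-ef81-47a4-a9b7-1a6e895e29d2"],
    ["urn:uuid:3d3021f0-ef81-47a4-a9b7-1a6e895e29d2"],
)
```

An empty result would mean no string identifies both a species occurrence and a specimen; here the list contains the one shared UUID.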

kkongshavn commented 1 year ago

> If some of the UUIDs are already used as occurrenceIDs (470 records?), then using them as materialSampleIDs is a really bad idea!

And yet, that was what I was told to do. How do we solve this, then?

dagendresen commented 1 year ago

> make the first UUID publish to GBIF materialSampleID, and the second to occurrenceID (option 4 in my list) (...) I don't really like the idea of doing it this way as it will make data migration (i.e. to Specify) more likely to go wrong in the future.

Fully agree! Looks like a pain to make and a high risk of future failure! At least the DwC-triplets as occurrenceID is a general rule that we could more easily clean up later.

... maybe easier to make a Zoom call?

dagendresen commented 1 year ago

> And yet, that was what I was told to do. How do we solve this, then?

I think we suggested reusing the same UUID strings that were used as the occurrenceID in the Artsprosjektet dataset as the occurrenceID in the UiB invertebrate dataset... not using the same UUID strings as materialSampleID! (I think we made a big mess of misunderstanding each other here - sorry!!)

The least disruption might be to maintain the DwC-triplets as occurrenceIDs - and if possible find a way to put the 470 reused UUIDs into associatedOccurrences ...?

Alternatively, replace the 470 reused UUIDs with new clean UUIDs (not used before) and map these to materialSampleID ...? And wait for a better CMS that can handle both occurrenceID and materialSampleID in a more appropriate manner ...? We could also simply wait for the DiSSCo Digital Specimen service to become ready, and annotate the specimens with the correct occurrenceIDs here...?

kkongshavn commented 1 year ago

Sure, we can have a Zoom next week - I am free after lunch both Mon and Tuesday

rukayaj commented 1 year ago

> replace the 470 reused UUIDs with new clean UUIDs (not used before) and map these to materialSampleID ...?

That sounds like the best option to me, but then there will not be any link between the two datasets. We could put the linked occIDs into occurrenceRemarks, perhaps, with an explanation?

kkongshavn commented 1 year ago

> replace the 470 reused UUIDs with new clean UUIDs (not used before) and map these to materialSampleID ...?
>
> That sounds like the best option to me, but then there will not be any link between the two datasets. We could put the linked occIDs into occurrenceRemarks, perhaps, with an explanation?

Then I have to delete the entry in MUSIT and re-upload it, won't that also mess with existing links? (Plus, it's more work for both me and Vidar)

And also, then what was the point of this whole discussion - won't we lose the connection between the IMR data and MUSIT data completely then?

rukayaj commented 1 year ago

If we have the linked UUIDs in the occurrenceRemarks of the new records then there's a link there? I can feel your frustration, I am very sorry about this! Practically speaking it doesn't really make much difference at all where we put them to be honest.

kkongshavn commented 1 year ago

> If we have the linked UUIDs in the occurrenceRemarks of the new records then there's a link there? I can feel your frustration, I am very sorry about this! Practically speaking it doesn't really make much difference at all where we put them to be honest.

It is much nicer when things go smoothly! 😉 I still think a call is quicker than this, so let's meet?

rukayaj commented 1 year ago

Ok so the decision after the meeting:

We currently are publishing:

We will change this to be like:

We also need to send a list of the changed occurrenceIDs to GBIF in this format: old occurrenceID | GBIF occurrence key | new occurrenceID
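For illustration only, a table in that three-column layout could be produced with Python's standard `csv` module. The function name `build_id_mapping`, the exact delimiter spacing, and the placeholder row are assumptions; the real identifiers were attached to the linked issue and are not reproduced here.

```python
import csv
import io


def build_id_mapping(rows):
    """Render (old occurrenceID, GBIF occurrence key, new occurrenceID) rows
    as pipe-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    writer.writerow(["old occurrenceID", "GBIF occurrence key", "new occurrenceID"])
    for old_id, gbif_key, new_id in rows:
        writer.writerow([old_id, gbif_key, new_id])
    return buffer.getvalue()


# Placeholder values, not real records.
table = build_id_mapping(
    [("urn:catalog:ZMBN:EVERT:000000", 1234567890, "00000000-0000-0000-0000-000000000000")]
)
```

Including the GBIF occurrence key in each row lets the old and new identifier be matched to one existing GBIF record unambiguously.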

rukayaj commented 1 year ago

Also relevant is this thread #29

rukayaj commented 1 year ago

https://github.com/gbif/portal-feedback/issues/4585

dagendresen commented 1 year ago

See GBIF technical helpdesk hour https://vimeo.com/786853690 at 13m35s

Can we link old IDs to new ones? Yes, send a table to helpdesk; the best way for us to fix this is to have the old occurrenceIDs and the new ones.

rukayaj commented 1 year ago

I thought maybe it would be easier for them to do it with a sql query, but anyway I added the list of IDs to the other issue.

rukayaj commented 1 year ago

@kkongshavn there are two records with the same UUID: 751615e7-d220-4421-bae5-ec718c3d6b75. Can you take a look and change one, please? :)
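A check of this kind can be sketched as below, assuming the export is available as (catalog number, UUID) pairs. The helper name `duplicated_uuids` and the second catalog number are hypothetical; only the UUID and the catalog number 152543 come from this thread.

```python
from collections import defaultdict


def duplicated_uuids(records):
    """Return {uuid: [catalog numbers]} for every UUID used by more than one record."""
    by_uuid = defaultdict(list)
    for catalog_number, uuid_value in records:
        by_uuid[uuid_value.strip().lower()].append(catalog_number)
    return {u: cats for u, cats in by_uuid.items() if len(cats) > 1}


duplicates = duplicated_uuids(
    [
        ("152543", "751615e7-d220-4421-bae5-ec718c3d6b75"),
        ("000001", "751615e7-d220-4421-bae5-ec718c3d6b75"),  # hypothetical second record
        ("000002", "59521c44-4c09-4bf9-8a1a-3e1927dfc0b9"),  # hypothetical, unique UUID
    ]
)
```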

kkongshavn commented 1 year ago

> @kkongshavn there are two records with the same UUID: 751615e7-d220-4421-bae5-ec718c3d6b75. Can you take a look and change one, please? :)

Working on it, but I don't have access to edit UUIDs in musit (I just found out), so I have asked Vidar how to proceed

kkongshavn commented 1 year ago

@rukayaj one of the UUIDs has been changed in MUSIT just now: ZMBN-EVERT-152543 updated with UUID 59521c44-4c09-4bf9-8a1a-3e1927dfc0b9

rukayaj commented 1 year ago

Ok great! It looks like the source I pull the GBIF data from hasn't been updated yet though. I think it only gets updated at midnight every night, so I will give it a try again tomorrow :)

rukayaj commented 1 year ago

I updated that dataset and the uuids are now in place.

rukayaj commented 1 year ago

Seems to be ok now to me. If there are any problems just reopen this issue, @kkongshavn