Merge DanBIF publish datasets from Specify

FedorSteeman commented 3 years ago

As we've started publishing data directly from Specify to GBIF, we're creating an overlap with data sets that were already published as static files by DanBIF in earlier years. Many of these data sets already have multiple citations. To retain citation linkage, we need to diligently relocate the data sets in question by redirecting endpoints for some, while archiving others. To ensure linkage is retained at occurrence record level, the occurrence IDs already set need to be adopted into the new source, which is Specify.

FedorSteeman commented 3 years ago

Plan of attack:

Get Specify7 to spit out the institution prefix in front of the catalognumber. ("NHMD-nnnnnnnnn")
For each collection:
- overwrite GUID fields with occurrence GUIDs already in the originally published dataset (if any)
- Export, change endpoint of original dataset and re-ingest
- Check whether the GBIF IDs remained the same for selected items
- @icalabuig to fix things on DanBIF server end
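
The "overwrite GUID fields" step could be sketched roughly as below. This is only an illustration: it assumes records can be matched on catalog number, and the field names `catalogNumber`, `occurrenceID` and `guid` as well as the function name are hypothetical, not the actual Specify schema or export layout.

```python
def adopt_occurrence_ids(specify_rows, published_rows):
    """Return copies of specify_rows whose GUID is replaced by the
    occurrenceID already published for the same catalog number, if any.

    Both inputs are lists of dicts; the key names are assumptions.
    """
    # Look-up table: catalog number -> occurrenceID in the old dataset
    published = {row["catalogNumber"]: row["occurrenceID"] for row in published_rows}

    updated = []
    for row in specify_rows:
        new_row = dict(row)  # copy, so the input is left untouched
        # Keep the existing GUID when the record was never published before
        new_row["guid"] = published.get(row["catalogNumber"], row["guid"])
        updated.append(new_row)
    return updated
```

Records without a match keep their current GUID, which corresponds to the "(if any)" in the list above.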

Most straightforward collection to start with is Biocultural Botany:

Next up: Herpetology.

original: https://www.gbif.org/dataset/cb643105-2e6b-403d-a23b-2c8128d1f97c

FedorSteeman commented 2 years ago

@DanBIF Isabel: I'm ready to do the test swap of Biocultural Botany. With regards to Herpetology, we need to go through a couple of things first.

FedorSteeman commented 2 years ago

@DanBIF Shall I attempt to move the Biocultural dataset on my own? It's pretty low risk...

DanBIF commented 2 years ago

@FedorSteeman By all means, do :-) Sounds like a good idea, with me being so hung up on other things. Reach out if you run into trouble.

FedorSteeman commented 2 years ago

Done. The original data set is using the new endpoint and can (still) be found here: https://www.gbif.org/dataset/acf5050c-3a41-4345-a660-652cb9462379

The one I created last year from the new endpoint has now been deleted (archived) pointing to the above: https://www.gbif.org/dataset/a909430a-a4c0-47ee-b445-e29bc2bcc9e3

FedorSteeman commented 2 years ago

Next up: Herpetology.

DanBIF commented 2 years ago

There are three related datasets in GBIF to consider: the main one, plus "The Danish Newt Collection" and "Tanzanian Vertebrate Collection" (as far as I recall a mix including birds, amphibians and snakes). Look here, marked up "in progress": https://docs.google.com/spreadsheets/d/10oxs2h-3cK3d5UzFY9r9glNFkxk24nkvNz_rD-VhqwU/edit#gid=223577792 The datasets are "broken" when it comes to images, as they were sitting on the old ZMUC server, so it will be great to recover the images, get it all into Specify, retire the "dead" datasets and republish.

FedorSteeman commented 2 years ago

Thanks for pointing this out! To ensure linkage is retained at occurrence level, I need to adopt the occurrence IDs (UUIDs) from the datasets where the objects in question are already published.

I downloaded the following dataset for extracting occurrence IDs: https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a

Also working with @markscherz and colleagues for preparing the complete shift from FileMaker to Specify.

FedorSteeman commented 2 years ago

I noticed that the occurrence IDs associated with those specimens are not GUIDs, but URNs derived from the old ZMUC catalog numbers. On top of that, I cannot match up these ZMUC numbers with anything I have in Specify currently. It's possible that these are in FileMaker still. If so, I can ask GBIF to replace the current occurrence IDs with GUIDs and when that is done, I can set these GUIDs for the objects in question after (or before) I import these into Specify. This way, we'd (probably) retain linkage on record level. If not, then I wouldn't have a lot to go on otherwise.

@DanBIF I'm sorry if this is confusing... 😅

FedorSteeman commented 2 years ago

While waiting for Herpetology, I have turned my attention to other datasets that need to be merged, most notably Entomology.

This is the newly created dataset dynamically generated from Specify for Entomology: https://doi.org/10.15468/678mtv

The old static dataset where the Specify data should be merged into, archiving the above: https://doi.org/10.15468/nnobcm

I've already updated the occurrence IDs in Specify to match the records already published in the latter. Tests in UAT show that GBIF IDs are kept intact with this done.
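
A check of this kind could look roughly like the sketch below. It assumes two mappings from occurrenceID to gbifID, for example built from a download taken before and one taken after the swap; the function name and input shape are hypothetical.

```python
def changed_gbif_ids(before, after):
    """Report records whose gbifID changed or disappeared.

    before / after: dicts mapping occurrenceID -> gbifID.
    Returns a dict occurrenceID -> (old gbifID, new gbifID or None).
    """
    changes = {}
    for occurrence_id, old_gbif_id in before.items():
        new_gbif_id = after.get(occurrence_id)  # None if the record is gone
        if new_gbif_id != old_gbif_id:
            changes[occurrence_id] = (old_gbif_id, new_gbif_id)
    return changes
```

An empty result would mean that every previously published occurrence kept its GBIF ID.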

DanBIF commented 2 years ago

I have knowledge to share on all the entomology datasets you have marked up in google sheet https://docs.google.com/spreadsheets/d/10oxs2h-3cK3d5UzFY9r9glNFkxk24nkvNz_rD-VhqwU/edit#gid=223577792 some of them are based on private collectors. we can meet up Wednesday 2/2 if you like?

FedorSteeman commented 2 years ago

I will be in Jutland all week this week, so perhaps next week? My biggest issues with the remaining entomology datasets are:

If these correspond to collection objects in our collection then it's hard to see which since they use a strange numbering system
The occurrence IDs appear to be a combination of collection codes, subcodes and serial numbers, whereas I'm used to GUIDs

If these overlap with what's already in Specify and out on GBIF, then the occurrence IDs on GBIF's side may have to be updated. If this breaks the occurrence IDs no matter what, then we may just as well retire these datasets and live with broken links on record level. At least on dataset level, all references will lead to the main NHMD entomology one.

FedorSteeman commented 2 years ago

We've reached the Tingidae data set (https://doi.org/10.15468/wpw5ly). Done so far:

Downloaded the complete dataset to be archived
Sorting out issues encountered during the attempted initial import

Next steps:

Finish import into NHMD
Import into NHMA
Consider creating separate database/collection for private collection data
Edit the static datasets' metadata to indicate that those downloads exist (put the DOI/links to the downloads)
Send GBIF the DOI to those downloads for preservation
In addition to that, include the links to the corresponding dynamic datasets in the static datasets metadata.
Delete the static datasets. The metadata will be preserved on GBIF.org. Any user that will arrive on the static dataset page, will be able to read the metadata and find the link to the dynamic datasets and downloads.
Communicate the new status (DOIs etc) to Kimmie Møenbo Jensen

Next up: Piesmatidae

FedorSteeman commented 2 years ago

A separate ticket has been created for the Tingidae-process here: #141

@DanBIF Rather than creating separate tickets for each dataset to be imported, maybe it's best after all to limit these to this ticket and thread, as we will only work on importing and republishing one dataset at a time anyway. (?)

FedorSteeman commented 2 years ago

As from #141 :

We need to do some extra man-handling to link these records. More info here: https://data-blog.gbif.org/post/clustering-occurrences/

Next steps:

Attempt to align more data fields to achieve clustering and/or
Use the associatedOccurrences field and resource relationship extension to achieve record-level linkage

DanBIF commented 1 year ago

Started working on replacing the static source file with a dynamic connection to our Specify database for NHMD Invertebrates: https://www.gbif.org/dataset/58fc397b-eee4-4138-8a5f-50f9dfb00216 Have opened an issue with the GBIF helpdesk because of a problem with the occurrence ID.

Actions:

@AstridBVW to:

find additional mapping options for fields (remarks, expedition) we want to share - if relevant, ask permission to be on the safe side.

@FedorSteeman to:

[ ] Fix output date to be YYYY-MM-DD (if start and end then: YYYY-MM-DD/YYYY-MM-DD)
[x] Fix Metadata to include Originator and Creator (see static dataset for inspiration)

DanBIF commented 1 year ago

Marie at GBIF helped us out with the occurrence ID: she removed the dash from the catalogue numbers (like NHMD-218154 to NHMD218154) so that when the new data comes in, the GBIF IDs are maintained. So now the data for the original dataset https://www.gbif.org/dataset/58fc397b-eee4-4138-8a5f-50f9dfb00216 comes from the Specify-generated archive, and the occurrences kept their GBIF ID/occurrence URL, except they now also have occurrenceIDs.
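
Purely to illustrate the transformation described (this is not what GBIF actually ran), the dash removal could be expressed as follows. The function name and the default prefix are assumptions.

```python
import re

def strip_institution_dash(catalog_number, prefix="NHMD"):
    """Turn 'NHMD-218154' into 'NHMD218154'; leave other values untouched."""
    pattern = rf"^{re.escape(prefix)}-(\d+)$"
    # A function replacement avoids any escaping issues in the prefix
    return re.sub(pattern, lambda match: prefix + match.group(1), catalog_number)
```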

@FedorSteeman reuse extensive metadata from static dataset. Metadata available here: N:\SCI-SNM-DigitalCollections\DanBIF Retired Datasets\Invertebrates Excl. Entomology static dataset

FedorSteeman commented 1 year ago

INVZOOL has 5000+ cases with collecting event start AND end date which would be nice to be able to map as interval to GBIF. Read more: https://dwc.tdwg.org/terms/#dwc:eventDate

[ ] Map date intervals somehow (hack formatter???)
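
A minimal sketch of the intended output format, assuming the start and end dates are available as Python `date` objects (the function name is hypothetical, and this is not the Specify formatter itself):

```python
from datetime import date

def format_event_date(start, end=None):
    """Format a collecting event as YYYY-MM-DD, or as an interval
    YYYY-MM-DD/YYYY-MM-DD when a different end date is given."""
    start_text = start.isoformat()
    if end is None or end == start:
        return start_text
    if end < start:
        raise ValueError("end date precedes start date")
    return f"{start_text}/{end.isoformat()}"
```

For example, `format_event_date(date(1900, 5, 1), date(1900, 5, 3))` gives `1900-05-01/1900-05-03`.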

DanBIF commented 1 year ago

Next up for incorporation into Specify and retirement on GBIF is P.W. Lund. BEWARE that the GBIF version may have more info, such as lat./long.: https://www.gbif.org/dataset/84d8287e-f762-11e1-a439-00145eb45e9a file:///N:/SCI-SNM-zmuc.dk/VerWeb/Lund/lund_mammals.html

FedorSteeman commented 8 months ago

Revisiting the PW Lund issue that apparently was shelved for almost a year. @DanBIF Isabel and I discovered that the dataset had been distilled from the old HTML page and/or DanBIF dataset and was ready for import. After import, the images could be imported through a separate process.

Import file in question: PW-LUND-MAMMALS v3.csv

For some reason, this was never done. Looking at the prepared import file, it is 10 records shorter (78) than the original 88. We also discovered that some PW Lund mammals of at least one identical species are already present.

I will:

[ ] Check the dataset and solve the mystery of the record count mismatch
[ ] Investigate whether (some of) these records were already imported
[ ] Contact Daniel Klingberg to coordinate this effort
[ ] Import dataset when all of the above has been cleared
[ ] Coordinate with @DanBIF Isabel to retire old dataset
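
One way to approach the record count mismatch is to list the identifiers that occur in the original dataset but not in the prepared import file. The sketch below assumes both are available as CSV text sharing an identifier column; the column name `occurrenceID` and the function name are assumptions.

```python
import csv
import io

def missing_ids(original_csv, prepared_csv, key="occurrenceID"):
    """Return the sorted identifiers present in original_csv but absent
    from prepared_csv. Both arguments are CSV text with a header row."""
    def ids(text):
        return {row[key] for row in csv.DictReader(io.StringIO(text))}
    return sorted(ids(original_csv) - ids(prepared_csv))
```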

FedorSteeman commented 6 months ago

It appears the PW Lund Mammal data was already imported June last year but not set to be published while awaiting retirement procedures.

Will initiate retirement of old dataset and thereafter publication asap.

From: Fedor Steeman Sent: 29 June 2023 09:41 To: Daniel Klingberg Johansson [dkjohansson@snm.ku.dk](mailto:dkjohansson@snm.ku.dk); icalabuig@snm.ku.dk; Astrid Blok van Witteloostuijn [astrid.blok@snm.ku.dk](mailto:astrid.blok@snm.ku.dk) Cc: Zsuzsanna Papp [zsuzsanna.papp@snm.ku.dk](mailto:zsuzsanna.papp@snm.ku.dk) Subject: PW Lund mammals

Hi all,

By the way, I got the missing rows from PW Lund's mammals imported into Specify.

I have given them the projectNumber "PW Lund Mammals" so they are easy to find. I have created a query for Daniel with the same name, so he can go straight to reviewing them.

On the GBIF side, I have transferred the occurrenceIds from the original static dataset to the corresponding rows. The static dataset is still online here: https://www.gbif.org/dataset/84d8287e-f762-11e1-a439-00145eb45e9a

For now, I have set the corresponding rows in Specify not to be published until the above dataset is retired. After all, we want to keep record-level linkage, so that the GBIF IDs of the rows in question become identical to those from the static dataset. If necessary, we will have to ask GBIF to make sure this happens correctly.

Kind regards,

FedorSteeman commented 6 months ago

PW Lund mammals:

We retired the old static dataset of PW Lund Mammals here: https://www.gbif.org/dataset/84d8287e-f762-11e1-a439-00145eb45e9a

The corresponding records are now part of the NHMD Mammalogy collection, and have now been published and merged as part of the dynamic dataset here: https://www.gbif.org/dataset/78b270c5-a5fe-4f1d-b87e-eb0dd7b7ae02

Also mapped Specify "projectNumber" to DwC "datasetName" so that the records can be filtered using: https://www.gbif.org/occurrence/search?advanced=1&dataset_name=PW%20Lund%20Mammals
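
Filter links of this form can be built for any datasetName value. A small sketch, using only the query parameters visible in the URL above (the function name is hypothetical):

```python
from urllib.parse import quote, urlencode

def dataset_name_search_url(dataset_name):
    """Build a GBIF occurrence search URL filtered on dataset_name."""
    # quote_via=quote encodes spaces as %20 rather than '+'
    query = urlencode({"advanced": 1, "dataset_name": dataset_name}, quote_via=quote)
    return f"https://www.gbif.org/occurrence/search?{query}"
```

Calling `dataset_name_search_url("PW Lund Mammals")` reproduces the link above.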

FedorSteeman commented 6 months ago

Next target is Tanzanian Vertebrate Collection: https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a

To eventually be retired to: https://www.gbif.org/dataset/8c834f97-c5df-4280-9623-86594979f91a https://www.gbif.org/dataset/e5c5cad9-b987-4bb4-b07b-85488c5fdd80

Examples of corresponding records: https://www.gbif.org/occurrence/115958982 & https://www.gbif.org/occurrence/3778580506 (Afrixalus sylvaticus AMPH)

https://www.gbif.org/occurrence/115958973 & https://www.gbif.org/occurrence/3314887583 (Batis crypta; AVES)

FedorSteeman commented 5 months ago

The original zmuc.dk site has been backed up to a network drive here: N:\SCI-SNM-zmuc.dk\VerWeb\Tanzanian_Vertebrates

We first focused on the Tanzanian bird collection specifically of which the static data set is here: https://www.gbif.org/dataset/af3bce08-0599-45a6-9bfc-08188bcd868e

Looking at a single trial taxon (Linurgus olivaceus subsp. kilimensis) revealed that apparently all these bird specimen records had already been imported into Specify in 2010 and that the occurrence IDs were synchronized between this static dataset and the dynamic one over here. This was probably done right after this point.

However, despite the occurrence IDs being identical, the GBIF IDs are not, e.g.: https://www.gbif.org/occurrence/455916128 vs https://www.gbif.org/occurrence/4164075380

Also, we noticed that blood sample information stored in occurrence remarks in the static dataset appeared to be missing from the visible fields of the dynamic dataset, although it was retained in a text1 field re-captioned as "Imported from Papis". See the bottom of the Specify entry here: https://specify-snm.science.ku.dk/specify/view/collectionobject/169529/

We need to find out how we can retire the static bird dataset if the GBIF IDs are not identical. Also we need to find out if the blood sample information should be moved to a more visible field, probably under the Tissue preparation event.

I will:

Write an e-mail to GBIF asking how we can retire the old Tanzanian bird dataset in light of the above
Write an e-mail to Pete Hosner asking about the blood sample information

Next time we shall look at the Tanzanian Vertebrates datasets

FedorSteeman commented 5 months ago

Results so far:

GBIF just needs the occurrence IDs and then they will enable the retirement (deletion) of the old dataset
Pete Hosner replied they would like the blood sample data transferred to the remarks field of the relevant preparation event (presumably the tissue)

FedorSteeman commented 4 months ago

Blood sample info was already transferred to custom fields for each preparation, e.g.:

https://specify-snm.science.ku.dk/specify/view/preparation/172024/

GBIF needs occurrenceID pairs for the conversion, so I will send them asap.

FedorSteeman commented 4 months ago

Tanzanian birds dataset has been retired by GBIF using the occurrenceIDs.

DanBIF commented 4 months ago

I think now next up is Tanzanian Vertebrate Collection: https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a see https://github.com/NHMDenmark/DanSpecify/issues/101#issuecomment-2060916482

FedorSteeman commented 4 months ago

Retired Tanzanian Vertebrate Collection (https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a)

Waiting for indexing to remove all occurrences too.

DanBIF commented 4 months ago

Occurrences disappeared after a little while. When retiring, in https://registry.gbif.org/dataset/search:

1) Set "is duplicate of" to the dataset key or name of the dataset that replaces the dataset to be retired
2) If relevant, add some text to the dataset description on why/how it was retired, with "replaced by" details
3) Three dots in the top right corner: delete dataset
4) Crawl the dataset

[x] 5) delete the static dataset from the IPT - to be done when AU IPT is up and running again

FedorSteeman commented 3 months ago

We've looked at the following datasets that constitute herpetological specimens:

Amphibians and Reptiles collection at the Natural History Museum of Denmark (SNM) https://doi.org/10.15468/uakqta
The Danish Newt Collection https://doi.org/10.15468/ypqrho

The bulk of our Herpetology data is still to be imported into Specify and @markscherz has been preparing their data from FileMaker for import. Once that is done, we can start preparing the retirement of the above datasets, meaning:

[ ] Consider setting newly imported specimens as not published initially in relation with dataset retirement
[ ] Consider how to synchronize occurrence IDs from existing records overlapping between static and dynamic datasets

FedorSteeman commented 3 months ago

Next, we checked out the entomological datasets related to the project of Kimmie Mønbo's thesis on lacebugs:

Danish Tingidae https://doi.org/10.15468/wpw5ly
Danish Piesmatidae https://doi.org/10.15468/51dpya

NOTE: These static datasets should not be retired, because they should be retained in their integrity. They contain a mix of specimens from NHMA, NHMD and private collectors. For Mønbo's thesis the dataset should be retained as it is referenced from there.

Tingidae has already been imported into NHMD's Specify (though not into NHMA's). Piesmatidae has not.

Todo:

[ ] Correct Piesmatidae institution code from SNM to NHMD & republish @DanBIF
[ ] Import NHMD Piesmatidae into Specify
[ ] Import NHMA Piesmatidae into Specify
[ ] Import NHMA Tingidae into Specify
[ ] @DanBIF to check if DOIs are actually used in her thesis and published paper
[ ] @DanBIF to check amount of specimens from private collectors & observations from literature
