Open FedorSteeman opened 3 years ago
Plan of attack:
Most straightforward collection to start with is Biocultural Botany:
Next up: Herpetology.
@DanBIF Isabel: I'm ready to do the test swap of Biocultural Botany. With regards to Herpetology, we need to go through a couple of things first.
@DanBIF Shall I attempt to move the Biocultural dataset on my own? It's pretty low risk...
@FedorSteeman By all means, do :-) Sounds like a good idea, with me being so hung up with other things. Reach out if you run into trouble.
Done. The original data set is using the new endpoint and can (still) be found here: https://www.gbif.org/dataset/acf5050c-3a41-4345-a660-652cb9462379
The one I created last year from the new endpoint has now been deleted (archived) pointing to the above: https://www.gbif.org/dataset/a909430a-a4c0-47ee-b445-e29bc2bcc9e3
Next up: Herpetology.
There are three related datasets in GBIF to consider: the main one, plus "The Danish Newt Collection" and "Tanzanian Vertebrate Collection" (as far as I recall, a mix including birds, amphibians and snakes). Look here, marked up "in progress": https://docs.google.com/spreadsheets/d/10oxs2h-3cK3d5UzFY9r9glNFkxk24nkvNz_rD-VhqwU/edit#gid=223577792 The datasets are "broken" when it comes to images, as these were sitting on the old ZMUC server, so it will be great to recover the images, get it all into Specify, retire the "dead" datasets and republish.
Thanks for pointing this out! To make sure linkage is not broken at occurrence level, I need to adopt the occurrence IDs (UUIDs) from the datasets where the objects in question are already published.
I downloaded the following dataset for extracting occurrence IDs: https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a
Also working with @markscherz and colleagues for preparing the complete shift from FileMaker to Specify.
I noticed that the occurrence IDs associated with those specimens are not GUIDs, but URNs derived from the old ZMUC catalog numbers. On top of that, I cannot match up these ZMUC numbers with anything I have in Specify currently. It's possible that these are in FileMaker still. If so, I can ask GBIF to replace the current occurrence IDs with GUIDs and when that is done, I can set these GUIDs for the objects in question after (or before) I import these into Specify. This way, we'd (probably) retain linkage on record level. If not, then I wouldn't have a lot to go on otherwise.
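To get an overview of which published occurrence IDs are GUIDs and which are URNs, something like the following could be run over the downloaded dataset. This is only a minimal sketch: the function name is made up for illustration, and the URN in the usage note is a hypothetical example, not an actual ID from the ZMUC data.

```python
import uuid


def classify_occurrence_id(occurrence_id: str) -> str:
    """Return 'uuid', 'urn' or 'other' for an occurrenceID string."""
    value = occurrence_id.strip()
    try:
        # uuid.UUID also accepts the 'urn:uuid:' prefixed form,
        # so such values are counted as UUIDs here.
        uuid.UUID(value)
        return "uuid"
    except ValueError:
        pass
    if value.lower().startswith("urn:"):
        return "urn"
    return "other"
```

For example, `classify_occurrence_id("urn:catalog:ZMUC:Herp:R-123")` (a hypothetical URN) gives `"urn"`, while a plain UUID string gives `"uuid"`.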
@DanBIF I'm sorry if this is confusing... 😅
While waiting for Herpetology, I have turned my attention to other datasets that need to be merged, most notably Entomology.
This is the newly created dataset dynamically generated from Specify for Entomology: https://doi.org/10.15468/678mtv
The old static dataset where the Specify data should be merged into, archiving the above: https://doi.org/10.15468/nnobcm
I've already updated the occurrence IDs in Specify to match the records already published in the latter. Tests in UAT show that GBIF IDs are kept intact with this done.
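For reference, the matching step could look roughly like the sketch below. It assumes both sides are available as rows with the Darwin Core columns `catalogNumber` and `occurrenceID`, and that the catalog number is a reliable match key; that is an assumption and may not hold for every dataset. The function name is hypothetical.

```python
def adopt_occurrence_ids(published_rows, specify_rows):
    """Copy occurrenceIDs from already published rows onto Specify rows.

    Rows are dicts; matching is done on catalogNumber (assumed match key).
    Returns (updated_rows, unmatched_rows).
    """
    published_ids = {
        row["catalogNumber"].strip(): row["occurrenceID"]
        for row in published_rows
    }
    updated, unmatched = [], []
    for row in specify_rows:
        key = row["catalogNumber"].strip()
        if key in published_ids:
            # Keep the Specify row, but adopt the published occurrenceID.
            updated.append({**row, "occurrenceID": published_ids[key]})
        else:
            unmatched.append(row)
    return updated, unmatched
```

The unmatched rows are returned separately so they can be checked by hand before anything is republished.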
I have knowledge to share on all the entomology datasets you have marked up in the Google Sheet: https://docs.google.com/spreadsheets/d/10oxs2h-3cK3d5UzFY9r9glNFkxk24nkvNz_rD-VhqwU/edit#gid=223577792 Some of them are based on private collectors. We can meet up Wednesday 2/2 if you like?
I will be in Jutland all week this week, so perhaps next week? My biggest issues with the remaining entomology datasets are:
If these overlap with what's already in Specify and out on GBIF, then the occurrence IDs on GBIF's side may have to be updated. If this breaks the occurrence IDs no matter what, then we may just as well retire these datasets and live with broken links on record level. At least on dataset level, all references will lead to the main NHMD entomology one.
We've reached the Tingidae data set (https://doi.org/10.15468/wpw5ly). Done so far:
Next steps:
Next up: Piesmatidae
A separate ticket has been created for the Tingidae-process here: #141
@DanBIF Rather than creating separate tickets for each dataset to be imported, maybe it's best after all to limit these to this ticket and thread, as we will only work on importing and republishing one dataset at a time anyway. (?)
As from #141 :
We need to do some extra man-handling to link these records. More info here: https://data-blog.gbif.org/post/clustering-occurrences/
Next steps:
Started working on replacing the static source file with a dynamic connection to our Specify database for NHMD Invertebrates: https://www.gbif.org/dataset/58fc397b-eee4-4138-8a5f-50f9dfb00216
Have opened an issue with the GBIF helpdesk because of a problem with the occurrence IDs.
Action @AstridBVW to:
@FedorSteeman to:
Marie at GBIF helped us out with the occurrence IDs: she removed the dash from the catalogue numbers (e.g. NHMD-218154 became NHMD218154) so that the GBIF IDs are maintained when the new data comes in. The data for the original dataset https://www.gbif.org/dataset/58fc397b-eee4-4138-8a5f-50f9dfb00216 now comes from the Specify-generated archive, and the occurrences kept their GBIF ID/occurrence URL; the only difference is that they now also have occurrenceIDs.
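The dash removal itself was done on GBIF's side. Just to illustrate the transformation (for instance, to check our own exports against it), a small sketch could look like this. The function name is made up, and it assumes the only pattern to handle is the `NHMD-` prefix followed by digits.

```python
import re

# Assumed pattern: 'NHMD', a dash, then digits only.
_DASHED_NUMBER = re.compile(r"^(NHMD)-(\d+)$")


def strip_catalog_dash(catalog_number: str) -> str:
    """Turn 'NHMD-218154' into 'NHMD218154'; leave other values untouched."""
    return _DASHED_NUMBER.sub(r"\1\2", catalog_number.strip())
```

Values that do not match the assumed pattern are returned unchanged (apart from surrounding whitespace), so nothing else gets rewritten by accident.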
@FedorSteeman reuse extensive metadata from static dataset. Metadata available here: N:\SCI-SNM-DigitalCollections\DanBIF Retired Datasets\Invertebrates Excl. Entomology static dataset
INVZOOL has 5000+ cases with collecting event start AND end date which would be nice to be able to map as interval to GBIF. Read more: https://dwc.tdwg.org/terms/#dwc:eventDate
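Darwin Core `eventDate` follows ISO 8601, which allows an interval written as start and end separated by a slash. A minimal sketch of how a start and end date could be combined is below; the function name is hypothetical, and how this would actually be wired into the Specify export mapping is not covered here.

```python
from datetime import date
from typing import Optional


def to_event_date(start: date, end: Optional[date] = None) -> str:
    """Format a collecting event as a single ISO 8601 date or an interval."""
    if end is None or end == start:
        return start.isoformat()
    if end < start:
        raise ValueError("end date precedes start date")
    # ISO 8601 interval: start and end separated by a slash.
    return f"{start.isoformat()}/{end.isoformat()}"
```

For example, `to_event_date(date(2007, 3, 1), date(2007, 3, 15))` gives `2007-03-01/2007-03-15`.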
Next up for incorporation in Specify and retirement on GBIF is P.W. Lund. BEWARE that the GBIF version may have more info, such as lat./long.: https://www.gbif.org/dataset/84d8287e-f762-11e1-a439-00145eb45e9a file:///N:/SCI-SNM-zmuc.dk/VerWeb/Lund/lund_mammals.html
Revisiting the PW Lund issue that apparently was shelved for almost a year. @DanBIF Isabel and I discovered that the dataset had been distilled from the old HTML page and/or DanBIF dataset and was ready for import. After import, the images could be imported through a separate process.
Import file in question: PW-LUND-MAMMALS v3.csv
For some reason, this was never done. Looking at the prepared import file, it has 78 records, 10 fewer than the original 88. We also discovered that some PW Lund mammals of at least one identical species are already present.
I will:
It appears the PW Lund Mammal data was already imported in June last year, but not set to be published while awaiting retirement procedures.
Will initiate retirement of old dataset and thereafter publication asap.
From: Fedor Steeman
Sent: 29 June 2023 09:41
To: Daniel Klingberg Johansson [dkjohansson@snm.ku.dk](mailto:dkjohansson@snm.ku.dk); icalabuig@snm.ku.dk; Astrid Blok van Witteloostuijn [astrid.blok@snm.ku.dk](mailto:astrid.blok@snm.ku.dk)
Cc: Zsuzsanna Papp [zsuzsanna.papp@snm.ku.dk](mailto:zsuzsanna.papp@snm.ku.dk)
Subject: PW Lund mammals
Hi all,
By the way, I managed to import the missing rows from PW Lund's mammals into Specify.
I have given them the projectNumber "PW Lund Mammals" so they are easy to find. I have created a query for Daniel with the same name, so he can go straight to reviewing them.
On the GBIF side, I have transferred the occurrenceIDs from the original static dataset to the corresponding rows. The static dataset is still online here: https://www.gbif.org/dataset/84d8287e-f762-11e1-a439-00145eb45e9a
For now, I have set the corresponding rows in Specify not to be published until the above dataset has been retired. After all, we want to keep record-level linkage, so that the GBIF IDs of the rows in question become identical to those from the static dataset. If necessary, we will have to ask GBIF to make sure this happens correctly.
Kind regards,
PW Lund mammals:
We retired the old static dataset of PW Lund Mammals here: https://www.gbif.org/dataset/84d8287e-f762-11e1-a439-00145eb45e9a
The corresponding records are now part of the NHMD Mammalogy collection, and have now been published and merged as part of the dynamic dataset here: https://www.gbif.org/dataset/78b270c5-a5fe-4f1d-b87e-eb0dd7b7ae02
Also mapped Specify "projectNumber" to DwC "datasetName" so that the records can be filtered using: https://www.gbif.org/occurrence/search?advanced=1&dataset_name=PW%20Lund%20Mammals
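If we end up using `datasetName` for more sub-collections, the filter link can be generated rather than typed by hand. A minimal sketch, assuming the same gbif.org search URL pattern as above keeps working (the function name is made up):

```python
from urllib.parse import quote, urlencode


def dataset_name_search_url(dataset_name: str) -> str:
    """Build a gbif.org occurrence search link filtered on dataset name."""
    # quote_via=quote encodes spaces as %20 instead of '+'.
    query = urlencode(
        {"advanced": 1, "dataset_name": dataset_name},
        quote_via=quote,
    )
    return f"https://www.gbif.org/occurrence/search?{query}"
```

Calling it with `"PW Lund Mammals"` reproduces the link given above.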
Next target is Tanzanian Vertebrate Collection: https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a
To eventually be retired to: https://www.gbif.org/dataset/8c834f97-c5df-4280-9623-86594979f91a https://www.gbif.org/dataset/e5c5cad9-b987-4bb4-b07b-85488c5fdd80
Examples of corresponding records: https://www.gbif.org/occurrence/115958982 & https://www.gbif.org/occurrence/3778580506 (Afrixalus sylvaticus AMPH)
https://www.gbif.org/occurrence/115958973 & https://www.gbif.org/occurrence/3314887583 (Batis crypta; AVES)
The original zmuc.dk site has been backed up to a network drive here: N:\SCI-SNM-zmuc.dk\VerWeb\Tanzanian_Vertebrates
We first focused on the Tanzanian bird collection specifically of which the static data set is here: https://www.gbif.org/dataset/af3bce08-0599-45a6-9bfc-08188bcd868e
Looking at a single trial taxon (Linurgus olivaceus subsp. kilimensis) revealed that apparently all these bird specimen records had already been imported into Specify in 2010, and that the occurrence IDs were synchronized between this static dataset and the dynamic one over here. This was probably done right after this point.
However, despite the occurrence IDs being identical, the GBIF IDs are not, e.g.: https://www.gbif.org/occurrence/455916128 vs https://www.gbif.org/occurrence/4164075380
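To see how widespread this is, the two datasets could be compared on occurrence ID. The sketch below assumes both have been downloaded from GBIF and loaded as rows with `gbifID` and `occurrenceID` columns; the function name is hypothetical.

```python
def find_gbif_id_mismatches(static_rows, dynamic_rows):
    """List records sharing an occurrenceID but having different gbifIDs.

    Returns tuples of (occurrenceID, static gbifID, dynamic gbifID).
    """
    dynamic_by_occurrence = {
        row["occurrenceID"]: row["gbifID"] for row in dynamic_rows
    }
    mismatches = []
    for row in static_rows:
        occurrence_id = row["occurrenceID"]
        dynamic_gbif_id = dynamic_by_occurrence.get(occurrence_id)
        if dynamic_gbif_id is not None and dynamic_gbif_id != row["gbifID"]:
            mismatches.append((occurrence_id, row["gbifID"], dynamic_gbif_id))
    return mismatches
```

Records that exist in only one of the two datasets are ignored here; they would need a separate check.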
Also, we noticed that the blood sample information stored in occurrence remarks in the static dataset appeared to be lacking at a visible level in the dynamic dataset, although it was retained in a text1 field re-captioned as "Imported from Papis". See at the bottom of the Specify entry here: https://specify-snm.science.ku.dk/specify/view/collectionobject/169529/
We need to find out how we can retire the static bird dataset if the GBIF IDs are not identical. Also we need to find out if the blood sample information should be moved to a more visible field, probably under the Tissue preparation event.
I will:
Next time we shall look at the Tanzanian Vertebrates datasets
Results so far:
Blood sample info was already transferred to custom fields for each preparation, e.g.:
https://specify-snm.science.ku.dk/specify/view/preparation/172024/
GBIF needs occurrenceID pairs for the conversion, so I will send them asap.
Tanzanian birds dataset has been retired by GBIF using the occurrenceIDs.
I think now next up is Tanzanian Vertebrate Collection: https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a see https://github.com/NHMDenmark/DanSpecify/issues/101#issuecomment-2060916482
Retired Tanzanian Vertebrate Collection (https://www.gbif.org/dataset/86523cda-f762-11e1-a439-00145eb45e9a)
Waiting for indexing to remove all occurrences too.
Occurrences disappeared after a little while.
When retiring, in https://registry.gbif.org/dataset/search:
1) Set "is duplicate of" to the dataset key or name of the dataset that replaces the dataset to be retired
2) If relevant, add some text in the dataset description on why/how it was retired, with "replaced by" details
3) Three dots in the top right corner: delete dataset
4) Crawl the dataset
We've looked at the following datasets that constitute herpetological specimens:
The bulk of our Herpetology data is still to be imported into Specify and @markscherz has been preparing their data from FileMaker for import. Once that is done, we can start preparing the retirement of the above datasets, meaning:
Next, we checked out the entomological datasets related to the project of Kimmie Mønbo's thesis on lacebugs:
NOTE: These static datasets should not be retired, because they should be retained in their integrity. They contain a mix of specimens from NHMA, NHMD and private collectors. For Mønbo's thesis the dataset should be retained as it is referenced from there.
Tingidae has already been imported into NHMD's Specify (though not into NHMA's). Piesmatidae has not.
Todo:
As we've started publishing data directly from Specify to GBIF, we're creating an overlap with datasets that were already published as static files by DanBIF in earlier years. Many of these datasets already have multiple citations. To retain citation linkage, we need to diligently relocate the datasets in question by redirecting endpoints for some, while archiving others. To ensure linkage is retained at occurrence record level, the occurrence IDs already set need to be adopted into the new source, which is Specify.