NHMDenmark / DaSSCo-Transcription

Work on transcription of specimen data from images as part of mass digitisation workflows and pipelines

Prepare data for v1 (DaSSCo dung beetles) #12

Open PipBrewer opened 2 months ago

PipBrewer commented 2 months ago

This task involves preparing the data for use on v1 of the transcription platform. The data set is the dung beetles imaged by DaSSCo for Aslak and Alexey's research project, and the information is needed urgently for that project.

This task should be completed by October 2024.

PipBrewer commented 2 months ago

Email conversation with curators and collections managers regarding proposed changes to Specify to facilitate this project: 20240916 RE_ Meeting notes about transcribing Danish dung beetles (with collection managers on 15_09_2024) involving changes to Specify UI.pdf

Waiting on test set up showing the additions, for final approval before making the changes in Specify live.

PipBrewer commented 2 months ago

We need to figure out what to do with the images, as the image pipeline is not up and running and Specify is not yet connected to the Asset Registry System (the replacement Specify web asset server).

We could just upload images to Specify now, which will store the images on the current web asset server. However, they would need some processing first (reading barcodes, cropping, downsampling, etc.), which would be tedious to do manually for this many specimens (c. 5,000).
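For illustration, a minimal sketch of what that processing could look like, assuming Python with Pillow and pyzbar; the folder paths, target size and JPEG quality are placeholders rather than the actual DaSSCo setup:

```python
# Hedged sketch: read the specimen barcode from each TIFF and write a
# downsampled JPEG. Paths, sizes and quality are placeholder assumptions.
from pathlib import Path
from PIL import Image
from pyzbar.pyzbar import decode

SOURCE = Path(r"N:/TIF/WORKPIOF0002")   # hypothetical source folder
OUTPUT = Path(r"N:/JPEG_processed")     # hypothetical output folder

def process_image(tif_path):
    """Read the barcode, then save a downsampled JPEG; returns the barcode or None."""
    with Image.open(tif_path) as img:
        barcodes = decode(img)                         # [] if no barcode is readable
        barcode = barcodes[0].data.decode() if barcodes else None
        jpeg = img.convert("RGB")
        jpeg.thumbnail((3000, 3000), Image.LANCZOS)    # downsample, keep aspect ratio
        OUTPUT.mkdir(parents=True, exist_ok=True)
        jpeg.save(OUTPUT / (tif_path.stem + ".jpg"), "JPEG", quality=90)
    return barcode

if __name__ == "__main__":
    for tif in sorted(SOURCE.rglob("*.tif")):
        print(tif.name, process_image(tif) or "NO BARCODE READ")
```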

We really need an image processing script to be run on these images, their metadata updated, and the results imported to Specify (and then removed from the N drive). We need a workflow for this.

To start with, it would be good to be able to identify where all of the images for the dung beetle project are located (which folders exactly, ideally as a list), to read the barcodes and match them with the image GUIDs, look at the images and see what kind of processing is definitely needed, and check the data associated with these in Specify to see whether this data set has any complications like Multi-Object Specimens, Multi-Specimen Objects, etc.
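Purely as an illustration of that inventory step, a sketch along these lines could produce the folder list, assuming a CSV that maps image paths to the barcodes read from them (e.g. the output of a processing run like the sketch above) and a plain-text export of the dung beetle catalogue numbers from Specify; all file and column names here are assumptions:

```python
# Hedged sketch of the inventory step: match read barcodes against the dung
# beetle catalogue numbers and list the folders that contain matches.
import csv
from pathlib import Path

wanted = set(Path("dung_beetle_barcodes.txt").read_text().split())   # exported from Specify

folders, found = set(), set()
with open("image_barcodes.csv", newline="") as f:                    # columns: path, barcode (assumed)
    for row in csv.DictReader(f):
        if row["barcode"] in wanted:
            folders.add(str(Path(row["path"]).parent))
            found.add(row["barcode"])

for folder in sorted(folders):
    print(folder)
print(f"{len(found)} of {len(wanted)} barcodes located; {len(wanted - found)} not yet found")
```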

AstridBVW commented 1 month ago

The mock-up for the adjustments to the Specify UI was finished last week and shown to the collection managers yesterday. They approved the mock-up, and it has now been implemented in Specify live. The following adjustments have been made:

Locality UI

Collecting Information UI

AstridBVW commented 1 month ago

> We need to figure out what to do with the images, as the image pipeline is not up and running and Specify is not yet connected to the Asset Registry System (the replacement Specify web asset server).

> We could just upload images to Specify now, which will store the images on the current web asset server. However, they would need some processing first (reading barcodes, cropping, downsampling, etc.), which would be tedious to do manually for this many specimens (c. 5,000).

I think we should store the images on the current web asset server so we can get going with the transcription. Didn't Thomas work on the image processing before he left? I think he had scripts for reading barcodes (and associating them with the GUID in a database) and downsampling (creating JPEGs). I don't know about cropping. Who has taken over the image processing after Thomas?

> We really need an image processing script to be run on these images, their metadata updated, and the results imported to Specify (and then removed from the N drive). We need a workflow for this.

Again, who should be working on this?

> To start with, it would be good to be able to identify where all of the images for the dung beetle project are located (which folders exactly, ideally as a list), to read the barcodes and match them with the image GUIDs, look at the images and see what kind of processing is definitely needed, and check the data associated with these in Specify to see whether this data set has any complications like Multi-Object Specimens, Multi-Specimen Objects, etc.

I can work on identifying which folders the dung beetle images are located in and make a list. Reading the barcodes and matching them with the image GUIDs is what Thomas was doing. For the processing of the images, it would be good to get some input from someone else. I can check the associated Specify records for any complications.

AstridBVW commented 1 month ago

Other updates:

I have figured out how to use the database that Thomas made for the images to create a mapping file for importing the images to Specify and associating them with their respective records. So when the images are ready for import (after processing etc.), I will be ready to import them immediately.
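For reference, a minimal sketch of how such a mapping file could be generated, assuming the barcode/GUID database is SQLite and guessing at the table and column names (they would need to be checked against Thomas's actual schema):

```python
# Hedged sketch: pull barcode, GUID and file name out of the barcode/GUID
# database and write a mapping CSV. The database file, table and column names
# are guesses, not the actual schema of Thomas's database.
import csv
import sqlite3

conn = sqlite3.connect("image_barcodes.sqlite")                 # hypothetical database file
rows = conn.execute(
    "SELECT barcode, asset_guid, jpeg_filename FROM images"     # assumed schema
).fetchall()
conn.close()

with open("specify_attachment_mapping.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["catalogNumber", "guid", "attachmentLocation"])   # placeholder headings
    writer.writerows(rows)
```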

I am working with Jan and Anders to include the localities from Aslak in their Locality master list. Once that is completed, I will be able to import the list to Specify (and group).

Last I checked, not all the DigiApp exports containing dung beetle records had been imported to Specify. Once they have, I will group them in a record set.

AstridBVW commented 1 month ago

@PipBrewer You previously asked me to look into whether we could use the JPEGs for transcription or whether we needed the TIFFs instead. I looked at several of the JPEGs for dung beetles, and I think the quality is just fine for transcription.

AstridBVW commented 1 month ago

Allison is now working on reading the barcodes for all the images from WORKPIOF0002, which was used for all the dung beetles. Once that is done (sometime next week), I can use the barcode/GUID database to get a list of all the folders with images of dung beetles.

AstridBVW commented 1 month ago

All the DigiApp exports containing dung beetle records have now been imported, so I have grouped them in a record set in Specify, 7252 records in total.

PipBrewer commented 1 month ago

@AstridBVW At the moment there is no one doing image processing. I need to recruit (at least some short-term help). First I want to have a look at the images themselves and see what the situation is. Can you point me to some examples?

AstridBVW commented 1 month ago

More fields need to be added to the Specify UI: Verbatim Date and Verbatim Date Source ( #19 ). We have agreed to use the Specify data model's verbatimDate field and one of the available text fields for Verbatim Date Source. I will get approval from the collection managers and implement this.

AstridBVW commented 1 month ago

We have agreed to put the images on the Specify web asset server as long as ARS is not ready.

> At the moment there is no one doing image processing. I need to recruit (at least some short-term help). First I want to have a look at the images themselves and see what the situation is. Can you point me to some examples?

@PipBrewer All the images of dung beetles were taken using the WORKPIOF0002 setup. Imaging started late last year, but ingestion was delayed for these images, so the dates on the image folders on the N-drive start from 2024-2-19. Imaging was finished in late July. You will be able to find examples in the JPEG/WORKPIOF0002 folder, from folder 2024-2-19 onwards. Let me know if you need me to find specific images for you.
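A small sketch of how those example folders could be listed, assuming the dated folder names follow the YYYY-M-D pattern mentioned above; the root path is a placeholder:

```python
# Hedged sketch: list the dated folders under JPEG/WORKPIOF0002 from
# 2024-2-19 onwards.
from datetime import date
from pathlib import Path

ROOT = Path(r"N:/JPEG/WORKPIOF0002")   # hypothetical N-drive path
CUTOFF = date(2024, 2, 19)

for folder in sorted(ROOT.iterdir()):
    if not folder.is_dir():
        continue
    try:
        y, m, d = (int(part) for part in folder.name.split("-"))
    except ValueError:
        continue                       # skip folders that are not date-named
    if date(y, m, d) >= CUTOFF:
        print(folder)
```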

PipBrewer commented 1 month ago

To get images of dung beetles to Specify:

Attachment field values for dung beetles v1 test.xlsx

AstridBVW commented 1 month ago

List of folders with dung beetle images: I am almost done with this. I found some missing images and other issues ( #21 ), affecting 7 specimens. Once ticket #21 is complete, I will move on with the import of the images.

Specify data complications: I have looked into this. I do not see any complications. We do not have MOS for pinned insects. We have MSO, but they are unique per record (i.e. 1 NHMD record per 1 container). I did find some MSO mistakes where one container is associated with more than one record. I will make sure to fix these (EDIT: they are now fixed).
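For completeness, a quick sketch of the kind of container check described here, assuming the record set can be exported from Specify to a CSV with catalogue number and container columns (the file and column names are placeholders):

```python
# Hedged sketch of the container check: flag any container linked to more than
# one catalogue record.
import pandas as pd

records = pd.read_csv("dung_beetle_recordset_export.csv")
with_container = records.dropna(subset=["container"])
per_container = with_container.groupby("container")["catalogNumber"].nunique()
problems = per_container[per_container > 1]

print(f"{len(problems)} containers are linked to more than one record")
print(problems.to_string())
```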

AstridBVW commented 1 month ago

An update on the import of the images and their metadata:

I looked into whether the workbench could be used to import the image metadata. It looked promising, since it was possible to map to the attachment metadata fields, and it is indeed possible. I tested a couple of things. First, I tried batch uploading the images and then importing the metadata through the workbench. This worked, but it also created duplicate metadata for the images. Next, Fedor suggested placing the images directly on the web asset server instead of importing them through the Specify UI. So I tested this, first placing the image files directly in the web asset server folder on the N-drive and then importing the metadata through the workbench. And this worked! The first test only included the JPEGs, but I did a second test where I did this for both the TIF and JSON files at the same time, and that also worked. The result in Specify is the JPEG, TIF and JSON files attached to the appropriate specimen record, each associated with its corresponding metadata.
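A rough sketch of that workflow as tested, assuming a staging folder of processed files, a web asset server folder on the N-drive, and placeholder CSV columns and fixed values for the workbench import:

```python
# Hedged sketch: copy the image files straight into the web asset server
# folder, then build a CSV for the workbench mapping each catalogue number to
# its attachment and fixed metadata values. All paths, column names and the
# licence value are placeholder assumptions.
import csv
import shutil
from pathlib import Path

STAGING = Path(r"N:/dung_beetles_processed")            # hypothetical staging area
ASSET_SERVER = Path(r"N:/specify_webasset/originals")   # hypothetical asset server folder
FIELDS = ["catalogNumber", "attachmentLocation", "license", "credit"]

rows = []
for jpg in sorted(STAGING.glob("*.jpg")):
    for f in (jpg, jpg.with_suffix(".tif")):
        if f.exists():
            shutil.copy2(f, ASSET_SERVER / f.name)       # place the file directly on the server
    rows.append({
        "catalogNumber": jpg.stem,                       # assumes file name = barcode
        "attachmentLocation": jpg.name,
        "license": "CC0 1.0",                            # example fixed value only
        "credit": "Digitised by DaSSCo for Natural History Museum Denmark",
    })

with open("workbench_attachment_metadata.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```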

@PipBrewer I need some things from you. The value you have put in the metadata field "credit" is too long; the limit is 64 characters. Could you please decide how you want to shorten it and let me know. Also, the attachment field values are only for the JPEGs and TIFs. Could you please provide fixed values for the JSON files as well?

AstridBVW commented 3 weeks ago

> More fields need to be added to the Specify UI: Verbatim Date and Verbatim Date Source ( #19 ). We have agreed to use the Specify data model's verbatimDate field and one of the available text fields for Verbatim Date Source. I will get approval from the collection managers and implement this.

Fields for Verbatim Date and Verbatim Date Source have now been added to the Specify UI for NHMD Entomology.

PipBrewer commented 2 weeks ago

@AstridBVW In response to your queries:

The JSON files should not be attached to the catalogue records in Specify. They should, however, be migrated away from the usual place on the N drive at the same time as the images.

The credit should be: "Digitised by DaSSCo for Natural History Museum Denmark". This was approved by Kim and Nikolaj today by email.