NHMDenmark / DaSSCo-Transcription

Work on transcription of specimen data from images as part of mass digitisation workflows and pipelines

Prepare data for v1 (DaSSCo dung beetles) #12

Open PipBrewer opened 2 months ago

PipBrewer commented 2 months ago

This task involves preparing the data for use on v1 of the transcription platform. The data set is the dung beetles imaged by DaSSCo for Aslak and Alexey's research project, and the information is needed urgently for that project.

This task should be completed by October 2024.

PipBrewer commented 2 months ago

Email conversation with curators and collections managers regarding proposed changes to Specify to facilitate this project: 20240916 RE_ Meeting notes about transcribing Danish dung beetles (with collection managers on 15_09_2024) involving changes to Specify UI.pdf

Waiting on test set up showing the additions, for final approval before making the changes in Specify live.

PipBrewer commented 2 months ago

We need to figure out what to do with the images, as the image pipeline is not up and running and Specify is not yet connected to the Asset Registry System (the replacement Specify web asset server).

We could just upload images to Specify now, which will store the images on the current web asset server. However, they would need some processing first (reading barcodes, cropping, downsampling, etc.), which would be tedious to do manually for this many specimens (c. 5,000).
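For illustration, a minimal sketch of what that processing could look like, assuming Python with Pillow and pyzbar; the folder paths, target size and JPEG quality are placeholders rather than the actual DaSSCo setup:

```python
# Hedged sketch: read the specimen barcode from each TIFF and write a
# downsampled JPEG. Paths, sizes and quality are placeholder assumptions.
from pathlib import Path
from PIL import Image
from pyzbar.pyzbar import decode

SOURCE = Path(r"N:/TIF/WORKPIOF0002")   # hypothetical source folder
OUTPUT = Path(r"N:/JPEG_processed")     # hypothetical output folder

def process_image(tif_path):
    """Read the barcode, then save a downsampled JPEG; returns the barcode or None."""
    with Image.open(tif_path) as img:
        barcodes = decode(img)                         # [] if no barcode is readable
        barcode = barcodes[0].data.decode() if barcodes else None
        jpeg = img.convert("RGB")
        jpeg.thumbnail((3000, 3000), Image.LANCZOS)    # downsample, keep aspect ratio
        OUTPUT.mkdir(parents=True, exist_ok=True)
        jpeg.save(OUTPUT / (tif_path.stem + ".jpg"), "JPEG", quality=90)
    return barcode

if __name__ == "__main__":
    for tif in sorted(SOURCE.rglob("*.tif")):
        print(tif.name, process_image(tif) or "NO BARCODE READ")
```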

We really need an image processing script to be run on these images, their metadata updated, and the results imported to Specify (and then removed from the N drive). We need a workflow for this.

To start with, it would be good to be able to identify where all of the images for the dung beetle project are located (which folders exactly, ideally as a list), to read the barcodes and match them with the image GUIDs, look at the images and see what kind of processing is definitely needed, and check the data associated with these in Specify to see whether this data set has any complications like Multi-Object Specimens, Multi-Specimen Objects, etc.
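Purely as an illustration of that inventory step, a sketch along these lines could produce the folder list, assuming a CSV that maps image paths to the barcodes read from them (e.g. the output of a processing run like the sketch above) and a plain-text export of the dung beetle catalogue numbers from Specify; all file and column names here are assumptions:

```python
# Hedged sketch of the inventory step: match read barcodes against the dung
# beetle catalogue numbers and list the folders that contain matches.
import csv
from pathlib import Path

wanted = set(Path("dung_beetle_barcodes.txt").read_text().split())   # exported from Specify

folders, found = set(), set()
with open("image_barcodes.csv", newline="") as f:                    # columns: path, barcode (assumed)
    for row in csv.DictReader(f):
        if row["barcode"] in wanted:
            folders.add(str(Path(row["path"]).parent))
            found.add(row["barcode"])

for folder in sorted(folders):
    print(folder)
print(f"{len(found)} of {len(wanted)} barcodes located; {len(wanted - found)} not yet found")
```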

AstridBVW commented 1 month ago

The mock-up for the adjustments to the Specify UI was finished last week and shown to the collection managers yesterday. They approved the mock-up, and it has now been implemented in Specify live. The following adjustments have been made:

Locality UI

Collecting Information UI

AstridBVW commented 1 month ago

> We need to figure out what to do with the images, as the image pipeline is not up and running and Specify is not yet connected to the Asset Registry System (the replacement Specify web asset server).

> We could just upload images to Specify now, which will store the images on the current web asset server. However, they would need some processing first (reading barcodes, cropping, downsampling, etc.), which would be tedious to do manually for this many specimens (c. 5,000).

I think we should store the images on the current web asset server so we can get going with the transcription. Didn't Thomas work on the image processing before he left? I think he had scripts for reading barcodes (and associating them with the GUID in a database) and downsampling (creating JPEGs). I don't know about cropping. Who has taken over the image processing after Thomas?

> We really need an image processing script to be run on these images, their metadata updated, and the results imported to Specify (and then removed from the N drive). We need a workflow for this.

Again, who should be working on this?

> To start with, it would be good to be able to identify where all of the images for the dung beetle project are located (which folders exactly, ideally as a list), to read the barcodes and match them with the image GUIDs, look at the images and see what kind of processing is definitely needed, and check the data associated with these in Specify to see whether this data set has any complications like Multi-Object Specimens, Multi-Specimen Objects, etc.

I can work on identifying which folders the dung beetle images are located in and make a list. Reading the barcodes and matching them with the image GUIDs is what Thomas was doing. For the processing of the images, it would be good to get some input from someone else. I can check the associated Specify records for any complications.

AstridBVW commented 1 month ago

Other updates:

I have figured out how to use the database that Thomas made for the images to create a mapping file for importing the images to Specify and associating them with their respective records. So when the images are ready for import (after processing etc.), I will be ready to import them immediately.
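For reference, a minimal sketch of how such a mapping file could be generated, assuming the barcode/GUID database is SQLite and guessing at the table and column names (they would need to be checked against Thomas's actual schema):

```python
# Hedged sketch: pull barcode, GUID and file name out of the barcode/GUID
# database and write a mapping CSV. The database file, table and column names
# are guesses, not the actual schema of Thomas's database.
import csv
import sqlite3

conn = sqlite3.connect("image_barcodes.sqlite")                 # hypothetical database file
rows = conn.execute(
    "SELECT barcode, asset_guid, jpeg_filename FROM images"     # assumed schema
).fetchall()
conn.close()

with open("specify_attachment_mapping.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["catalogNumber", "guid", "attachmentLocation"])   # placeholder headings
    writer.writerows(rows)
```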

I am working with Jan and Anders to include the localities from Aslak in their Locality master list. Once that is completed, I will be able to import the list to Specify (and group).

Last I checked, not all the DigiApp exports containing dung beetle records had been imported to Specify. Once they have, I will group them in a record set.

AstridBVW commented 1 month ago

@PipBrewer You previously asked me to look into whether we could use the JPEGs for transcription or whether we needed the TIFFs instead. I looked at several of the JPEGs for dung beetles, and I think the quality is just fine for transcription.

AstridBVW commented 1 month ago

Allison is now working on reading the barcodes for all the images from WORKPIOF0002, which was used for all the dung beetles. Once that is done (sometime next week), I can use the barcode/GUID database to get a list of all the folders with images of dung beetles.

AstridBVW commented 1 month ago

All the DigiApp exports containing dung beetle records have now been imported, so I have grouped them in a record set in Specify, 7252 records in total.

PipBrewer commented 1 month ago

@AstridBVW At the moment there is no one doing image processing. I need to recruit (at least some short-term help). First I want to have a look at the images themselves and see what the situation is. Can you point me to some examples?

AstridBVW commented 1 month ago

More fields need to be added to the Specify UI: Verbatim Date and Verbatim Date Source ( #19 ). We have agreed to use the Specify data model's verbatimDate field and one of the available text fields for Verbatim Date Source. I will get approval from the collection managers and implement this.

AstridBVW commented 1 month ago

We have agreed to put the images on the Specify web asset server as long as ARS is not ready.

> At the moment there is no one doing image processing. I need to recruit (at least some short-term help). First I want to have a look at the images themselves and see what the situation is. Can you point me to some examples?

@PipBrewer All the images of dung beetles were taken using the WORKPIOF0002 setup. Imaging started late last year, but ingestion was delayed for these images, so the dates on the image folders on the N-drive start from 2024-2-19. Imaging was finished in late July. You will be able to find examples in the JPEG/WORKPIOF0002 folder, from folder 2024-2-19 onwards. Let me know if you need me to find specific images for you.
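A small sketch of how those example folders could be listed, assuming the dated folder names follow the YYYY-M-D pattern mentioned above; the root path is a placeholder:

```python
# Hedged sketch: list the dated folders under JPEG/WORKPIOF0002 from
# 2024-2-19 onwards.
from datetime import date
from pathlib import Path

ROOT = Path(r"N:/JPEG/WORKPIOF0002")   # hypothetical N-drive path
CUTOFF = date(2024, 2, 19)

for folder in sorted(ROOT.iterdir()):
    if not folder.is_dir():
        continue
    try:
        y, m, d = (int(part) for part in folder.name.split("-"))
    except ValueError:
        continue                       # skip folders that are not date-named
    if date(y, m, d) >= CUTOFF:
        print(folder)
```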

PipBrewer commented 1 month ago

To get images of dung beetles to Specify:

Attachment field values for dung beetles v1 test.xlsx

AstridBVW commented 1 month ago

List of folders with dung beetle images: I am almost done with this. I found some missing images and other issues ( #21 ), affecting 7 specimens. Once ticket #21 is complete, I will move on with the import of the images.

Specify data complications: I have looked into this. I do not see any complications. We do not have MOS for pinned insects. We have MSO, but they are unique per record (i.e. 1 NHMD record per 1 container). I did find some MSO mistakes where one container is associated with more than one record. I will make sure to fix these (EDIT: they are now fixed).
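For completeness, a quick sketch of the kind of container check described here, assuming the record set can be exported from Specify to a CSV with catalogue number and container columns (the file and column names are placeholders):

```python
# Hedged sketch of the container check: flag any container linked to more than
# one catalogue record.
import pandas as pd

records = pd.read_csv("dung_beetle_recordset_export.csv")
with_container = records.dropna(subset=["container"])
per_container = with_container.groupby("container")["catalogNumber"].nunique()
problems = per_container[per_container > 1]

print(f"{len(problems)} containers are linked to more than one record")
print(problems.to_string())
```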

AstridBVW commented 1 month ago

An update on the import of the images and their metadata:

I looked into whether the workbench could be used to import the image metadata. It looked promising, since it was possible to map to the attachment metadata fields, and it is indeed possible. I tested a couple of things. First, I tried batch uploading the images and then importing the metadata through the workbench. This worked, but it also created duplicate metadata for the images. Next, Fedor suggested placing the images directly on the web asset server instead of importing them through the Specify UI. So I tested this, first placing the image files directly in the web asset server folder on the N-drive and then importing the metadata through the workbench. And this worked! The first test only included the JPEGs, but I did a second test where I did this for both the TIF and JSON files at the same time, and that also worked. The result in Specify is the JPEG, TIF and JSON files attached to the appropriate specimen record, each associated with its corresponding metadata.
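A rough sketch of that workflow as tested, assuming a staging folder of processed files, a web asset server folder on the N-drive, and placeholder CSV columns and fixed values for the workbench import:

```python
# Hedged sketch: copy the image files straight into the web asset server
# folder, then build a CSV for the workbench mapping each catalogue number to
# its attachment and fixed metadata values. All paths, column names and the
# licence value are placeholder assumptions.
import csv
import shutil
from pathlib import Path

STAGING = Path(r"N:/dung_beetles_processed")            # hypothetical staging area
ASSET_SERVER = Path(r"N:/specify_webasset/originals")   # hypothetical asset server folder
FIELDS = ["catalogNumber", "attachmentLocation", "license", "credit"]

rows = []
for jpg in sorted(STAGING.glob("*.jpg")):
    for f in (jpg, jpg.with_suffix(".tif")):
        if f.exists():
            shutil.copy2(f, ASSET_SERVER / f.name)       # place the file directly on the server
    rows.append({
        "catalogNumber": jpg.stem,                       # assumes file name = barcode
        "attachmentLocation": jpg.name,
        "license": "CC0 1.0",                            # example fixed value only
        "credit": "Digitised by DaSSCo for Natural History Museum Denmark",
    })

with open("workbench_attachment_metadata.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```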

@PipBrewer I need some things from you. The value you have put in the metadata field "credit" is too long; the limit is 64 characters. Could you please decide how you want to shorten it and let me know. Also, the attachment field values are only for the JPEGs and TIFs. Could you please provide fixed values for the JSON files as well?

AstridBVW commented 3 weeks ago

> More fields need to be added to the Specify UI: Verbatim Date and Verbatim Date Source ( #19 ). We have agreed to use the Specify data model's verbatimDate field and one of the available text fields for Verbatim Date Source. I will get approval from the collection managers and implement this.

Fields for Verbatim Date and Verbatim Date Source have now been added to the Specify UI for NHMD Entomology.

PipBrewer commented 2 weeks ago

@AstridBVW In response to your queries:

The JSON files should not be attached to the catalogue records in Specify. They should, however, be migrated away from the usual place on the N drive at the same time as the images.

The credit should be: "Digitised by DaSSCo for Natural History Museum Denmark". This was approved by Kim and Nikolaj today by email.