NHMDenmark / Mass-Digitizer

Common repo for the DaSSCo team
Apache License 2.0

Early Digi app exports/imports without author name: backfill #460

Open jlegind opened 6 months ago

jlegind commented 6 months ago

What is the issue

The app will be updated so that the vascular plants taxon spine includes author names. However, the specimen records already imported do not have an author associated with the taxon name. We have to retroactively add the author name to records where that name is available in the specimen image. There are approximately 15,000 vascular plant records in Specify as of 19-12-2023, covering the period from May 2023 to 19 December 2023. These records will eventually all have an image associated.

Description

The imported specimen records ought to have the author name along with the taxon name if available. The author is often written after the taxon name on the label and might thus be retrieved from the image associated with the specimen. Unless we have a pipeline that can read the author name, this will require a lot of manual work. The images can be identified by the associated barcode, but that barcode has to be read by a pipeline. When that pipeline becomes operational, we can go from the specimen record straight to the image and look for the author name so that it can be added. Waiting for the pipeline to be ready is the only feasible way forward.

Why is it needed/relevant?

There is the issue of homonyms and disambiguation which author names can help solve.

Estimate level of effort required.

Hard

How to approach it?

It would be preferable for the pipeline to identify the barcodes in the images and then rename those images with the barcode. This enables us to go specimen by specimen. If a pipeline can read the authors, or at least some of the authors, that would be tremendously helpful. Reading handwritten author names and adding them to the taxonomic name is a separate issue.

FedorSteeman commented 5 months ago

This will need to be fixed eventually, but not necessarily right now as part of the sprint towards v1.2. Will draft these and other issues with @bhsi-snm

jlegind commented 5 months ago

Suggested Solution:

A solution could be a spreadsheet with a dropdown for each imported name that has more than one authored name. The dropdown list contains the authors applicable to that name. The process would be:

1. Query for the records that were already imported into Vascular plants in Specify.
2. Match those names to the new taxonomy where author names exist.
3. Those names existing more than twice with different author names are separated into a file.
4. Create a spreadsheet with a drop list for each name containing the alternatives for that name ("Aa filamentosa M.L.Ortiz, 1937", or "Aa filamentosa Mansf.").

The solution will be developed in Python using the xlsxwriter package.
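The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual utility: the function names `ambiguous_names` and `write_dropdown_sheet`, the sample rows, the helper sheet called `options`, and the "more than one authored variant" threshold are all hypothetical. The xlsxwriter calls used (`Workbook`, `add_worksheet`, `write`, `data_validation`, `xl_range_abs`) are existing API. xlsxwriter is imported lazily so the matching step runs without it.

```python
from collections import defaultdict


def ambiguous_names(imported_names, authored_taxa):
    """Return {name: sorted authored variants} for imported names that
    have more than one authored variant (threshold is an assumption)."""
    variants = defaultdict(set)
    for name, authored in authored_taxa:
        variants[name].add(authored)
    return {
        name: sorted(variants[name])
        for name in sorted(set(imported_names))
        if len(variants.get(name, ())) > 1
    }


def write_dropdown_sheet(path, choices):
    """Write one row per ambiguous name with a dropdown in column B."""
    import xlsxwriter  # third-party; only needed when writing the file
    from xlsxwriter.utility import xl_range_abs

    workbook = xlsxwriter.Workbook(path)
    sheet = workbook.add_worksheet("names")
    options = workbook.add_worksheet("options")  # hypothetical helper sheet
    for row, (name, authored) in enumerate(choices.items()):
        sheet.write(row, 0, name)
        # Alternatives are stored in cells and referenced as a range,
        # because an inline Excel list is comma-separated and would split
        # a value such as "Aa filamentosa M.L.Ortiz, 1937".
        for col, value in enumerate(authored):
            options.write(row, col, value)
        source = "=options!" + xl_range_abs(row, 0, row, len(authored) - 1)
        sheet.data_validation(
            row, 1, row, 1, {"validate": "list", "source": source}
        )
    workbook.close()


# Illustrative data only; real input would come from Specify and the taxon spine.
imported = ["Aa filamentosa", "Abies alba", "Aa filamentosa"]
spine = [
    ("Aa filamentosa", "Aa filamentosa M.L.Ortiz, 1937"),
    ("Aa filamentosa", "Aa filamentosa Mansf."),
    ("Abies alba", "Abies alba Mill."),
]
choices = ambiguous_names(imported, spine)
```

Calling `write_dropdown_sheet("authorDropdown.xlsx", choices)` would then produce the workbook. Names with a single authored variant are left out because they need no manual decision.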

FedorSteeman commented 5 months ago

Who has the ball on this issue?

jlegind commented 5 months ago

There is a utility ready now. It is an Excel sheet with names, storage locations and dropdowns for each name. I am submitting this to Bhupjit and Pip for review.

jlegind commented 5 months ago

SUPERSEDED BY https://github.com/NHMDenmark/Mass-Digitizer/blob/main/Author_backfill/read_me.md

Workflow on producing the spreadsheet:

author_util_mapping Make sure that 'project' is set to DaSSCo

FedorSteeman commented 5 months ago

I discussed with @jlegind that we'd also need to include taxon table primary key, i.e. taxon ID, so the results can be easily used to update the taxon records with the assigned author name. In doing so, we may need to be wary of any cases where the same taxon record is assigned different author names depending on the specimen in question.
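A minimal sketch of how such conflicts might be flagged before any update, assuming the filled-in sheet can be read as (taxon ID, chosen author) pairs. The function name `split_assignments`, the taxon IDs and the sample rows are hypothetical and only illustrate the check.

```python
from collections import defaultdict


def split_assignments(rows):
    """Split (taxon_id, chosen_author) rows into safe updates and conflicts.

    updates:   taxon IDs where every filled-in row agrees on one author.
    conflicts: taxon IDs that were given different authors on different
               specimens and therefore need manual review.
    """
    chosen = defaultdict(set)
    for taxon_id, author in rows:
        if author:  # skip rows where no author was selected
            chosen[taxon_id].add(author)
    updates = {tid: next(iter(a)) for tid, a in chosen.items() if len(a) == 1}
    conflicts = {tid: sorted(a) for tid, a in chosen.items() if len(a) > 1}
    return updates, conflicts


# Illustrative rows only; the taxon IDs are made up.
rows = [
    (101, "Mansf."),
    (101, "M.L.Ortiz, 1937"),
    (202, "Mill."),
    (202, "Mill."),
    (303, ""),
]
updates, conflicts = split_assignments(rows)
```

Only the entries in `updates` could be written back to the taxon table directly; the entries in `conflicts` probably indicate that the specimens belong to different taxon records.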

jlegind commented 4 months ago

Taxon identifiers have been added to the spreadsheet utility (authorDropdown.xlsx). @PipBrewer @bhsi-snm Feel free to test the spreadsheet.

jlegind commented 4 months ago

How about we find time to present this to the digitizers?

chelseagraham commented 4 months ago

The digitizers have begun filling in this sheet. The corresponding GitHub ticket for the Herbarium Sheet Digitization project is https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/74 :)

RebekkaML commented 2 months ago

The author name sheet has been filled and https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/74 has been closed. There are, however, some remaining issues that need to be fixed before this can be imported into Specify again.

Those issues are:

- https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/93
- https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/90
- https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/92
- https://github.com/NHMDenmark/Herbarium-Sheets-workstation/issues/91